:On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
:: that 75% of the interest in our project has nothing to do with my
:: project goals but instead are directly associated with work being done
:: by our relatively small community. I truly appreciate that effort
:: because it allows me to focus on the part that is most near and dear
:: to my own heart.
:
:Big question: after all the work that will go into the clustering, other than
:scientific research, what will the average user be able to use such advanced
:capability for?
:
:Jonathon McKitrick
I held off answering because I became quite interested in what others
thought the clustering would be used for.
Let's take a big, big step back and look at what the clustering means
from a practical standpoint.
There are really two situations involved here. First, we certainly
can allow you to say 'hey, I am going to take down machine A for
maintenance', giving the kernel the time to migrate all
resources off of machine A.
But being able to flip the power switch on machine A without warning,
or otherwise have a machine fail unexpectedly, is another ball of wax
entirely. There are only a few ways to cope with such an event:
(1) Processes with inaccessible data are killed. High level programs
such as 'make' would have to be made aware of this possibility,
process the correct error code, and restart the killed children
(e.g. compiles and such).
In this scenario, only a few programs would have to be made aware
of this type of failure in order to reap large benefits from a
big cluster, such as the ability to do massively parallel
compiles or graphics or other restartable things.
(2) You take a snapshot every once in a while and if a process fails
on one machine you recover an earlier version of it on another
(including rolling back any file modifications that were made).
(3) You run the cpu context in tandem on multiple machines so if one
machine fails another can take over without a break. This is
really an extension of the rollback mechanism, but with additional
requirements and it is particularly difficult to accomplish with
a threaded program where there may be direct memory interactions
between threads.
Tandem operation is possible with non-threaded programs but all
I/O interactions would have to be synchronization points (and thus
performance would suffer). Threaded programs would have to be
aware of the tandem operation, or else we make writing to memory
a synchronization point too (and even then I am not convinced it
is possible to keep two wholly duplicate copies of the program
operating in tandem).
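
To make the cost of approach (2) concrete, here is a toy sketch of snapshot-and-rollback, written in Python purely for illustration. Everything in it is an assumption and not a description of how the kernel would do it: the job state is just a dict, NodeFailure is a hypothetical stand-in for losing a machine, and rolling back file modifications is not modeled at all.

```python
import copy


class NodeFailure(Exception):
    """Hypothetical stand-in for losing the machine a job runs on."""


def run_with_snapshots(state, step, total_steps, snapshot_every, fail_at=()):
    """Advance state through total_steps, snapshotting periodically.

    fail_at lists step numbers at which one failure is simulated.
    Returns (final_state, steps_executed); steps_executed includes
    any work that had to be redone after a rollback.
    """
    pending_failures = set(fail_at)
    snapshot = (0, copy.deepcopy(state))  # (next step index, saved state)
    i = 0
    executed = 0
    while i < total_steps:
        try:
            if i in pending_failures:
                pending_failures.discard(i)  # each failure happens only once
                raise NodeFailure(i)
            step(state, i)
            executed += 1
            i += 1
            if i % snapshot_every == 0:
                snapshot = (i, copy.deepcopy(state))
        except NodeFailure:
            # Recover the earlier version "on another machine": everything
            # done since the last snapshot is lost and must be redone.
            i, saved = snapshot
            state = copy.deepcopy(saved)
    return state, executed
```

The work done between the last snapshot and the failure is simply repeated, which is one source of the performance hit: snapshot more often and you pay on every run, snapshot less often and you pay more on each failure.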
Needless to say, a fully redundant system is very, very complex. My
2-year goal is NOT to achieve #3. It is to achieve #1 and also have the
ability to say 'hey, I'm taking machine BLAH down for maintenance,
migrate all the running contexts and related resources off of it please'.
Achieving #2 or #3 in a fully transparent fashion is more like a
5-year project, and you would take a very large performance hit in
order to achieve it.
But let's consider #1... consider the things you actually might want to
accomplish with a cluster. Large simulations, huge builds, or simply
providing resources to other projects that want to do large simulations
or huge builds.
Only a few programs like 'make' or the window manager have to actually
be aware of the failure case in order to be able to restart the killed
programs and make a cluster useful to a very large class of work product.
Even programs like sendmail and other services can operate fairly well
in such an environment.
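
As a rough illustration of what "being aware of the failure case" could look like for a make-like driver, here is a minimal Python sketch. It assumes, hypothetically, that a child which loses its data shows up to the parent as killed by a signal; the actual error code the cluster would report is not specified above, and run_restartable is a made-up name.

```python
import subprocess
import sys


def run_restartable(cmd, max_attempts=3):
    """Run cmd, rerunning it if the child was killed by a signal.

    Returns (returncode, attempts). subprocess reports a negative
    returncode when the child died from a signal; that is treated here
    as the (assumed) "node went away" case and the job is restarted.
    An ordinary nonzero exit is a real failure and is not retried.
    """
    rc = None
    for attempt in range(1, max_attempts + 1):
        rc = subprocess.run(cmd).returncode
        if rc >= 0:
            return rc, attempt
    return rc, max_attempts


if __name__ == "__main__":
    # Usage: python restart.py cc -c foo.c
    code, tries = run_restartable(sys.argv[1:])
    sys.exit(code if code >= 0 else 1)
```

The point is that the retry logic lives in one driver program; the compilers and other children it launches need no changes as long as they are restartable.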
So what can the average user do?
* The average user can support a third party project by providing
cpu, memory, and storage resources to that project.
(clearly there are security issues involved, but even so there is
a large class of problems that can be addressed).
* The average user wants to leverage the cpu and memory resources
of all his networked machines for things like builds (buildworld,
pkg builds, etc)... batch operations which can be restarted if a
failure occurs.
So, consider, the average user has his desktop, and most processes
are running locally, but he also has other machines and they tie
into a named cluster based on the desktop. The cluster would
'see' the desktop's filesystems but otherwise operate as a separate
system. The average user would then be able to log in to the
'cluster' and run things that then take advantage of all the machines'
resources.
* The average user might be part of a large project that has access to
a cluster.
The average user would then be able to tie into the cluster, see the
cluster's resources, and do things using the cluster that he could
otherwise not do on his personal box.
Clearly there are security issues here as well, but there is nothing
preventing us from having a trusted cluster allow untrusted tie-ins
which are locked to a particular user id... where the goal is not
necessarily to prevent the cluster from being DOSed but to prevent
trusted data from being compromisable by an untrusted source.
* The average user might want to tie into storage reserved for him in
a third party cluster, for the purposes of doing backups or other
things.
I'm sure I can think of other things, but that's the gist from the
point of view of the 'average user'.
-Matt
Matthew Dillon
<dillon@xxxxxxxxxxxxx>