From Jim Kinney in the comp.unix.questions newsgroup on 22 Jul 1998
I am starting the research into the design and implementation of a 3 node cluster to provide high availability web, database, and support services to a computer based physics lab. As envisioned, the primary interface machine will be the web server. The database that provides the dynamic web pages will be on a separate machine. Some other processes that accept input from the web process and output to the database will be on the third machine.
Have you looked at the "High Availability HOWTO"?
There's also the common "round robin DNS" model --- which is already used by many service providers. It has its limitations --- but it's the first thing to try if the clients can be configured to gracefully retry transactions on failure.
There's also the MOSIX project which was developed under BSD/OS and is allegedly being ported to Linux. This provides for process migration (again, more of a performance clustering and load balancing feature set).
However, there is another concept called "checkpointing." You can think of this as having regular, transparent, non-terminal "core dumps" (snapshots) taken of each process (or process group). These are written to disk and can be reloaded and restarted at the point where they left off. I'm not aware of any projects to provide checkpointing to Linux (or checkpointing subsystems). (Obviously any application can do its own checkpointing in a non-transparent fashion --- roughly equivalent to the periodic automatic saves performed by 'emacs' and other editors).
I have a pointer to some miscellaneous notes on checkpointing:
The implication here is that you could create a hybrid checkpointing and process migration model that would provide high availability. In a client/server context this would probably only be suitable for situations where the communications protocols were very robust --- and it might still require some IP and/or MAC address assumption or some specialized routing tricks.
One such routing trick might be the IP NAT project. IP masquerading is one form of NAT (allowing many clients to masquerade as a single proxy system).
Another form of NAT is many-to-many. Let's say you connected two disconnected sites that both chose 10.1.*.* addresses for their use --- you could put a NAT router between them that would bidirectionally translate the 10.1.*.* to corresponding 172.16.*.* addresses. Thus the two sites would be able to interoperate over a broader range of protocols than would be the case for IP masqurading --- since the TCP/UDP ports would not be re-written --- each 10.1.*.* address corresponds on a one-to-one basis with a 172.16.*.* counterpart.
The one other form of NAT is one-to-many (or "load balancing"). This makes one simple router look like a server. In actuality that "server" is just dispatching the packets it receives to any one of the backend servers it chooses (statistically or based on metrics that they communicate amongst themselves, privately). Cisco has a product called "Local Director" that does exactly this. One of the experimental versions of the Linux IP NAT code also appeared to do this with some success. I don't know if any further work as progressed on these lines.
Yet another approach that might make sense is to provide for replication of the data (files) across servers and to use protocols that transparent select among available servers (mirrors). This sounds just like CODA.
A less sophisticated approach to replication is to use the rsync package to maintain some failover servers (mirrors) --- and require that writes all go to one active server.
So, I am open to suggestions, comments, info, links to sites, book titles, etc. I have proposed a one year development time for the whole cluster, with a single machine application prototype of the user visible/used portion by around the Jan 1999. I love my job!
Jim Kinney M.S.
Educational Technology Specialist
Department of Physics
Web, mail, DNS and a number of other Internet services are naturally robust. With DNS you normally list up to three servers per host (in the /etc/resolv.conf) and all of these will be checked before a name lookup will fail. With SMTP the client will try each of the hosts listed in the results of an MX query. Round robin DNS will force most clients to try multiple different IP addresses on failure most of the time.
However, the applications that really need HA (fail over) and clustering for performance are things like db servers.
Having two systems monitor and process something like a set of db transactions in parallel (one active the other "mimic'ing" the first but not returning results) would be very interesting. The "mimic" would attempt to maintain the same applications state as the server --- and would assume the server's IP and MAC (ethernet media access control) addresses on failure --- to then transparently continue the transaction processing that was going on.
You might prototype such a system using web and ftp (the FTP application is a more dramatic demonstration --- since a web server involves many short transactions and mostly operates in a "disconnected" fashion).
One approach might be to have a custom ethernet driver that can be instructed to throw all of its output into the bit bucket. Thus the mimic is normally silent, but following a failure on the server it does the address assumption and rips off the muzzle. I suspect you'd have to have another interface between the two servers, one which is dedicated to maintaining the same state between the server and the mimic.
(For example if the server get a collision or an error that wasn't sensed by the mimic -- or vice versa -- the two might get horribly out of sync when the upper layer protocols require a resend. With special drivers the two systems might resolve these discrepancies at the kernel/driver layer --- so that the applications will always get the same data on their sockets).
I really have no idea how much tweaking this would take and whether or not it's even feasible.
However, it seems that your intent is to provide failover that is transparent to the applications layer. So, the work obviously has to happen below that.
It is unclear whether you are primarily interested in deploying a set of servers for use by your Physics team or whether you are interested in doing research and development in the computer science.
In any event your project will probably involve a hybrid of several of these approaches:
It would be very interesting to see someone develop process migration and checkpointing features for Linux though there doesn't seem to be any active work going on now.
I'd also love to see an "Beowulf enabled" SQL dbserver (where a couple of failover capable "dispatchers" could farm out transactions to multiple clustered Linux boxes in some sensible manner). I'm not even sure if that's feasible --- but it sure would knock down the scaleability walls that I hear about from those dbadmins.