The Answer Guy 34:
Automated Recovery from System Failures
"Linux Gazette...making Linux just a little more fun!"
From anonymous on the L.U.S.T List on 2 Sep 1998
And there will be no human to manually
check on the partitions after a power failure.
What's wrong with e2fsck? TTYL!
I was thinking about this recently and I came upon an interesting
idea. (I think a friend of mine used the following trick in a
commercial product he built around Linux.)
The trick is to install two root filesystems (preferably on different
drives -- possibly even on different controllers). One of them is the
"Rescue Root"; the other is the "Production Root." You then configure
the "rescue root" partition as the default LILO boot image and modify the
shutdown sequence to override that default, for the next boot only, with
an /sbin/lilo -R command.
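As a minimal sketch of the shutdown-side hook, assuming the production image carries the hypothetical LILO label "production" and that the rescue image is the configured default: the function below only builds the one-shot command unless it is handed something to run it with. The function names are illustrative, not part of any existing tool.

```python
LILO = "/sbin/lilo"  # path named in the text above


def one_shot_boot_command(label):
    """Return the argv asking LILO to boot `label` on the next boot only."""
    if not label or any(ch.isspace() for ch in label):
        raise ValueError("image label must be a single non-empty word")
    return [LILO, "-R", label]


def arm_production_boot(label="production", runner=None):
    """Intended to be called at the very end of a clean shutdown.

    With runner=None this is a dry run and only returns the command.
    Pass e.g. subprocess.check_call to really invoke lilo (needs root).
    The label "production" is an assumption made for this sketch.
    """
    argv = one_shot_boot_command(label)
    if runner is not None:
        runner(argv)
    return argv
```

Because the -R setting is consumed by the next boot, a crash or power failure leaves nothing armed and the machine falls back to the default rescue image.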
If the system boots from the rescue root, it is because the previous
session ended irregularly: the standard shutdown sequence was not
run. The rescue root can then run various diagnostics on the
production root and other filesystems. If necessary it can re-create
a filesystem (newfs, or mke2fs on Linux) and restore the full production
environment from another, normally unused, directory, partition, or
drive. The design of the rescue root is a matter for some consideration
and research.
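One way the rescue root might decide between "leave it alone," "it was repaired," and "rebuild and restore" is to look at the exit status of e2fsck, which is a sum of documented condition bits. The sketch below is only a policy illustration: the three action names and the choice to treat any incomplete check as grounds for a restore are assumptions, not something the original design specifies.

```python
# Exit-status bits as documented in the e2fsck(8) manual page.
ERRORS_CORRECTED = 1
REBOOT_NEEDED = 2
ERRORS_UNCORRECTED = 4
OPERATIONAL_ERROR = 8
USAGE_ERROR = 16
CANCELLED = 32
LIBRARY_ERROR = 128

# Any of these means the filesystem cannot be trusted after the check.
UNTRUSTED = (ERRORS_UNCORRECTED | OPERATIONAL_ERROR | USAGE_ERROR
             | CANCELLED | LIBRARY_ERROR)


def classify_fsck(status):
    """Map an e2fsck exit status to 'clean', 'repaired', or 'restore'."""
    if status & UNTRUSTED:
        return "restore"   # assumed policy: rebuild and restore from backup
    if status & (ERRORS_CORRECTED | REBOOT_NEEDED):
        return "repaired"
    return "clean"


def plan_recovery(results):
    """results maps a device name to its e2fsck exit status."""
    return {device: classify_fsck(status) for device, status in results.items()}
```

For example, plan_recovery({"/dev/hda2": 1, "/dev/hda5": 4}) marks the first device as repaired and the second for restore; the device names are placeholders.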
Normally the system will boot into "production" mode. Periodically
it can mount the alternative root fs to do filesystem checks and/or
an extra filesystem to do backups (of changes to the configuration
files). You can ensure that these configuration backups are done
under a version control system so that degenerative sets of changes
can be automatically backed out in an orderly fashion.
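The "orderly back-out" could look roughly like this, assuming each checked-in configuration revision is tagged with whether the system passed its health checks afterwards. The revision numbers and the (revision, healthy) record format are hypothetical; the actual version control system is left open by the text.

```python
def last_good_revision(history):
    """history: list of (revision, healthy) pairs, oldest first.

    Returns the newest revision marked healthy, or None if there is none.
    """
    for revision, healthy in reversed(history):
        if healthy:
            return revision
    return None


def revisions_to_back_out(history):
    """Return the unhealthy revisions newer than the last good one,
    newest first, which is the order in which to undo them."""
    pending = []
    for revision, healthy in reversed(history):
        if healthy:
            break
        pending.append(revision)
    return pending
```

With a history of 1.1 and 1.2 healthy followed by 1.3 and 1.4 unhealthy, the sketch backs out 1.4 and then 1.3, landing on 1.2.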
If you combine this with a watchdog timer card and a set of appropriate
system monitoring daemons (which all talk to a dispatcher that periodically
resets the watchdog timer), you should have a system with about as
bulletproof an autorecovery as is possible on PC equipment.
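A rough sketch of the dispatcher idea: monitoring daemons report in, and the watchdog is reset only while every one of them has reported recently, so a single hung monitor lets the timer expire and the card reboot the box. The class name, the monitor names, and the `pet` callback are all hypothetical; on a real system `pet` would be whatever resets the particular watchdog card.

```python
class Dispatcher:
    """Resets the watchdog only while every monitor keeps reporting in."""

    def __init__(self, monitors, max_silence):
        self.max_silence = max_silence            # seconds a monitor may stay quiet
        self.last_seen = {name: None for name in monitors}

    def report_ok(self, name, now):
        """Called by a monitoring daemon when its check passes."""
        if name not in self.last_seen:
            raise KeyError("unknown monitor: %s" % name)
        self.last_seen[name] = now

    def all_healthy(self, now):
        return all(seen is not None and now - seen <= self.max_silence
                   for seen in self.last_seen.values())

    def tick(self, now, pet):
        """Run periodically; calls pet() only if all monitors are fresh."""
        if self.all_healthy(now):
            pet()
            return True
        return False   # stay silent and let the watchdog timer run out
```

Passing the current time in explicitly, rather than reading the clock inside the class, keeps the logic easy to exercise without real hardware.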
I should note that I haven't prototyped such a system yet. I've
just thought of it. A friend of mine also suggested that we devise
a way to have another proximate system also doing monitoring
(possibly via a null modem). He says he knows how to make a special
cable which would plug into the guard dog's printer/parallel port
(guard dog is what I've been calling the hypothetical proximal
system) and would be run into the case of the system we're monitoring,
where it would be fitted over the reset pins. This, with a small driver,
should be able to strobe the reset line.
(In fact I joked that we could create a really special cable that would
daisy chain to as many as eight other systems and allow independent
reboot of any of them).
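Purely as an illustration of what the "small driver" might compute, assume (and this is only an assumption about a cable nobody has built yet) that each of the eight parallel-port data lines is wired to one machine's reset pins and that driving the line high asserts reset. The actual port I/O is left to a hypothetical `write_byte` callback, and the hold time is a guess.

```python
def reset_sequence(system, idle=0x00):
    """Data-register values that pulse the line for `system` (0 to 7):
    first the asserted value, then the idle value that releases it."""
    if not 0 <= system <= 7:
        raise ValueError("the port has only eight data lines (0-7)")
    return [idle | (1 << system), idle]


def strobe_reset(system, write_byte, sleep, hold_seconds=0.25):
    """Pulse one machine's reset line.

    write_byte(value) and sleep(seconds) are supplied by the caller;
    hold_seconds is a hypothetical pulse width, not a measured one.
    """
    asserted, released = reset_sequence(system)
    write_byte(asserted)
    sleep(hold_seconds)
    write_byte(released)
```

Since each machine gets its own bit, one guard dog could reset any of the eight independently, which is all the daisy-chain joke really requires.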
In any event the monitor system would presumably monitor some/most
of the same things as the watchdog timer; so I don't know what
benefit it would ultimately offer (unless it was prepared to
do or initiate failover to another standby system).
Perhaps this idea might be of interest to the maintainer of the
High-Availability HOWTO (Harald Milz -- whom I've blind copied
on this message). It's not really "High Availability" but
"Automated Recovery" which might be sufficiently close for many
applications. (i.e. if a web, mail, dns, or ftp server's downtime
can be reduced from "mean hours per incident" to "mean minutes per
incident" most sysadmins still get lots of points).
Automated Recovery from System Failures
From R P Herrold on 04 Sep 1998
We build custom Linux solution boxen. In our Build outline, we
take this concept a step further in setting up a Red Hat system --
we carry a spare /boot partition:
(extract)
(base 5.0 install)

Part  name        size  Cyl   cume  actual min
====  ==========  ====  ====  ====  ==========
1     /boot         20  ___     20
2     root          30  ___     50  23
        (/bin  ___ M)
        (/lib  ___ M) modules
        (/root ___ M)
        (/sbin ___ M)
3     swap          30  ___     80
4     (extended)
5     /mnt/spare    30  ___    110  1
... The minima in a 'stripped down' / [root] partition vary
depending on where /lib, /var, and /usr end up -- of late, a lot of
distributions' packages feel a need to live in /bin or /sbin
unnecessarily -- and probably should be in the /usr tree ... Likewise,
if a package is NOT statically linked, one can end up with problems
if a partition randomly decides to 'go south.'
I was thinking about this recently and I came upon an interesting
idea. (I think a friend of mine used the following trick in a
commercial product he built around Linux.)
... We use the 'trick' as well
The trick is to install two root filesystems (preferably on different
drives -- possibly even on different controllers). One of them is the
"Rescue Root"; the other is the "Production Root." You then configure
the "rescue root" partition as the default LILO boot image and modify the
shutdown sequence to override that default, for the next boot only, with
an /sbin/lilo -R command.
... carrying the full [root] partition
I should note that I haven't prototyped such a system yet. I've
just thought of it. A friend of mine also suggested that we devise
... It works, and can avoid the need to keep a live floppy drive
in a host which would otherwise require one for emergency purposes
... aiding in avoiding physical security issues
[ normally I remove sig blocks, but since he copyrighted
his... I guess I'll leave it in. Curious one should
post a copyright into open mailing lists, though.
-- Heather ]
.-- -... ---.. ... -.- -.--
Copyright (C) 1998 R P Herrold
herrold@usa.net NIC: RPH5 (US)
My words are not deathless prose,
but they are mine.
Owl River Company 614 - 221 - 0695
"The World is Open to Linux (tm)"
... Open Source LINUX solutions ...
info@owlriver.com
Copyright © 1998, James T. Dennis
Published in Linux Gazette Issue 34 November 1998