From Brad Alexander on 30 May 1998
This isn't Linux-specific, but I'm having a problem and I'm hoping you can help me come up with a workaround that isn't going to cost a lot of money.
I have an Intel P-100 on an Amptron AM-7900 board with 64MB of EDO RAM (2 32MB sticks), a gob of hard drives (a 2.2GB Quantum Fireball IDE and a FutureDomain SCSI controller with a 420MB Conner, a 1GB Seagate, a 1GB Micropolis and a 1GB Quantum Empire), a Diamond Stealth 64 with 2MB DRAM, and a SoundBlaster 16 Plug'n'Pray.
I'm running a heavily modified RedHat 5.0 machine with an 800MB DOS partition on /dev/hda1 and a 200MB win95 partition on /dev/hda3 (Linux's /+/usr is on /dev/hda2).
I have been seeing system lockups for quite a while now. I noticed them initially when running xlock in random mode, then noticed that I was also starting to have problems with some of my DOS apps, like Jane's Longbow and Duke Nukem locking up. Under Linux, I settled on using xlock in galaxy mode, and the lockups dropped to every couple of weeks. (Note that during this time, I upgraded memory from four 8MB sticks to the two 32MB sticks.)
Everything went all right until I upgraded to RedHat 5.0, with XFree86 3.3.1. The lockups increased to about every 2 days. Once I upgraded to XFree86 3.3.2, they dropped back down to about once a week.
I'm basically using you as a sounding board to see if I might have missed something. I'm thinking it's hardware, but where? The Stealth? The lockups seem to occur during graphics app use: xlock, or the GIMP. The motherboard? The chip? What can I start replacing without sinking a whole bunch of money into it?
Thanks in advance,
Well, the first thought would be to try a different video card. I don't have too much confidence that the problem is truly related to the video card's activity --- so it's just a diagnostic starting point.
To see if this really is related to graphics, boot the system in text mode (don't run X; change your runlevel or initdefault to one of the non-xdm modes if necessary). Then run a couple of kernel builds on it (that's usually a pretty good stress test). Try 'make -j' to work it harder.
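For instance, something like the following (assuming the kernel sources are installed under /usr/src/linux, which is where Red Hat puts them --- adjust the path to suit):

```shell
# Rebuild the kernel as a stress test.  An unlimited 'make -j' spawns
# as many parallel compiles as it can -- hard on CPU, RAM, and disk,
# which is exactly what we want for shaking out flaky hardware.
cd /usr/src/linux
make dep clean
make -j zImage
```

If the machine locks up during this with X nowhere in sight, you can stop suspecting the video card.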
It would also be helpful to know what sort of lockup you're getting. It may be that you could still log in via a serial port (using a null modem cable and a laptop, or any other nearby computer or terminal). To do this, simply add a line like the following to your /etc/inittab. This should allow you to use one of your serial lines to log in. It is possible for the X Window System and the console to be dead while the kernel and other processes are still up and running. Another test is to ping the machine from another system (if you have an ethernet LAN connected to it). Even if telnet doesn't work, you want to ping it to see if the kernel is still responding.
t1:23:respawn:/sbin/agetty -L 38400,19200,9600,2400,1200 ttyS1 vt100
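After editing /etc/inittab, you can have init re-read it so the new getty starts immediately, with no reboot (this is standard SysV init behavior):

```shell
# 'q' tells init to re-examine /etc/inittab and start the new getty
telinit q
```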
It's also probably worth trying the software watchdog timer code in the newer kernels. This lets you load a kernel module that emulates a hardware watchdog timer card. These WDT devices are basically a "dead man's switch" for your system: if the timer isn't periodically updated by the kernel (or by some other thread in the kernel, in the case of the emulated WDT), the WDT triggers a system reset.
Obviously a software emulation of this isn't quite as reliable as a hardware WDT --- since a completely hung kernel will never get around to calling on that module's thread of execution. However, it isn't too unlikely that the hang is in some specific kernel thread and that some other thread continues to execute after other parts have died.
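A minimal sketch of trying this, assuming the softdog module and the /dev/watchdog device are built for your kernel (the 60-second margin and the crude shell "petting" loop here are just illustrative --- a real setup would usually run a proper watchdog daemon):

```shell
# Load the software watchdog: once /dev/watchdog has been opened, if
# nothing writes to it for soft_margin seconds the kernel reboots.
modprobe softdog soft_margin=60

# Keep the device open and "pet" it every 30 seconds; if this loop dies
# along with whatever hung the system, the reboot fires on its own.
( while sleep 30; do echo . ; done > /dev/watchdog ) &
```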
Frankly, I'm not sure what the difference is between the kernel watchdog emulation code and the boot "panic=" parameter. But that's definitely another thing to try (just add something like panic=60 to your lilo "append=" directive, or type it manually at the boot prompt). I guess the difference is that there may be some conditions under which the kernel could get into a comatose or unresponsive state without panic'ing (if it got tricked into some really long timeout wait or something). The panic= option forces the Linux kernel to reboot after a "panic" (a critical error condition detected by the kernel, usually a corrupted table that fails its consistency and integrity checks).
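A lilo.conf stanza with that parameter might look like this (the image path and label are assumptions; the root device matches the /dev/hda2 layout you described):

```
# /etc/lilo.conf fragment -- reboot 60 seconds after a kernel panic
image=/boot/vmlinuz
    label=linux
    root=/dev/hda2        # your / + /usr partition
    append="panic=60"
# remember to re-run /sbin/lilo after editing this file
```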
Normally the kernel would just display a "panic" message and sit there waiting for human intervention. These are very rare, other than the old "VFS kernel panic, unable to mount root" that occurs when you have your kernel misconfigured for your arrangement of hard drives --- or when you change the hardware settings of your disk drives without updating your kernel (with the 'rdev' command to set the root device flags) and/or without updating your LILO or LOADLIN commands (which are usually used to pass these flags to your kernel to over-ride the compiled-in defaults).
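For reference, inspecting and setting the compiled-in root device flag on a kernel image looks something like this (the image path is an assumption):

```shell
# show the root device currently compiled into the kernel image
rdev /boot/vmlinuz

# set it to the Linux root partition (here, /dev/hda2 as in your layout)
rdev /boot/vmlinuz /dev/hda2
```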
Other than that common case I think I've only seen one or two Linux kernel panics in the last 6 years. I've only had about a half dozen unexplained system lockups over that period --- and that's on about fifty Linux machines that I've managed during various portions of that time. These lockups might have been panics in situations that were so bad the kernel couldn't even display an error message; there's no way to know.
I've only had to reboot unresponsive Linux boxes about a dozen or so times in all the years I've used it. This was only a problem in the late .99 and early 1.0x kernels when I was running a very busy FTP/Web server that was simply overloaded -- the TCP/IP stack would get so congested that the system would timeout between my login name and password --- at the console (I'd've loved a working SAK --- secure attention key back then). I was glad to see the major TCP/IP re-write in between 1.2 and 2.x.
I'm not trying to toot Linux's own horn here --- (well, maybe a little). The point is that I don't get panics and lockups often enough to see how the panic= parameter and the softdog/watchdog code would work in those situations.
However, if you enable the panic= and/or the softdog kernel option, you may see that the machine reboots within a minute or two of your lockup (wait ten or fifteen to be sure). This tells you that some part of the kernel was still running (and that the hardware isn't completely wigged out).
Beyond that, the thing to do is to take out all non-essential hardware (the sound card would be a great choice --- and the SCSI card, since you mention that your Linux partitions are on the IDE drive). As with most technical computing issues, it eventually boils down to a matter of cost. You mentioned a couple of times that you don't want to spend money on solving this problem. Ultimately the time you spend fighting with it translates to money --- and you'll eventually have to ask what your time is worth.
(The deeper part of this question is that you may find that your home machine isn't worth the time or the money, and you may content yourself to just use any machines that you encounter at work, or whatever. Strange as that sounds, I've had friends who refuse to keep a computer around the house specifically because they "spend enough time with them at work" and feel that "home is for family time".)
At the same time I don't recommend throwing replacement components at the problem without understanding the nature of the problem. However, it may be that the best solution is to replace the motherboard and/or the video card and/or the RAM.
Troubleshooting computers is difficult work. Whole books have been devoted to the subject (I like the Winn L. Rosch Hardware Bible personally --- read it years ago and should probably get an updated copy). There are also parts of the process that can't be gained from any book --- that you must learn by experience and figure out through some combination of analysis and intuition. As our computers become more sophisticated, the balance seems to lean more toward intuition.