Recently, this server (just to remind you: an ancient quad Pentium Pro machine with SCSI storage and FPM DRAM) experienced a 1½ hour downtime due to a KERNEL_STACK_INPAGE_ERROR bluescreen, stop code 0x00000077. Yeah yeah, I’m dreaming about running OpenBSD on XIN.at, but it’s still the same old Windows server system. Bites, but it’s hard to give up and/or migrate certain pieces of software. In any case, what exactly does that stop code mean? In essence, it means that a page of kernel data that had been paged out could not be read back into memory, or came back corrupted. So, heh?
More clearly, the cause is either a DRAM error or a disk error, as not the entire paged pool needs to actually be paged to disk at any given time. The paged pool is swappable to disk, but not necessarily swapped to disk. So we need to dig a bit deeper. Since this server has 2-error correction and 3-error reporting capability for its memory, thanks to IBM combining the parity FPM DRAM with additional ECC chips, we can look for ECC/parity error reports in the server’s system log. Also, disk errors should be pretty apparent in the log. And look what we’ve got here (the actual error messages are German even though the log is being displayed on a remote, English system; well, the server itself is running a German OS):
Actually, when grouping the error log by disk events, I get this:
8 unrecoverable dead sectors and 46 controller errors, starting in March 2015 and nothing before that date. Now the actual meaning of a “controller error” isn’t quite clear. In the case of SCSI hardware like here, it could be many things, from firmware issues to cabling problems all the way to wrong SCSI bus termination. Judging from the sporadic nature and the limited time window of the errors, however, I guess it’s really failing electronics in the drive. The problems started roughly 10 years after that drive was manufactured, and it’s a 68-pin 10,000 rpm Seagate Cheetah drive with 36 GB capacity, by the way.
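The grouping by disk events boils down to a simple tally per event ID. Here is a minimal sketch in Python, assuming the log rows have already been exported as (timestamp, source, event ID) tuples. That record shape, the function name `summarize_disk_events` and the sample rows are all made up for illustration, and mapping event ID 7 to the dead sectors is my assumption; only ID 11 as the controller error comes from the log discussed here.

```python
from collections import Counter
from datetime import datetime

# Event IDs of the "Disk" source. 11 is the controller error mentioned in
# this post; 7 ("bad block") is an assumption for the dead-sector events.
DISK_EVENT_NAMES = {7: "bad block", 11: "controller error"}


def summarize_disk_events(records):
    """Count disk events per ID and find the earliest one.

    `records` is an iterable of (timestamp, source, event_id) tuples,
    a hypothetical shape for rows exported from the system log.
    """
    counts = Counter()
    first_seen = None
    for timestamp, source, event_id in records:
        if source.lower() != "disk":
            continue  # ignore everything that is not a disk event
        counts[event_id] += 1
        if first_seen is None or timestamp < first_seen:
            first_seen = timestamp
    return counts, first_seen


# Made-up sample rows, NOT the real XIN.at log.
sample = [
    (datetime(2015, 3, 14, 2, 10), "Disk", 11),
    (datetime(2015, 3, 14, 2, 11), "Disk", 7),
    (datetime(2015, 4, 1, 9, 0), "Service Control Manager", 7036),
    (datetime(2015, 5, 2, 23, 45), "disk", 11),
]

counts, first_seen = summarize_disk_events(sample)
for event_id, n in sorted(counts.items()):
    print(DISK_EVENT_NAMES.get(event_id, "other"), n)
print("first disk event:", first_seen)
```

The point of tracking the earliest timestamp alongside the counts is that a clean log before a certain date, followed by sporadic errors after it, says more about a dying drive than the raw totals do.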
So yeah, March 2015. Now you’re gonna say “you fuck, you saw it coming a long time ago!!”, and yeah, what can I say, it’s true. I did. But you know, looking away while whistling some happy tune is just too damn easy sometimes.
So, what happened exactly? There are no system memory errors at all, and the last error reported before the BSOD was a disk event ID 11, a controller error. Whether there was another URE (unrecoverable read error / dead sector) as well, I can’t say. But this happened right before the machine went down, so I guess it’s pretty clear: The NT kernel tried to read swapped-out kernel memory back from disk, and the disk error (whether controller error or URE) broke that critical read operation. At that point the kernel space memory can no longer be trusted, and in such a case any kernel has to halt the operating system, as safe operation can no longer be guaranteed.
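To make that chain of events a bit more tangible, here is a toy model in Python. It is emphatically not how the NT memory manager is implemented; `ToyPager`, `InPageError` and `failing_disk` are invented names. It only illustrates the two points from above: a page that is still resident never touches the disk, while a page that really got swapped out depends on a successful disk read, and a failed read leaves nothing sensible to do but halt.

```python
class InPageError(Exception):
    """Stand-in for the bugcheck: a kernel page could not be read back."""


class ToyPager:
    def __init__(self, read_from_disk):
        self.resident = {}       # page number -> bytes currently held in RAM
        self.paged_out = set()   # page numbers that live only on disk
        self.read_from_disk = read_from_disk

    def page_out(self, page):
        # Drop the page from RAM; its contents are assumed to be on disk.
        self.resident.pop(page)
        self.paged_out.add(page)

    def access(self, page):
        if page in self.resident:
            return self.resident[page]  # swappable, but not swapped: no disk I/O
        try:
            data = self.read_from_disk(page)
        except OSError as exc:
            # The read failed, so the page contents are unknown.
            raise InPageError(f"cannot page in page {page}") from exc
        self.paged_out.discard(page)
        self.resident[page] = data
        return data


def failing_disk(page):
    # Simulates the dying drive: every read ends in a controller error.
    raise OSError("controller error")


pager = ToyPager(failing_disk)
pager.resident[1] = b"stack"
pager.resident[2] = b"pool"
pager.page_out(2)

print(pager.access(1))  # resident page, the broken disk does not matter
try:
    pager.access(2)     # swapped-out page, needs the broken disk
except InPageError as err:
    print("halt:", err)
```

This also shows why the machine could run for months with a flaky drive: as long as the affected pages stay in RAM, or the bad reads hit something less critical, nothing fatal happens.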
So, in the next few weeks, I will have to shut the machine down again to replace the drive and restore from a system image to a known good disk. In the meantime I’ll get some properly tested drives and I’m also gonna test the few drives I have in stock myself to find a proper replacement in due time.
Thank god I have that remote KVM and the power cycling capability, so that even a non-ACPI-compliant machine like the XIN.at server can recover from severe errors like this one, no matter where in the world I am. Good thing I spent some cash on an expensive UPS unit with management capabilities and that KVM box…