Aug 28 2016
 

[Image: KERNEL_DATA_INPAGE_ERROR logo]

Here is how a responsible system administrator should handle downtimes and replacements of faulty hardware: Give advance notice to all users and, if possible, make sure everybody has enough time to prepare for services going offline. Specify a precise time window which is as convenient as possible for most users. Also, explain the exact technical reasons in terms as simple as possible.

How did I handle the replacement of XIN’s system hard disk? See that nice blue logo on the top left side? KERNEL_DATA_INPAGE_ERROR, bugcheck code 0x0000007a. And [it isn’t the first of its kind either]; the last one was a KERNEL_STACK_INPAGE_ERROR, clearly disk-related, given that the disk had logged controller errors as well as unrecoverable dead sectors. And NO, that one wasn’t the first of its kind either. :roll: So yeah, I rebooted the [monster], decided that it was too much of a pain in the ass to fix, and hoped (= told myself while in denial) that it would just live on happily ever after! Clearly in ignorance of the obvious problem, just so I could walk over to my workstation, continue to watch some Anime and have a few cold ones in peace…

So, my apologies for being lazy in a slightly dangerous way this time. Well, it’s not like there aren’t any system backups or anything, but still. In the end, it caused an unannounced and unplanned downtime of 3½ hours. This still shouldn’t hurt XIN’s >=99% yearly availability, but it clearly wasn’t the right way to deal with it either…
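Just to put that into perspective, here’s a quick back-of-the-envelope check (a tiny Python sketch, nothing official) of how much downtime a >=99% yearly availability target actually tolerates:

```python
# Rough downtime budget for a >=99% availability target over one year
hours_per_year = 365.25 * 24            # ~8766 hours in a year
budget = hours_per_year * (1 - 0.99)    # ~87.7 hours of downtime allowed
used = 3.5                              # this incident: 3.5 hours unplanned
print(f"budget: {budget:.1f} h, used: {used} h, remaining: {budget - used:.1f} h")
```

So roughly 87 hours of downtime per year would still fit the target; 3½ hours barely makes a dent. It’s still no excuse, of course.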

Well, it’s fixed now, because this time I got a bit nervous and pissed off as well. Thanks to [Umlüx], the XIN server is now running a factory-new HP/Compaq 15,000rpm 68-pin LVD/SE SCSI drive, essentially a Seagate Cheetah 15k.3. As I’m writing this, the drive has accumulated only 2.9 hours of power-on time. Pretty nice to find such pristine hardware!
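In case you’re wondering where that power-on figure comes from: SCSI/SAS drives keep an accumulated power-on counter in their log pages, which smartmontools can read out. A minimal sketch, assuming smartctl is installed on whatever box the drive is hooked up to and that the drive shows up as /dev/sdX (a hypothetical path):

```python
# Read a SCSI drive's accumulated power-on time via smartmontools.
# Assumptions: smartctl is installed, /dev/sdX is a hypothetical device node.
import re
import subprocess

DEV = "/dev/sdX"  # hypothetical device node of the drive being checked

out = subprocess.run(["smartctl", "-a", DEV],
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    # For SCSI/SAS drives smartctl prints something along the lines of
    # "Accumulated power on time, hours:minutes ..." - exact wording may vary.
    if re.search(r"power.?on time", line, re.IGNORECASE):
        print(line.strip())
```

The pattern match is deliberately loose, since the exact wording of that line varies between drives and smartctl versions.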

Thanks do, however, also fly out to [Grindhavoc] and [lommodore] from [Voodooalert], who kindly provided a few drives as well, some of which were quite usable. They’re in storage now, for when the current HP drive starts behaving badly.

Now, let’s hope it was just the disk and not a controller or cabling problem on top of that, but it looks like this should be it for now. One less thing to worry about as well. ;)

Jun 02 2016
 

[Image: KERNEL_STACK_INPAGE_ERROR logo]

Recently, this server (just to remind you: an ancient quad Pentium Pro machine with SCSI storage and FPM DRAM) experienced a 1½ hour downtime due to a KERNEL_STACK_INPAGE_ERROR bluescreen, stop code 0x00000077. Yeah yeah, I’m dreaming about running OpenBSD on XIN.at, but it’s still the same old Windows server system. It bites, but it’s hard to give up and/or migrate certain pieces of software. In any case, what exactly does that mean? In essence, it means that the operating system’s paged pool memory got corrupted. So, heh?

More clearly: it’s either a DRAM error or a disk error, as not all of the paged pool actually needs to be paged out to disk. The paged pool is swappable to disk, but not necessarily swapped to disk. So we need to dig a bit deeper. Since this server has 2-bit error correction and 3-bit error reporting capability for its memory, due to IBM combining the parity FPM-DRAM with additional ECC chips, we can look for ECC/parity error reports in the server’s system log. Also, disk errors should be pretty apparent in the log. And look what we’ve got here (the actual error messages are in German even though the log is being displayed on a remote, English system – well, the server itself runs a German OS):

Actually, when grouping the error log by disk events, I get this:

[Image: 54 disk errors in total – 8 of which were medium errors (dead sectors)]

8 unrecoverable dead sectors and 46 controller errors, starting from March 2015, and nothing before that date. Now, the actual meaning of a “controller error” isn’t quite clear. In the case of SCSI hardware like this, it could be many things, ranging from firmware issues over cabling problems all the way to wrong SCSI bus termination. Judging from the sporadic nature and the limited time window of the errors, I guess it’s really failing electronics in the drive, however. The problems started roughly 10 years after the drive was manufactured, and it’s a 68-pin, 10,000rpm Seagate Cheetah drive with 36GB capacity, by the way.
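By the way, that kind of grouping doesn’t have to be done by clicking through the Event Viewer; it can be scripted as well. Here’s a minimal sketch, assuming pywin32 on the remote (English) Windows box, read access to the server’s event log, and “XINSERVER” as a stand-in for the actual machine name:

```python
# Count the server's "disk" events by event ID, grouped like the screenshot above.
# Assumptions: pywin32 installed, remote read access, "XINSERVER" is a placeholder.
import collections
import win32evtlog  # from the pywin32 package

SERVER = r"\\XINSERVER"   # hypothetical NetBIOS name of the NT server
LOG = "System"

hand = win32evtlog.OpenEventLog(SERVER, LOG)
flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ

counts = collections.Counter()
while True:
    records = win32evtlog.ReadEventLog(hand, flags, 0)
    if not records:          # reached the end of the log
        break
    for rec in records:
        if rec.SourceName.lower() == "disk":
            # The low 16 bits hold the actual event ID (e.g. 11 = controller error)
            counts[rec.EventID & 0xFFFF] += 1

win32evtlog.CloseEventLog(hand)

for event_id, n in counts.most_common():
    print(f"disk event ID {event_id}: {n} entries")
```

The same loop would also catch memory errors; you’d just filter for whatever source the ECC/parity reporting logs under instead of “disk”.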

So yeah, March 2015. Now you’re gonna say “you fuck, you saw it coming a long time ago!!”, and yeah, what can I say, it’s true. I did. But you know, looking away while whistling some happy tune is just too damn easy sometimes. :roll:

So, what happened exactly? There are no system memory errors at all, and the last error reported before the BSOD was a disk event ID 11, a controller error. Whether there was another URE (unrecoverable read error, i.e. a dead sector) as well, I can’t say. But this happened right before the machine went down, so I guess it’s pretty clear: The NT kernel tried to read swapped paged pool memory back from disk, and when the disk error corrupted that critical read operation (whether a controller error or a URE), kernel space memory got corrupted in the process. At that point any kernel has to halt the operating system, as safe operation can no longer be guaranteed.
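If you’d rather not eyeball that timeline in the Event Viewer, the same pywin32 approach from above can correlate the crash record with the last disk error logged before it. A small sketch under the same assumptions; on NT-era systems the crash should be logged by the “Save Dump” source with event ID 1001 (newer Windows uses “BugCheck”), but treat the exact source names as an assumption:

```python
# Find the most recent crash record and the last disk error logged before it.
# Same assumptions as the sketch above: pywin32, remote read access, placeholder name.
import win32evtlog  # from the pywin32 package

def find_crash_and_last_disk_error(server=r"\\XINSERVER"):
    hand = win32evtlog.OpenEventLog(server, "System")
    flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ
    crash_time = None
    try:
        while True:
            records = win32evtlog.ReadEventLog(hand, flags, 0)
            if not records:
                break
            for rec in records:  # newest entries first due to EVENTLOG_BACKWARDS_READ
                source = rec.SourceName.lower()
                if crash_time is None and source in ("save dump", "bugcheck"):
                    crash_time = rec.TimeGenerated          # when the BSOD was logged
                elif crash_time is not None and source == "disk":
                    return crash_time, rec.TimeGenerated, rec.EventID & 0xFFFF
    finally:
        win32evtlog.CloseEventLog(hand)
    return crash_time, None, None

crash, disk_err, disk_id = find_crash_and_last_disk_error()
print("crash logged at:", crash)
print("last disk error before it:", disk_err, "event ID:", disk_id)
```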

So, in the next few weeks, I will have to shut the machine down again to replace the drive and restore from a system image to a known good disk. In the meantime I’ll get some properly tested drives and I’m also gonna test the few drives I have in stock myself to find a proper replacement in due time.
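As for testing the spare drives: a simple read-only surface scan is usually the first step before trusting any of them. A minimal sketch, assuming the drive under test is hooked up to a Linux box and appears as /dev/sdX (a hypothetical path); it just reads the whole device and reports unreadable spots:

```python
#!/usr/bin/env python3
# Read-only surface scan: read the whole device in 1 MiB chunks, report read errors.
# /dev/sdX is a hypothetical path - double-check it before running (needs root).
import os

DEV = "/dev/sdX"
CHUNK = 1024 * 1024  # 1 MiB per read

fd = os.open(DEV, os.O_RDONLY)
size = os.lseek(fd, 0, os.SEEK_END)   # total device size in bytes
os.lseek(fd, 0, os.SEEK_SET)

bad = 0
offset = 0
while offset < size:
    try:
        data = os.read(fd, min(CHUNK, size - offset))
        if not data:
            break
        offset += len(data)
    except OSError as err:
        bad += 1
        print(f"read error at byte offset {offset}: {err}")
        offset += CHUNK                              # skip past the unreadable area
        os.lseek(fd, min(offset, size), os.SEEK_SET)

os.close(fd)
print(f"scan finished, {bad} unreadable chunk(s)")
```

Anything that throws read errors here goes straight to the bin; the rest gets a closer look at its SMART/log pages before being considered a proper replacement.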

Thank god I have remote KVM and power-cycling capabilities, so that even a non-ACPI-compliant machine like the XIN.at server can recover from severe errors like this one, no matter where in the world I am. :) Good thing I spent some cash on an expensive UPS unit with management capabilities and that KVM box…