Jun 02 2016

Recently, this server (just to remind you: an ancient quad Pentium Pro machine with SCSI storage and FPM DRAM) experienced a 1½ hour downtime due to a KERNEL_STACK_INPAGE_ERROR bluescreen, stop code 0x00000077. Yeah yeah, I’m dreaming about running OpenBSD on XIN.at, but it’s still the same old Windows server system. It bites, but certain pieces of software are hard to give up and/or migrate. In any case, what exactly does that stop code mean? In essence, it means that the operating system’s paged pool memory got corrupted. So, heh?

More clearly, the cause is either a DRAM error or a disk error, since not all of the paged pool is necessarily paged out to disk at any given time. The paged pool is swappable to disk, but not necessarily swapped to disk. So we need to dig a bit deeper. Since this server has 2-error correction and 3-error reporting capability for its memory due to IBM combining the parity FPM-DRAM with additional ECC chips, we can look for ECC/parity error reports in the server’s system log. Disk errors should be pretty apparent in the log as well. And look what we’ve got here (the actual error messages are German even though the log is being displayed on a remote, English system – well, the server itself is running a German OS):

Actually, when grouping the error log by disk events, I get this:

54 disk errors in total – 8 of which were medium errors – dead sectors

That’s 8 unrecoverable dead sectors and 46 controller errors, starting in March 2015, with nothing before that date. Now the actual meaning of a “controller error” isn’t quite clear. With SCSI hardware like this, it could be many things, from firmware issues to cabling problems all the way to wrong SCSI bus termination. Judging from the sporadic nature and the limited time window of the errors, however, I guess it’s really failing electronics in the drive. The problems started roughly 10 years after that drive was manufactured, and it’s a 68-pin 10,000 rpm Seagate Cheetah drive with 36GB capacity, by the way.
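For anyone who wants to reproduce that kind of grouping from an exported log, here is a minimal Python sketch. It is not how the numbers above were obtained (those came straight from the event viewer). The function name `summarize_disk_events` and the input list are made up for illustration, and the mapping of event ID 7 to "bad block" is an assumption on my part; only ID 11 (controller error) is named in this post.

```python
from collections import Counter

# Assumed mapping of "disk" event IDs to their meaning.
# ID 11 (controller error) is mentioned in the post; ID 7 (bad block) is an assumption.
DISK_EVENT_NAMES = {7: "bad block", 11: "controller error"}


def summarize_disk_events(event_ids):
    """Count disk events per type, given an iterable of numeric event IDs."""
    counts = Counter(event_ids)
    return {DISK_EVENT_NAMES.get(eid, f"event {eid}"): n for eid, n in counts.items()}


# Placeholder input that merely reproduces the totals quoted above
# (46 controller errors, 8 dead sectors); a real run would read an exported log.
sample_ids = [11] * 46 + [7] * 8
summary = summarize_disk_events(sample_ids)
total = sum(summary.values())

print(summary)
print("total disk errors:", total)
```

Unknown event IDs are kept under a generic "event N" label instead of being dropped, so nothing silently disappears from the total.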

So yeah, March 2015. Now you’re gonna say “you fuck, you saw it coming a long time ago!!”, and yeah, what can I say, it’s true. I did. But you know, looking away while whistling some happy tune is just too damn easy sometimes. :roll:

So, what happened exactly? There are no system memory errors at all, and the last error reported before the BSOD was a disk event ID 11, controller error. Whether there was another URE (unrecoverable read error / dead sector) as well, I can’t say. But this happened right before the machine went down, so I guess it’s pretty clear: the NT kernel tried to read swapped-out kernel paged pool memory back from disk, and the disk error (whether controller error or URE) corrupted that critical read operation. The kernel space memory got corrupted in the process, and in that case any kernel has to halt the operating system, as safe operation can no longer be guaranteed.
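The "what was the last thing logged before the crash" check is easy to automate. Below is a small Python sketch, assuming the log entries are already available as `(timestamp, event_id)` pairs. The helper name `last_event_before` and every timestamp in the example are invented for illustration and are not taken from the actual server log.

```python
from datetime import datetime


def last_event_before(events, crash_time):
    """Return the most recent (timestamp, event_id) pair at or before crash_time.

    Returns None if no event precedes the crash.
    """
    earlier = [e for e in events if e[0] <= crash_time]
    return max(earlier, key=lambda e: e[0]) if earlier else None


# Made-up entries, for illustration only.
events = [
    (datetime(2016, 5, 1, 3, 15), 7),
    (datetime(2016, 5, 20, 22, 40), 11),
    (datetime(2016, 6, 1, 4, 5), 11),
]
crash = datetime(2016, 6, 1, 4, 6)  # hypothetical crash time

hit = last_event_before(events, crash)
print(hit)
```

If the last entry before the crash is a disk event and the memory error log is clean, that points at the disk rather than at DRAM, which is the reasoning used above.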

So, in the next few weeks, I will have to shut the machine down again to replace the drive and restore from a system image to a known good disk. In the meantime I’ll get some properly tested drives and I’m also gonna test the few drives I have in stock myself to find a proper replacement in due time.

Thank god I have that remote KVM and power cycling capability, so that even a non-ACPI compliant machine like the XIN.at server can recover from severe errors like this one, no matter where in the world I am. :) Good thing I spent some cash on an expensive UPS unit with management capabilities and that KVM box…

CC BY-NC-SA 4.0 A hardware failure on XIN.at which has caused an OS kernel crash by The GAT at XIN.at is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

  6 Responses to “A hardware failure on XIN.at which has caused an OS kernel crash”

  1. Really FPM ram? I thought all Pentium Pro platforms were using EDO ram at least! Wow.

    • Hey Sjaak,

      Yes, it’s truly FPM. The server’s 450GX chipset can accept both, but at the time, I couldn’t find any available unbuffered, 72-pin 128MB 60ns EDO-DRAM modules with parity chips. So I went with server memory from Hewlett Packard, which in this case is FPM. Not sure how much of a speed difference there really is though. If I could get my hands on 16×128MB EDO modules with those specs, I’d consider upgrading… But it’s not that easy, especially for an acceptable price.

      Also, in the meantime, several replacement hard drives have been found. Thanks to an old friend and also some guys from [Voodooalert], I got some decent ones. Also found a 10k here that has not even a year of power on hours accumulated, no errors! But the star of the show is that 15k drive from said old friend. It’s a sealed replacement drive by HP (=Seagate), only had 1½ hours on it, probably the factory tests / sector checks. :) Doesn’t get more perfect than this! And the others can stay in storage for the future.

      Now I only need to find the proper time slot to clone the old drive before everything really starts falling apart. I’d like to have as new a clone as possible after all, so that there’s no gap in the log data etc. :)

      • I believe 450GX treats EDO no differently from FPM. EDO is mostly but not 100% backward compatible with FPM AFAIK.

        • Hello Yuhong Bao,

          It’s strangely hard to find clear information about this stuff. Even the [chipset datasheet] doesn’t specify this.

          I remember systems (also stuff like extensible sound cards, not just mainboards), which were specifically not EDO compatible, yet still seemed to work just fine with EDO memory for whatever reason. I gotta say, I don’t really know the specifics of how both memory types work in detail, so I’d need to read some documentation to learn about it. For now I don’t have time for that however.

          Anyway, the original Netfinity 7000 server is a strong indication for EDO working, as it’s using the 450GX as well, but with 168p ECC EDO DRAM using a different memory riser board. That’s how I got the idea that it would “just work™”.

          • There is a Micron technote on this topic: http://web.archive.org/web/20030817233855/http://download.micron.com/pdf/technotes/DT40.pdf

            Another indication is that the Pentium Pro came after the Triton (430FX) chipset which added support for EDO.

            • Hey Bao (is Bao the given name? I always forget with Asian names),

              The 430FX is more modern than 450GX however. Generally, 450GX is a weird chipset, it doesn’t even have PCI 2.1. Heck, my ancient 486 machine based on an ASUS PCI/I486SP3G mainboard even has PCI 2.1! The 450GX however only features PCI 2.0. I know of no other chipset which has actually ever used the 2.0 standard.

              So that’s why I am always a bit skeptical when it comes to these things. My guess would be that it probably works, but won’t use the EDO functionality, as described in that Micron document. One thing strikes me as being strange however: Big OEMs like IBM or HP have used 168p ECC EDO DRAM in their servers. It sounds weird that those guys would use EDO if it isn’t also faster. I mean, something like “Our servers support EDO memory, it just doesn’t do anything better than FPM in our boxes” doesn’t sound as if it would sell well, you know? ;)

              I thought I’d try the Netfinity 7000 memory board (168p EDO) in my PC Server 704, because the systems are almost identical, so all the boards should fit. But I guess I’d better leave it as it is.
