Jun 02 2016
 

Recently, this server (just to remind you: an ancient quad Pentium Pro machine with SCSI storage and FPM DRAM) experienced a 1½ hour downtime due to a KERNEL_STACK_INPAGE_ERROR bluescreen, stop code 0x00000077. Yeah yeah, I’m dreaming about running OpenBSD on XIN.at, but it’s still the same old Windows server system. Bites, but it’s hard to give up and/or migrate certain pieces of software. In any case, what exactly does that mean? In essence, it means that a page of the operating system’s pageable kernel memory (think paged pool) could not be read back into RAM intact. So, heh?

More precisely, the cause is either a DRAM error or a disk error, because not all of the paged pool actually has to be paged out to disk: it is swappable to disk, but not necessarily swapped to disk. So we need to dig a bit deeper. Since this server has 2-error correction and 3-error reporting capability for its memory, thanks to IBM combining the parity FPM DRAM with additional ECC chips, we can look for ECC/parity error reports in the server’s system log. Disk errors should be pretty apparent in the log as well. And look what we’ve got here (the actual error messages are German even though the log is being displayed on a remote, English system, since the server itself is running a German OS):

Actually, when grouping the error log by disk events, I get this:

54 disk errors in total – 8 of which were medium errors – dead sectors

That’s 8 unrecoverable dead sectors and 46 controller errors, starting from March 2015, with nothing before that date. Now the actual meaning of a “controller error” isn’t quite clear. In the case of SCSI hardware like this, it could be many things, ranging from firmware issues to cabling problems all the way to wrong SCSI bus termination. Judging from the sporadic nature and the limited time window of the errors, however, I guess it’s really failing electronics in the drive. The problems started roughly 10 years after that drive was manufactured. It’s a 68-pin, 10,000 rpm Seagate Cheetah drive with 36GB capacity, by the way.
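For anyone who wants to do the same grouping outside the event viewer, here is a minimal sketch. It assumes the system log was exported to CSV with `Source` and `EventID` columns (hypothetical column names; real exports differ by tool and OS language). It also assumes the `disk` source reports bad blocks as event ID 7, which is my assumption about how the medium errors show up; ID 11 for controller errors is what the log here reports. The sample rows are made up for illustration and are not the server’s actual log.

```python
import csv
import io
from collections import Counter

# Assumed mapping of "disk" source event IDs to a readable label.
# ID 11 (controller error) matches what the log in this post shows;
# ID 7 (bad block) is an assumption about how medium errors are logged.
DISK_EVENT_LABELS = {7: "bad block", 11: "controller error"}


def count_disk_events(csv_text):
    """Count events from the 'disk' source, grouped by label."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Source"].strip().lower() != "disk":
            continue  # ignore everything that isn't a disk event
        event_id = int(row["EventID"])
        label = DISK_EVENT_LABELS.get(event_id, f"other ({event_id})")
        counts[label] += 1
    return counts


# Made-up sample export, NOT the real XIN.at log.
SAMPLE = """Source,EventID
disk,11
disk,7
Service Control Manager,7036
disk,11
"""

if __name__ == "__main__":
    for label, n in count_disk_events(SAMPLE).most_common():
        print(f"{n:3d}  {label}")
```

Unknown disk event IDs are kept under an "other" label rather than dropped, so nothing silently disappears from the total.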

So yeah, March 2015. Now you’re gonna say “you fuck, you saw it coming a long time ago!!”, and yeah, what can I say, it’s true. I did. But you know, looking away while whistling some happy tune is just too damn easy sometimes. :roll:

So, what happened exactly? There are no system memory errors at all, and the last error reported before the BSOD was a disk event ID 11, a controller error. Whether there was another URE (unrecoverable read error, i.e. a dead sector) as well, I can’t say. But this happened right before the machine went down, so I guess it’s pretty clear: The NT kernel tried to read swapped-out kernel paged pool memory back from disk, and the disk error (whether controller error or URE) broke that critical read operation. That left kernel space memory corrupted, in which case any kernel has to halt the operating system, as safe operation can no longer be guaranteed.
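The correlation step itself, finding the last logged event before the crash, is simple enough to sketch. This is only an illustration under assumptions: events are represented as `(timestamp, source, event_id)` tuples (a made-up layout), and every timestamp below is invented rather than taken from the real log.

```python
from datetime import datetime


def last_event_before(events, crash_time):
    """Return the latest event strictly before crash_time, or None.

    Each event is a (timestamp, source, event_id) tuple; this layout is
    an assumption for the sketch, not a real event log format.
    """
    earlier = [e for e in events if e[0] < crash_time]
    return max(earlier, key=lambda e: e[0]) if earlier else None


# Invented timestamps for illustration only.
EVENTS = [
    (datetime(2000, 1, 1, 10, 0, 0), "disk", 11),
    (datetime(2000, 1, 1, 12, 30, 0), "disk", 11),
    (datetime(2000, 1, 1, 13, 0, 0), "EventLog", 6008),
]
CRASH = datetime(2000, 1, 1, 12, 31, 0)

if __name__ == "__main__":
    print(last_event_before(EVENTS, CRASH))
```

An event logged after the reboot is ignored because the comparison is strictly "before the crash time".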

So, in the next few weeks, I will have to shut the machine down again to replace the drive and restore from a system image to a known good disk. In the meantime I’ll get some properly tested drives and I’m also gonna test the few drives I have in stock myself to find a proper replacement in due time.

Thank god I have that remote KVM and power cycle capabilities, so that even a non-ACPI compliant machine like the XIN.at server can recover from severe errors like this one, no matter where in the world I am. :) Good thing I spent some cash on an expensive UPS unit with management capabilities and that KVM box…

May 23 2013
 

There is this one story that I wanted to share for quite some time now, but for some reason I always forgot about it. 2 years back I was playing around with this Chinese-developed RISC processor, a Loongson 2F (“龙芯”), which was built into a Lemote Yeeloong 8089B netbook. So I was setting up and patching Debian Linux Wheezy/sid for this very weird machine back then, and at some point I got the idea of upgrading the 1GB of memory to 2GB, which, according to the memory controller specifications, was the absolute maximum. The RAM required was the tricky part though: the memory controller needed DDR-II/667, but in a dense single-rank configuration, which is extremely rare.

On my quest for a 2GB single-rank DDR-II/667 SODIMM I came across a German company that would do a build-to-order, but they required me to order several thousand modules, which was unacceptable, as I only needed a single one. Later I found that huge US memory store called MemoryTen, and they had (and still have!) one fitting module on their website, but it was completely out of stock back then. So I sent them an email, asking whether they would get more in the future.

A day later a man called Sal Scuderi replied and asked if I wanted Kingston instead, because it was cheaper etc. But I insisted on single-rank RAM, after which the man asked me to wait a bit as he would see what he could do.

A few days later he told me that his boss (actually the goddamn company president, how’s that even possible?!) would be inspecting their factory in Irvine, California shortly, and that if I really wanted that module, they could order a single one built just in time, which the company president could then pick up from the manufacturing line after the inspection and take back to their HQ on his flight home! I was stunned by this. Was this guy serious? I mean, this is one damn huge company, would they still have the sort of flexibility to do something like that?

So, overjoyed by this, I confirmed the order. It didn’t even cost extra! How could that be? Mister Scuderi wrote me again, as his big boss had returned, the requested module in his carry-on luggage. They sent it to me, and 5 days later I had that package from the US on my doorstep:

I checked the sticker that showed the manufacturing date, and indeed, back in 2011 the module was absolutely brand new, just a few days old. So this was truly built-to-order? Hard to believe, but obviously true. Now, to my disappointment, the module worked, but not in the actual target machine. While the memory controller supported the module, I hadn’t taken the PMON2000 firmware into account. And as some contacts in China confirmed after I did some research, the 1GB memory limit was hardcoded into the firmware. Bad luck, heh. In the end, I did not dare to try and build my own modified PMON2000, as that was just far too hard and risky. So I left it as it was.

Still, that doesn’t belittle what Sal Scuderi and his boss from MemoryTen did to get me that RAM built! Amazing indeed. And I thought service and support like that (for just $50 total!) didn’t exist anymore. But it seems there are some good people in some good companies left. And I’d say, Sal and MemoryTen are some of them! See my final correspondence with Sal, you’ll see how taken aback I actually was: ;)

> Hello, Sal!

> Uhm. Are you crazy? My apologies for saying that, but you have to be, if you
> just let your factory build me a single module on my request? I know no
> company in the world (as of yet) which would do a build-to-order without
> the customer buying at least a few thousand DIMMs?!?

> But if you really just did that for a single customer buying a single
> specific module, you’ve earned my respect..

> Regards,
> Michael Lackner 

Hi Michael,

It’s a done deal, and yes we are crazy, but it’s a good crazy, we aim
to please, and I’m sure you will be a good sales person besides being
a good customer,

Sal

Not sure what he meant by “sales person” anymore, maybe I should reread the whole conversation, but yeah, you get the idea. :) Awesome stuff!