Here is how a responsible system administrator should handle downtimes and replacements of faulty hardware: Give advance notice to all users and make sure to give everybody enough time to prepare for services going offline, if possible. Specify a precise time window which is as convenient as possible for most users. Also, explain the exact technical reasons in words as simple as possible.
How I handled the replacement of XINs’ system hard disk? See that nice blue logo on the top left side?
KERNEL_DATA_INPAGE_ERROR, bugcheck code
0x0000007a. And [it isn’t the first of its kind either], last one was a
KERNEL_STACK_INPAGE_ERROR, clearly disk related given that the disk had logged controller errors as well as unrecoverable dead sectors. And NO, that one wasn’t the first one too. So yeah, I rebooted the [monster], and decided that it’s too much of a pain in the ass to fix it and hoped (=told myself while in denial) that it would just live on happily ever after! Clearly in ignorance of the obvious problem, just so I could walk over to my workstation and continue to watch some Anime and have a few cold ones in peace…
So, my apologies for being lazy in a slightly dangerous way this time. Well, it’s not like there aren’t any system backups or anything, but still. In the end, it caused an unannounced and unplanned downtime 3½ hours long. This still shouldn’t hurt XINs’ >=99% yearly availability, but it clearly wasn’t the right way to deal with it either…
Well, it’s fixed now, because this time I got a bit nervous and pissed off as well. Thanks to [Umlüx], the XIN server is now running a factory-new HP/Compaq 15000rpm 68p LVD/SE SCSI drive, essentially a Seagate Cheetah 15k.3. As I am writing this the drive has only 2.9h of power on time accumulated. Pretty nice to find such pristine hardware!
Thanks do however also fly out to [Grindhavoc] and [lommodore] from [Voodooalert], who also kindly provided a few drives, of which some were quite usable. They’re in store now, for when the current HP drive starts behaving badly.
Now, let’s hope it was just the disk and no Controller / cabling problem on top of that, but it looks like this should be it for now. One less thing to worry about as well.