Nov 08 2016
 

G.SHDSL extender failure (logo)

…and it wasn’t even my fault! Can you believe it?! Probably not if you know me, but it’s true nonetheless… Almost 4 days of downtime, and we’ve been back up for just about 2½ hours or so. Given that I already had to do maintenance on the server once this year (replacing a bad hard drive, doing a thorough cleaning and installing dust filters), this has crushed the yearly 99%+ availability that I was so proud of. So for the first time since 2006, XIN.at failed to satisfy my personal requirement in that regard. Including the maintenance done on the server and several regular ISP maintenance windows on the G.SHDSL line, the full downtime should now amount to roughly 90 hours in 2016. If we assume a total of 8760 hours per year, I’m now down to an availability of ~98.97%.
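
If you want to double-check that number yourself, it’s just a simple division; here’s a tiny Python sketch of the math (the 90-hour figure is my rough estimate, not an exact log):

    # Rough yearly availability estimate for 2016
    HOURS_PER_YEAR = 8760    # assuming a regular 365-day year, as above
    downtime_hours = 90      # rough estimate: server maintenance + ISP maintenance + extender failure

    availability = (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR
    print(f"Availability: {availability:.2%}")    # prints "Availability: 98.97%"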

That value might get a bit worse though if my ISP decides to do another few rounds of maintenance on the DSLAMs in the automatic exchange hub.

So, how did this happen?

It all began when my RAID-6 started acting up, the one in my workstation though, not in the server. Ok, I know, that’s entirely unrelated, but still. It died an ugly death right there last Friday. And once again (this happened before!) it wasn’t the disks that were to blame, nor the controller, nor the FBM, not even the hotplug bay that I suspected because all disk failures were happening in the same bay. It was the power cable extensions. Again. Even though they’re brand new! I mean, what the hell. At least I now know that an Areca controller can force a RAID-6 array to come back to life even if it has already completely failed with 3+ disks down. Nice one, Areca, I’ll have a cold one in your honor!

And when that RAID was back up, I wanted to pull up my rolling shutters a bit, just because. Which is when the belt ripped in half and the shutters went crashing down, damning me to darkness. Ok, after that I had a beer and just went to bed. Not my day. Next day I did some makeshift repairs on the shutters so they would at least be rolled all the way up and stay there. Having 0% daylight at 09:00am is pretty depressing after all. Ok, after that was done (it was Saturday now), I sat back down in my chair and thought: “Ok, let’s just read my emails…”.

And then my G.SHDSL extender burned up, sending me, my email client, my server and the rest of my digital existence offline…

And that’s when I just knew I had to get up, drive to the supermarket and get a TON of beer!

Seriously… There is bad luck and then there is…

Bad luck never comes alone!

When it rains, it pours, they say.

So, the thing just went dark from one moment to the next! No fan, no LEDs, no nothing. At first I thought it might be its external power supply, some standard 12V DC unit. But I measured the voltage and it was perfectly fine. So the extender itself was obviously dead. Never seen such a thing happen with Paradyne/Zhone hardware, but what can you do. So here’s the new one (or maybe it’s refurbished, you never know with this stuff):

Paradyne/Zhone SNE2040G G.SHDSL network extender (click to enlarge)

Now all that’s left is to send the defective unit back and that’s that. I hope I won’t see anything like that happen again… :( At least I got them on the phone on Saturday (business level support), but I only have the small service level agreement with my current contract, so I couldn’t get a technician on weekends. And I wasn’t available “on-site” (at home) on Monday, so the replacement unit had to be shipped via parcel service.

Oh, and neither the 3G fallback solution nor the large SLA (full 24/7 on-site support) is ever going to happen for XIN.at – too expensive at ~40€ a month. :( There’s only so much money I can pour into a free server, after all.

At least everything is back up now, so cheers! Prost!

Feb 06 2015
 

Network (logo)[1]

Everybody hates servers going offline. Especially email servers. Or web servers. Or MY SERVER! Now, I had prepared for a lot of things with my home server: I prepared for power failures, storage failures, operating system kernel crashes, everything. I thought I could recover from almost any possible breakdown even remotely, all but one: my four bonded G.SHDSL lines all failing at once. Which is exactly what just happened. After lots of calls, and even a replacement Paradyne/Zhone SNE2040G-S network extender having been brought to me within the time allowed by my SLA, all four lines still remained dark.

Now, today the telecommunications company responsible for the national network fixed the issue in the local automatic exchange. I tried to find out what exactly had happened, but ran into walls there. My Internet provider UPC got no information back from the telecommunications company A1 either, or at least nothing besides “it’s been fixed at the digital exchange”. Plus, as I’m not exactly an A1 customer, they won’t answer me directly. The stack is: UPC (Internet provider) <=> Kapsch (field technicians handling UPC-branded Internet access hardware, outsourced by UPC) <=> A1 (field technicians for the whole telecommunications infrastructure), while UPC may also communicate with A1 directly to handle outages. Communication seems to be kept to a minimum though. :(

Bad thing is, for a “business class” line, an outage of almost two days (47 hours) is a bit extreme. In such a case, more efficient communication could easily have resolved it faster. But it is what it is, I guess. And now I have to send one of the two Paradyne/Zhone G.SHDSL extenders back to UPC, this little bugger here:

The actively cooled Zhone SNE2040G-S G.SHDSL extender (click to enlarge)

There is actually an HSDPA (3G) fallback option, which works by implementing an OSI layer 2 coupling between the G.SHDSL line and the 3G access, keeping all IP addresses and domains the same and the services reachable during a complete DSL failure. But I won’t order that upgrade, because it’s a steep 39€ per month before tax, or 46.80€ after tax. That’s just too expensive on top of what that connection’s already draining from my wallet.

All in all, this greatly endangers my usual, self-imposed yearly service availability of >=99%. 47 hours is a lot after all. To maintain 99%, the server cannot go offline for more than 3 days, 15 hours and 36 minutes per regular year – and I already have 1 day and 23 hours on the clock, and it’s just the beginning of the year! Let’s hope it runs more smoothly for the rest of 2015.
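
And for the curious, here’s the same napkin math as a quick Python sketch (numbers taken straight from above; the 47 hours are this outage alone):

    # Downtime budget for a >=99% availability target (regular 365-day year)
    HOURS_PER_YEAR = 8760
    target = 0.99

    budget_hours = HOURS_PER_YEAR * (1 - target)   # 87.6 h = 3 days, 15 hours, 36 minutes
    used_hours = 47                                # this outage alone
    print(f"Budget: {budget_hours:.1f} h, used: {used_hours} h, left: {budget_hours - used_hours:.1f} h")
    # prints "Budget: 87.6 h, used: 47 h, left: 40.6 h"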

[1] Logo image is © Kyle Wickert, Do You Really Understand The Applications Flowing Through Your Network?