Nov 08 2016
 

…and it wasn’t even my fault! Can you believe it?! Probably not if you know me, but it’s true nonetheless… After almost 4 days of downtime, we’ve been back up for just about 2½ hours now. Given that I already had to do maintenance on the server once this year (replacing a bad hard drive, doing a thorough cleaning and installing dust filters), this has crushed the yearly 99%+ availability that I was so proud of. So for the first time since 2006, XIN.at has failed to satisfy my personal requirement in that regard. Including the maintenance done on the server and several regular ISP maintenance windows on the G.SHDSL line, the full downtime should now amount to roughly 90 hours in 2016. If we assume a total of 8760 hours per year, I’m now down to an availability of ~98.97%.
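For anyone who wants to redo the math, here is a minimal sketch of that availability calculation. The function names are made up for illustration, and it uses the same 8760 hours per year assumed above (2016 being a leap year is ignored, just like in the text):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the assumption used in the text


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return availability in percent for a given downtime within a period."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Return how many hours of downtime a given availability target allows."""
    return period_hours * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # Roughly 90 hours of downtime in 2016, as estimated above.
    print(f"Availability: {availability_percent(90):.2f}%")
    # How much downtime a 99% target would have allowed.
    print(f"99% budget:   {downtime_budget_hours(99.0):.1f} h")
```

So a 99% target leaves a budget of a bit under 88 hours per year, which the roughly 90 hours of downtime just barely overshoot.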

That value might get a bit worse though if my ISP decides to do another few rounds of maintenance on the DSLAMs in the automatic exchange hub.

So, how did this happen?

It all began when my RAID-6 started acting up, the one in my workstation though, not in the server. Ok, I know, that’s entirely unrelated, but still. It did not die a pretty death right there last Friday. And once again (this happened before!) it was not the disks that were to blame, nor the controller, nor the FBM, not even the hotplug bay that I suspected because all disk failures were happening in the same bay. It was the power cable extensions. Again. Even though they’re brand new! I mean, what the hell. At least I know now that an Areca controller can force RAID-6 arrays to come back to life even if already completely failed with 3+ disks down. Nice one, Areca, I’ll have a cold one in your honor!

And when that RAID was back up, I wanted to pull up my rolling shutters a bit, just because. Which is when the belt ripped in half and the shutters went crashing down, damning me to darkness. Ok, after that I had a beer and just went to bed. Not my day. Next day I did some makeshift repairs on the shutters so they would at least be rolled all the way up and stay there. Having 0% daylight at 09:00am is pretty depressing after all. Ok, after that was done (it was Saturday now), I sat back down in my chair and thought: “Ok, let’s just read my emails…”.

And then my G.SHDSL extender burned up, sending me, my email client, my server and the rest of my digital existence offline…

And that’s when I just knew I had to get up, drive to the supermarket and get a TON of beer!

Seriously… There is bad luck and then there is…

Bad luck never comes alone!

When it rains, it pours, they say

So, the thing just went dark from one moment to the next! No fan, no LEDs, no nothing. At first I thought it might be its external power supply, some standard 12V DC unit. But I measured the voltage and it was perfectly fine. So the extender itself was obviously dead. Never seen such a thing happen with Paradyne/Zhone hardware, but what can you do. So here’s the new one (or maybe it’s refurbished, you never know with this stuff):

Paradyne/Zhone SNE2040G G.SHDSL network extender

Now all that’s left is to send the defective unit back and that’s that. I hope I won’t see anything like that happen again… :( At least I got them on the phone on Saturday (business level support), but I only have the small service level agreement with my current contract, so I couldn’t get a technician on weekends. And I wasn’t available “on-site” (at home) on Monday, so the replacement unit had to be shipped via parcel service.

Oh, and neither the 3G fallback solution nor the large SLA (full 24/7 on-site support) will ever be agreed upon for XIN.at – too expensive at ~40€ a month. :( There is only so much money I can pour into a free server after all.

At least everything is back up now, so cheers! Prost!

Dec 20 2015
 

And here’s another minor update after [part 4½] of my RAID array progress log. Since I was thinking that weekly RAID verifications would be too much for an array this size (because I thought it would take too long), I set the Areca controller to scrub and error-check my disks at an interval of four weeks. Just a shame that the thing doesn’t feature a proper scheduler with a calendar and configurable starting times for this. All you can tell it is to “check it every n weeks”. In any case, the verify completed last night, running for a total of 29:07:29 (so: 29 hours) across those 12 × 6TB HGST Ultrastar disks, luckily with zero bad blocks detected. Would’ve been a bit early for unrecoverable read errors to occur anyway. ;)

So this amounts to a scrub speed just shy of 550MiB/s, which isn’t blazingly fast for this array, but it’s acceptable I think. The background process priority during this operation was set to “Low (20%)”, and there have been roughly 150GiB of I/O during the disk scrubbing. Most of that I/O was concentrated in one fast Blu-Ray demux, but some video encoders were also running, reading and writing small amounts of data all the time. I guess I can live with that result.
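Here is a small sketch of how that scrub speed can be derived from the run time. Note the assumption: the ~550MiB/s figure only fits if the rate is counted against the usable capacity of the RAID-6 (12 disks minus 2 disks’ worth of parity, with 6TB taken as 6 × 10^12 bytes). Counting all 12 raw disks instead would give roughly 655MiB/s. The function name is made up for illustration:

```python
def scrub_rate_mib_s(num_disks, disk_bytes, parity_disks, duration):
    """Average scrub rate in MiB/s, given a run time as 'HH:MM:SS'.

    The rate is counted against (num_disks - parity_disks) disks, i.e. the
    usable capacity. That basis is an assumption, not something the
    controller reports.
    """
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    counted_bytes = (num_disks - parity_disks) * disk_bytes
    return counted_bytes / total_seconds / 2**20


if __name__ == "__main__":
    disk_bytes = 6 * 10**12  # 6TB, decimal, as sold
    usable = scrub_rate_mib_s(12, disk_bytes, 2, "29:07:29")
    raw = scrub_rate_mib_s(12, disk_bytes, 0, "29:07:29")
    print(f"Counted against usable capacity: {usable:.0f} MiB/s")
    print(f"Counted against raw capacity:    {raw:.0f} MiB/s")
```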

Ah yeah, I should also show you the missing benchmarks, but before that, here’s a more normal photograph of the final system (where “normal” means “not a nightshot”. It does NOT mean “in a proper colorspace”, cause the light sources were heavily mixed, so the colors suck once again! ;) ):

The “Taranis” RAID-6 system during daytime

And here are the missing benchmarks on the finalized array in a normal state. Once again, this is an Areca ARC-1883ix-12 with 12 × HGST Ultrastar 7k6000 6TB SAS disks in RAID-6 at an aligned stripe block size of 64kiB. The controller is equipped with FBM-backed 2GiB of Reg. ECC DDR-III/1866 write-back cache, and each individual drive features 128MiB of write-through cache (I have no UPS unit for this machine, which is why the drive caches themselves aren’t configured for write-back). The controller is configured to read & discard parity data to reduce seeks and is thus tuned for maximum sequential read performance. The benchmarking software was HDTune 2.55 as well as HDTune Pro 5.00:

With those modern Ultrastars instead of the old Seagate Cheetah 15k drives, the only thing that turned out to be worse is the seek time. Given that it’s 3.5″ 7200rpm platters vs. 2.5″ 15000rpm platters, that’s only natural. Sequential throughput is a different story though: At large enough block sizes we get more than 1GiB/s almost consistently, for both reads and writes. Again, I’d have loved to try 4k writes as well, but HDTune Pro would just crash when picking that block size, same as with the Cheetah drives. Anyhow, 4k read performance is nice as well. I’d give you some AS SSD numbers, but it fails to even see the array at all.

What I’ve seen in some other reviews holds true here too though: The Ultrastars do seem to fluctuate somewhat when it comes to performance. We can see that for the 64kiB reads as well as the 512kiB and 1MiB writes. On average though, raw read and write performance is absolutely stellar, just like the ATTO, HDTach and Everest/AIDA64 tests have suggested before. That IBM 1.2GHz [PowerPC 476] dual core chip is truly a monster in comparison to what I’ve seen on older RAID-6 controllers.

I’ve compared this to my old 3ware 9650SE-8LPML (AMCC [PowerPC 405CR] @ 266MHz), to an Adaptec-built ICP Vortex 5085BR (Intel [XScale IOP333] @ 800MHz), both with 8 × 7200rpm SATA disks, and even to a Hewlett Packard MSA2312fc SAN with 12 × 15000rpm SAS Cheetahs (AMD [Turion 64 MT-32] @ 1800MHz). All of them are simply blown out of the water in every way imaginable: Performance, manageability, and if I were to consider the MSA2312fc as a serious contender as well (it isn’t exactly meant as a simple local block device): Stability too. I can’t even tell you how often those freaking management controllers crash on that thing and have to be rebooted via SSH…

So this thing has been up for about 4 weeks now. Still looking good so far…

Summer will be interesting, with some massive heat and all. We’ll see if that’ll trigger the temperature alarms of the HDD bays…

Nov 20 2015
 

Yeah, after [part 3] it should be “part 4”. The final stage. However, while I’d love to present my final ~55TiB RAID-6 to you, I cannot do so yet, because there were and probably are some severe issues with the setup, which I will talk about down below. So, since my level of trust for Seagate is rather low because of the failure rates reported by Backblaze and my own experiences at work as well as the experiences of some other administrators I know, their line of Enterprise disks was out of the game. Another option would’ve been Hitachi’s Helium-filled Ultrastar He8, but since the He6 was reportedly rather disastrous, I don’t really want to trust those drives either.

This Helium stuff is just so new and daring that I don’t want to trust it just yet as the very base of a RAID array that’s supposed to last for many, many years.

Ultimately, I decided to get myself 12 insanely expensive Hitachi Ultrastar 7K6000 disks, “The last in Air” as they call ’em themselves. That’s a classic 5-platter 10-head air-filled enterprise disk with 7200rpm rotational speed and 6TB of capacity. I got the SAS/12Gbps version which also boasts 128MiB of cache. Mechanically, that’s all the same old tech that I’ve already been using with my 8 × 1TB Deskstars and now 8 × 2TB Ultrastars, so it’s something I can trust. However, as I said, there were/are some very serious issues. Maybe you remember this image:

"Helios" RAID-6 array emergency migration

Old array to the left…

So my old RAID-6 based on a 3ware 9650SE-8LPML with 8 × 2TB Ultrastars is sitting on the table, while the new one has been plugged into the Chieftec 2131SAS bays and hooked up to the Areca ARC-1883ix-12. Both RAID systems are thus connected to the same host machine at the same time, making it a total of 20 drives. This is supposed to make data migration using rsync very convenient and easy.

The problem is that I didn’t have enough power connectors for this (12 × SATA for the old array, ODDs and SSD, 8 × 4P Molex for the SAS bays), so I settled for Y-adapters to hook up the new array. Then the trouble started. At first I thought the passive SAS bays were to blame. But as I continued my tests, drives would behave slightly differently as I exchanged and rotated the Y cables. What I observed was some weird “jitter”, where the drive heads were audibly moving around where they shouldn’t have, and sometimes drives would stall for a moment as well.

Ultimately, the array ran into a massive failure during init at about 60%, and 4-5 drives successively failed, collecting tons of recoverable read AND write errors in their S.M.A.R.T. logs. Bleh… At least no unrecoverable ones, but still…

At this point I ripped out half of the Y cables and hooked two of the four bays up to a dedicated power supply (only two, because of a lack of plugs). It seems this greatly changed the behavior of the whole setup, stabilizing it significantly. Of course it’s too early to say anything for sure, because now I’m just at roughly 25% through the second initialization process. But if I’m right, then a few 1€ parts have successfully wrecked a ~8000€ RAID array, now that’s something, eh?

In any case, before getting my Ultrastars I also tried the system with some Seagate Cheetah 15k.6 and 15k.7 drives I managed to borrow at work, 300GB 15000rpm SAS pieces, just for some benchmarks. Since those showed even more severe problems than the Hitachis (probably because they’re more power hungry?), I went down to 11, then 8 drives. Some of the benches will also show sudden stalls. Yeah. That’s the power issue.

Well, it can still serve as a quick glance at the performance levels one can expect with the Areca ARC-1883ix-12, even in such a state. Let me just say: It is a nice feeling to see a RAID array based on mechanical drives push 1000-1200MiB/s over the bus on average, reading at 64kiB-1MiB block sizes. At least that part is undeniably awesome! Here are a few screenshots for you, RAID stripe block sizes are always 64kiB, read block sizes are 4kiB, 64kiB and 1MiB, write block sizes are 64kiB, 512kiB and 1MiB. For the RAID-6 setup there are also benches during init and in 2-disk degraded mode, software’s just a cheap HDTune 2.55 + HDTune Pro 5.00 for now.

Ah yes, you might be wondering why the CPU usage is so high. Well, these were just quick preliminary tests anyway, so some video transcoders were running in the background at the same time, that’s why. Here we go:

RAID-0, 8 × 15000rpm Cheetahs, reads:

RAID-0, 8 × 15000rpm Cheetahs, writes:

RAID-6, 11 × 15000rpm Cheetahs, reads in normal state:

RAID-6, 11 × 15000rpm Cheetahs, reads during initialization:

RAID-6, 11 × 15000rpm Cheetahs, reads in 2-disk degraded mode:

The performance degradation due to the initialization process is somewhat in line with what’s configured on the controller itself, giving the background process a low 20% priority. The degradation in 2-disk degraded mode is what’s really interesting though. Here we can see that the 1.2GHz dual core PowerPC RAID engine is seriously powerful. With double parity computation required on the fly, the array still delivers 64kiB transfer rates in excess of 800MiB/s! That’s insane! I was hoping for normal transfer rates over 600MiB/s, but this really makes one’s mouth water!

Of course, all of this is still preliminary, my array still doesn’t work and these aren’t the final drives running through the tests, nor is the controller fully configured yet. Let’s just hope that I can get a grip on that situation soon… because all these problems are seriously pissing me off already, as you may be able to understand, given the price of the hardware and the pressing issue that I’m running out of space on my old array.

Well, let’s hope a real “part 4” can follow soon!

Edit: And finally, [here it is]!

Jun 05 2015
 

After [part 1] we now get to the second part of the Taranis RAID-6 array and its host machine. This time we’ll focus on the Areca controller fan modification or as I say “Noctuafication”, the real power supply instead of the dead mockup shown before and a modification of it (not the electronics, I won’t touch PSU electronics!) plus the new CPU cooler, which has been designed by a company which sits in my home country, Austria. It’s Noctua’s most massive CPU cooler produced to this date, the NH-D15. Also, we’ll see some new filters applied to the side part of the case, and we’ll take a look at the cable management, which was a job much more nasty than it looks.

Now, let’s get to it and have a look at what was done to the Areca controller:

So as you can see above, the stock heatsink & fan unit was removed. Reason being that it emits a very high-pitched, loud noise, which just doesn’t fit the new machine, which produces more of a low-pitched “wind” sound. In my old box, which features a total of 19 40×40mm fans, you wouldn’t hear the card, but now it’s becoming a disturbance.

Note that when doing this, the Areca’s fan alarm needs to be disabled. For lack of an rpm signal cable, the controller measures the fan’s “speed” by measuring its power consumption. Now the original fan is a 12V DC 0.09A unit, whereas the Noctua only needs 0.06A, thus triggering the controller’s audible alarm. In my case that’s not so troublesome. Even if the fan failed – which is highly unlikely for a Noctua in its first 10 years or so – there are still the two 120mm side fans.
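To put those two fans side by side, here is a tiny sketch using only the rated figures from above (12V, 0.09A vs. 0.06A). The controller’s actual alarm threshold is not known to me, so the code deliberately doesn’t model it. It only shows how much less the Noctua draws:

```python
SUPPLY_VOLTAGE_V = 12.0  # both fans are 12V DC units


def fan_power_w(current_a, voltage_v=SUPPLY_VOLTAGE_V):
    """Rated electrical power of a fan in watts (P = U * I)."""
    return voltage_v * current_a


stock_w = fan_power_w(0.09)   # original Areca fan
noctua_w = fan_power_w(0.06)  # Noctua replacement
reduction = 1.0 - noctua_w / stock_w

if __name__ == "__main__":
    print(f"Stock fan:  {stock_w:.2f} W")
    print(f"Noctua fan: {noctua_w:.2f} W")
    print(f"Reduction:  {reduction:.0%}")
```

A fan drawing a third less than what the controller expects is a plausible reason for it to assume a stalling fan, but that interpretation is a guess on my part.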

Cooling efficiency is slightly lower now, with the temperature of the dual-core 1.2GHz PowerPC 476FP CPU going from ~60°C to ~65°C, but that’s still very much ok. The noise? Pretty much gone!

Now, to the continued build:

So there it is, although not yet with final hardware all around. In any case, even with all that storage goodness sitting in there, the massive Noctua NH-D15 simply steals the show here. That Xeon X5690 will most definitely never encounter any thermal issues! And while the NH-D15 doesn’t come with any S1366 mounting kit, Noctua will send you one NM-I3 for free, if you email them your mainboard or CPU receipt as well as the NH-D15 receipt to prove you own the hardware. Pretty nice!

In total we can see that cooler, the ASUS P6T Deluxe mainboard, the 6GB RAM that are just there for testing, the Areca ARC-1883ix-12, a Creative Soundblaster X-Fi XtremeMusic, and one of my old EVGA GTX580 3GB Classified cards. On the top right of the first shot you can also spot the slightly misaligned Areca flash backup module again.

While all my previous machines were in absolute chaos, I wanted to have this ONE clean build in my life, so there it is. For what’s inside in terms of cables, very little can be seen really, considering that there are 12 SAS lanes, 4 SATA cables, tons of power cables and extensions, USB+FW cables, fan cables, an FDD cable, 12 LED cathode traces bundled into 4 cables for the RAID status/error LEDs and I don’t know what else. Also, all the internal headers are used up: 4 × USB for the front panel, one for the combo drive’s card reader and one for the Corsair Link USB dongle of the power supply, plus an additional mini-Firewire connector at the rear.

Talking about the cabling, I found it nearly impossible to even close the rear lid of the tower, because the Great Cthulhu was literally sitting back there. It may not look like it, but it took me many hours to get it under some control:

Cable chaos under control

That’s a ton of cables. The thingy in the lower right is a Corsair Link dongle bridging the PSU’s I²C header to USBXpress, so you can monitor the power supply in MS Windows.

Now it can be closed without much force at least! Lots of self-adhesive cable clips and some pads were used here, but that’s just necessary to tie everything down, otherwise it just won’t work at all. Two fan cables and resistors are sitting there unused, as the fans were re-routed to the mainboard headers instead, but everything else you can see here is actually necessary and in use.

Now, let’s talk about the power supply. You may have noticed it already in the pictures above, but this Corsair AX1200i doesn’t look like it should. Indeed, as said, I modified it with an unneeded fan grill I took out of the top of the Lian Li case. The reason is that this way you can’t accidentally drop any screws into the PSU when working on the machine, and that can happen very quickly. If you miss just one, you’re in for one nasty surprise when turning the machine on! Thanks fly out to [CryptonNite], who gave me that idea. Of course you could just turn the PSU around and let it suck in air from the floor (the Lian Li PC-A79B supports this), but I don’t want to have to tend to the bottom dust filter all the time. So here’s what it looks like:

A modified Corsair Professional Series Platinum AX1200i. Screws are no danger anymore!

With 150W of power at +5V, this unit should also be good enough for driving all those HDD electronics. Many powerful PSUs largely ignore that part and only deliver a lot at +12V for CPUs, graphics cards etc. Fact is, for hard drives you still need a considerable amount of 5V power! Looking at Seagate’s detailed specifications for some of the newer enterprise drives, you can see a peak current of 1.45A @ 5V in a random write scenario, which means 1.45A × 5V = 7.25W per disk, or 12 × 7.25W = 87W total for 12 drives. That, plus USB requiring +5V and some other stuff. So with 150W I should be good. That’s exactly the power that my beloved old Tagan PipeRock 1300W PSU also provided on that rail.
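The +5V budget from above as a small sketch, in case you want to plug in the numbers of your own drives and PSU. The function name is made up. The figures are the ones from the text (1.45A peak per disk, 12 disks, 150W on the rail), and whatever USB and the rest draw from +5V is not included:

```python
def rail_budget(num_disks, peak_current_a, rail_voltage_v, rail_capacity_w):
    """Return (watts per disk, total watts, remaining headroom) on one rail."""
    per_disk_w = peak_current_a * rail_voltage_v
    total_w = num_disks * per_disk_w
    headroom_w = rail_capacity_w - total_w
    return per_disk_w, total_w, headroom_w


if __name__ == "__main__":
    per_disk_w, total_w, headroom_w = rail_budget(
        num_disks=12, peak_current_a=1.45, rail_voltage_v=5.0, rail_capacity_w=150.0
    )
    print(f"Per disk:  {per_disk_w:.2f} W")
    print(f"12 disks:  {total_w:.2f} W")
    print(f"Headroom:  {headroom_w:.2f} W for USB and everything else on +5V")
```

Keep in mind that this is a worst case with all drives hitting their peak 5V current at the same time, so the real headroom should usually be larger.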

Now, as for the side panels:

And one more, an idea I got from an old friend of mine, [Umlüx]. Since I might end up with a low pressure case with more air being blown out rather than sucked in, dust may also enter through every other unobstructed hole, and I can’t have that! So we shut everything out using duct tape and paper inlets (a part of which you have maybe seen on the power supply closeup already):

The white parts are all paper with duct tape behind it. The paper is there so that the sticky side of the tape doesn’t attract dust, which would give the rear a very ugly look otherwise. As you can see, everything is shut tight, even the holes of the controller card. No entry for dust here!

That’s it for now, and probably for a longer time. The next thing is really going to be the disks, and since I’m going for 6TB 4Kn enterprise disks, it’s going to be terribly expensive. And that alone is not the only problem.

First we got the weak Euro now, which seems to be starting to hurt disk drive imports, and then there is this crazy storage “tax” (A literal translation would be “blank media compensation”) that we’re getting in October after years of debate about it in my country. The tax is basically supposed to cover the monetary loss of artists due to legal private recordings from radio or TV stations to storage media. The tax will affect every device that features any kind of storage device, whether mechanical/magnetic, optical or flash. That means individual disks, SSDs, blank DVDs/BDs, USB pendrives, laptops, desktop machines, cellphones and tablets, pretty much everything. Including enterprise class SAS drives.

Yeah, talk about some crazy and stupid “punish everybody across the board for what only a few do”! Thanks fly out to the Austro Mechana (“AUME”, something like “GEMA” in Germany) and their fat-ass friends for that. Collecting societies… legal, institutionalized, large-scale crime if you ask me.

But that means that I’m between a rock and a hard place. What I need to do is to find the sweet spot between the idiot tax and the Euro’s exchange rate, taking natural price decline into account as well. So it’s going to be very hard to pick the right time to buy those drives. And as I am unwilling to step down to much cheaper 512e consumer – or god forbid shingled magnetic recording – drives with read error rates as high as <1 in 10^14 bits, we’re talking ~6000€ here at current prices. Since it’s 12 drives, even a small price drop will already have a great effect.
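To illustrate why that error rate matters at these capacities, here is a rough sketch that takes the “1 in 10^14 bits” spec at face value and asks what happens when a single 6TB disk is read once from start to end. Loud assumptions: the spec is treated as an exact per-bit probability, bit errors are treated as independent, and the 1 in 10^15 figure for enterprise-class drives is my own assumption for comparison, not a number from this post. Real drives usually do better than their worst-case spec:

```python
import math


def expected_read_errors(bytes_read, bits_per_error):
    """Expected number of unrecoverable read errors for a given read volume."""
    return bytes_read * 8 / bits_per_error


def prob_at_least_one_error(bytes_read, bits_per_error):
    """Probability of one or more errors, assuming independent bit errors.

    Computes 1 - (1 - p)^n via log1p/expm1 to stay accurate for tiny p.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


if __name__ == "__main__":
    disk_bytes = 6 * 10**12  # one 6TB disk, read once in full
    specs = (
        ("1 in 10^14 (consumer class)", 10**14),
        ("1 in 10^15 (enterprise class, assumed)", 10**15),
    )
    for label, bits_per_error in specs:
        expected = expected_read_errors(disk_bytes, bits_per_error)
        probability = prob_at_least_one_error(disk_bytes, bits_per_error)
        print(f"{label}: {expected:.3f} expected errors, "
              f"P(at least one) = {probability:.1%}")
```

RAID-6 can of course still correct such errors as long as enough parity is left, but during a rebuild with drives already missing that safety margin shrinks quickly.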

We’ll see whether I’ll manage to make a good decision on that front. Also, space on my current array is getting less and less by the week, which is yet another thing I need to keep my eyes on.

Edit: [Part 3 is now ready]!