Nov 22 2016

FreeBSD IBM ServeRAID Manager logo

And yet another FreeBSD-related post: After [updating] the IBM ServeRAID Manager on my old Windows 2000 server I wanted to run the management software on every possible client. Given that it’s Java stuff, that shouldn’t be too hard, right? It turned out not to be too easy either. Just copying the .jar file over to Linux or UNIX and running it like $ java -jar RaidMan.jar wouldn’t do the trick; I got nothing but some exception I didn’t understand. I wanted to have it work on XP x64 (easy, just use the installer) and Linux (also easy) as well as FreeBSD. But there is no version for FreeBSD?!

The ServeRAID v9.30.21 manager only supports the following operating systems:

  • SCO OpenServer 5 & 6
  • SCO Unixware 7.1.3 & 7.1.4
  • Oracle Solaris 10
  • Novell NetWare 6.5
  • Linux (only certain older distributions)
  • Windows (2000 or newer)

I started by installing the Linux version on my CentOS 6.8 machine. It does come with some platform-specific libraries as well, but those are only needed for running the actual RAID controller management agent, which interfaces with the driver on the machine hosting the ServeRAID controller. I only needed the user space client program, which is 100% Java, so all I was missing was the proper invocation to run it! By studying IBM’s own launcher scripts, I came up with a very simple way of starting the manager on FreeBSD with the following script (Java is required, naturally):

  #!/bin/sh
  # ServeRAID Manager launcher script for FreeBSD UNIX
  # written by GAT.
  # Requirements: An X11 environment and java/openjdk8-jre

  curDir="$(pwd)"
  baseDir="$(dirname "$0")/"

  mkdir ~/.serveraid 2>/dev/null
  cd ~/.serveraid/

  java -Xms64m -Xmx128m -jar "${baseDir}RaidMan.jar" "$@" \
      < /dev/null >> RaidMan_StartUp.log 2>&1

  # the manager drops its .pps files into $HOME; keep them in our config dir
  mv ~/RaidAgnt.pps ~/RaidGUI.pps ~/.serveraid/ 2>/dev/null
  cd "$curDir"

Now, with that alone you probably still can’t run everything locally (= on a FreeBSD machine with a ServeRAID SCSI controller) because of the Linux-only agent libraries. I haven’t tried running those components under the Linuxulator, nor do I care to. But what I can do is launch the ServeRAID manager and connect to a remote agent running on Linux, Windows or whatever else is supported.

Now, since this server/client communication probably isn’t secured at all (no SSL/TLS as far as I can tell), I’m running it through an SSH tunnel. However, the manager refuses to connect to a local port, because host names like “localhost” or the loopback address make it think you want to connect to an actual local RAID controller. It would refuse to add such a host, because an undeletable “local machine” entry is always set up to begin with, and that one won’t work with an SSH tunnel as it’s probably not talking over TCP/IP. This can be circumvented easily though!

Open /etc/hosts as root and add an additional made-up host name to the loopback entries. I did it like this with “xin”:

::1			localhost xin
127.0.0.1		localhost xin

Now I had a new host “xin” that the ServeRAID manager wouldn’t complain about. Next, set up the SSH tunnel to the target machine; I put that part into a script under /usr/local/sbin/. Here’s an example, where 34571 is the ServeRAID agent’s default TCP listen port and the forwarding target shall be the LAN IP of the remote machine hosting the ServeRAID array:

ssh -fN -p22 -L34571:&lt;LAN IP of the ServeRAID machine&gt;:34571 mysshuser@&lt;SSH server host name&gt;

You’d also need to replace “mysshuser” with your user name on the remote machine, and the SSH server host name with the Internet host name of the server via which you can reach the ServeRAID machine. That might be the same machine, or a port forward to some box within the remote LAN.
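Something like the following sketch could serve as that tunnel script (user name, gateway host and LAN IP are placeholders, not real values from this post; the echo makes it a dry run, remove it to actually open the tunnel):

```shell
#!/bin/sh
# Sketch of a tunnel launcher script for the ServeRAID manager.
remote_user="mysshuser"
gateway="www.example.com"     # Internet host name of the SSH gateway (placeholder)
raid_ip="192.168.0.10"        # LAN IP of the ServeRAID machine (placeholder)
port=34571                    # ServeRAID agent's default TCP listen port

# dry run: prints the command; drop the echo to really open the tunnel
echo ssh -fN -p22 -L "${port}:${raid_ip}:${port}" "${remote_user}@${gateway}"
```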

Now you can open the ServeRAID manager and connect to the made-up host “xin” (or whichever name you chose), piping traffic to and from the ServeRAID manager through a strongly encrypted SSH tunnel:

IBM ServeRAID Manager on FreeBSD

It even detects the local system’s operating system “FreeBSD” correctly!


IBM ServeRAID Manager on FreeBSD

Accessing a remote Windows 2000 server with a ServeRAID II controller through an SSH tunnel, coming from FreeBSD 11.0 UNIX

IBM should’ve just given people the RaidMan.jar file plus a few launcher scripts, so the client side could run on any operating system with a Java runtime environment, whether Windows, some obscure UNIX flavor, or something else entirely. Well, as it stands, it ain’t as straightforward as it may be on Linux or Windows, but this FreeBSD solution should work similarly on other systems as well, e.g. Apple MacOS X or HP-UX. I tested this with the Sun JRE 1.6.0_32, the Oracle JRE 1.8.0_112 and OpenJDK 1.8.0_102 for now, and even though the manager was originally built for Java 1.4.2, it still works just fine.

Actually, it works even better than with the original JRE bundled with RaidMan.jar, at least on MS Windows (no more GUI glitches).

And for the easy way, here’s the [package]! Unpack it wherever you like, maybe to /usr/local/. On FreeBSD you need [archivers/p7zip] to unpack it, a preferably modern Java version like [java/openjdk8-jre], as well as X11 to run the GUI. For easy binary installation: # pkg install p7zip openjdk8-jre. To run the manager you don’t need any root privileges; you can execute it as a normal user, maybe like this:

$ /usr/local/RaidMan/

Please note that my script will create your ServeRAID configuration in ~/.serveraid/, so if you want to run it as a different user or on a different machine later on, you should recursively copy that directory to the new user/machine. That’ll retain the local client configuration.
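That copy step could look something like this minimal sketch (the target path is just an example standing in for the new user’s home; for a remote machine you’d use scp -pr instead of cp):

```shell
# Carry the per-user ServeRAID manager configuration over to a new home
# directory. /tmp/newhome is a placeholder; adapt as needed.
src="${HOME}/.serveraid"
dst="/tmp/newhome"

mkdir -p "$src"        # no-op if the configuration already exists
mkdir -p "$dst"
cp -pR "$src" "$dst/"  # for another machine: scp -pr "$src" user@host:~/
```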

That should do it! :)

Nov 21 2016

IBM ServeRAID Manager logo

Believe it or not, the server hosting the very web site you’re reading right now has all of its data stored on an ancient IBM ServeRAID II array made in the year 1995. That makes the SCSI RAID-5 controller 21 years old, and the 9.1GB SCA drives attached to it via hot-plug bays are from 1999, so 17 years old. Recently I found out that IBM’s latest SCSI ServeRAID manager from 2011 still supports that ancient controller as well as the almost equally ancient Windows 2000 Server I’m running on the machine. Hoping for better management functionality, I chose to give the new software a try. So in addition to my antiquated NT4 ServeRAID manager v2.23.3 I’d also run v9.30.21 side by side! This is also in preparation for a potential upgrade to a much newer ServeRAID-4H and larger SCSI drives.

Just so you know how the old v2.23.3 looks, here it is:

IBM ServeRAID Manager v2.23.3

IBM ServeRAID Manager v2.23.3

It really looks like 1996-1997 software, doesn’t it? It can do the most important tasks, but there are two major drawbacks:

  1. It can’t notify me of any problems via eMail
  2. It’s a purely standalone software, meaning no server/client architecture => I have to log in via KVM-over-IP or SSH+VNC to manage it

So my hope was that the new software would have a server part and a detachable client component as well as the ability to send eMails whenever shit happens. However, when first launching the new ServeRAID manager, I was greeted with this:

ServeRAID Manager v9.30.21 GUI failure

Now this doesn’t look right… (click to enlarge)

Note that this was my attempt to run the software on Windows XP x64. On Windows 2000 it looked a bit better, but still somewhat messed up. Certain GUI elements would pop up on mouseover, but overall the program just wasn’t usable. After finding out that this is Java software executed by a bundled and ancient version of Sun Java (v1.4.2_12), I just tried to run the RaidMan.jar file with my platform Java. On XP x64 that’s the latest and greatest Java 1.8u112 (even though the installer says it needs a newer operating system, this seems to work just fine), and on Windows 2000 it’s the latest version supported on that OS: Java 1.6u31. To make RaidMan.jar run on a different JRE on Windows, you can just alter the shortcut the installer creates for you:

Changing the JRE that ServeRAID Manager should be executed by

Changing the JRE that ServeRAID Manager should be executed by

Here it’s run by the javaw.exe command that an old JDK 1.7.0 installer created in %WINDIR%\system32\. It was only later that I changed it to 1.8u112. After changing the JRE to a more modern one, everything magically works:
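For reference, the shortcut’s Target line then looks something like this (both paths are merely examples and depend on where your JRE and the manager ended up on your system):

```
"C:\Program Files\Java\jre1.8.0_112\bin\javaw.exe" -jar "C:\Program Files\RaidMan\RaidMan.jar"
```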

ServeRAID Manager v9.30.21, logged in

ServeRAID Manager v9.30.21, remotely logged in to my server (click to enlarge)

And this is already me having launched the Manager component on a different machine on my LAN, connecting to the ServeRAID agent service running on my server. So that part works. Since this software also runs on Linux and FreeBSD UNIX, I can set up a proper SSH tunnel script to access it remotely and securely from the outside world as well. Yay! Clicking on the controller gave me this:

ServeRAID Manager v9.30.21 array overview

Array overview (click to enlarge)

Ok, this reminds me of Adaptec’s/ICP’s StorMan, and since there is some Adaptec license included on the IBM Application CD this version came from, it might very well be practically the same software. It does show warnings on all drives, while the array and volume are “ok”. The warnings are pretty negligible though, as you can already see above; let’s have a more detailed look:

ServeRAID Manager v9.30.21 disk warranty warnings

So I have possibly non-warranted drives? No shit, Sherlock! Most of them are older than the majority of today’s Internet users… I still don’t get how 12 of these drives are still running, seriously… (click to enlarge)

So that’s not really an issue. But what about eMail notifications? Well, take a look:

ServeRAID Manager v9.30.21 notification options

It’s there! (click to enlarge)

Yes! It can notify to the desktop, to the system log and to various eMail recipients. Also, you can choose who gets which mails by selecting different log levels for different recipients. The only downside is that the ServeRAID manager doesn’t allow SSL/TLS connections to mail servers and can’t even provide any login data. As such, you need your own eMail server on your local network which allows unauthenticated and unencrypted SMTP access from the IP of your ServeRAID machine. In my case that’s no problem, so I can now get eMail notifications to my home and work addresses, as well as an SMS by using my 3G provider’s eMail-2-SMS gateway!

On top of that, you can of course check out disk and controller status as well:

ServeRAID Manager v9.30.21 disk status

Disk status – not much to see here at all (on any of the tabs), probably because the old ServeRAID II can’t do S.M.A.R.T. Maybe it’s good that it can’t; I don’t really want to see 17 year old hard drives’ S.M.A.R.T. logs anyway. ;)


ServeRAID Manager v9.30.21 controller status

Status of my ServeRAID II controller, no battery backup unit attached for the 4MB EDO-DRAM write cache and no temperature sensors present, so not much to see here either.

Now there is only one problem with this, and that is that the new ServeRAID agent service consumes quite a lot of CPU power in the background, showing up as 100% peaks on a single CPU core every few seconds. This is clearly visible in my web-based monitoring setup:

ServeRAID Manager v9.30.21 agent CPU load

The background service is a bit too CPU hungry for my taste (Pentium Pro™ 200MHz). The part left of the “hole” is before installation, the part right of it after installation.

And in case you’re wondering what that hole between about 20:30 and 22:00 is: that’s the ServeRAID Manager’s SNMP components, which killed my Microsoft SNMP services upon installation. My network and CPU monitoring solution is based on SNMP though, so that was not good. Luckily, just restarting the SNMP services fixed it. However, as you can see, one of the slow 200MHz cores is now under much higher load. I don’t like that, because I’m short on CPU power all the time anyway, but I’ll leave it alone for now; let’s see how it goes.

ServeRAID Manager v9.30.21 splash screen

“Fast configuration”, but a pretty slow background service… :roll:

Now all I need to get is a large pack of large SCA SCSI drives, since I’ve had that much faster [ServeRAID 4H] with 128MB SDRAM cache and BBU lying around for 3 years anyway! Ah, and as always, the motivation to actually upgrade the server. ;)

Edit: It turns out I found the main culprit for the high CPU load. It seems to be IBM’s [SNMP sub-agent component] after all, the one that also caused my SNMP service to shut down upon installation. Uninstalling the ServeRAID Manager v9.30.21 and reinstalling it with the SNMP component deselected resulted in a different load profile. See the following graph; the vertical red line separates the state before (with SNMP sub-agent) from the state after (without SNMP sub-agent). Take a look at the magenta line depicting the CPU core that the RAID service was bound to:

ServeRAID Manager v9.30.21 with reduced CPU load

Disabling the ServeRAID manager’s SNMP sub-agent lowers the CPU load significantly!

Thanks fly out to [these guys at Ars Technica] for giving me the right idea!

Aug 28 2016

KERNEL_DATA_INPAGE_ERROR logo

Here is how a responsible system administrator should handle downtimes and replacements of faulty hardware: Give advance notice to all users and make sure everybody has enough time to prepare for services going offline, if possible. Specify a precise time window that is as convenient as possible for most users. Also, explain the exact technical reasons in words as simple as possible.

How did I handle the replacement of XIN’s system hard disk? See that nice blue logo at the top left? KERNEL_DATA_INPAGE_ERROR, bugcheck code 0x0000007a. And [it isn’t the first of its kind either]; the last one was a KERNEL_STACK_INPAGE_ERROR, clearly disk-related given that the disk had logged controller errors as well as unrecoverable dead sectors. And NO, that one wasn’t the first one either. :roll: So yeah, I rebooted the [monster] and decided that it was too much of a pain in the ass to fix, and hoped (= told myself while in denial) that it would just live on happily ever after! Clearly in ignorance of the obvious problem, just so I could walk over to my workstation, continue to watch some Anime and have a few cold ones in peace…

So, my apologies for being lazy in a slightly dangerous way this time. It’s not like there aren’t any system backups, but still. In the end it caused an unannounced and unplanned downtime 3½ hours long. This still shouldn’t hurt XIN’s >=99% yearly availability, but it clearly wasn’t the right way to deal with it either…
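A quick sanity check on that availability figure, assuming this one-off 3½-hour outage is the only downtime of the year:

```shell
# 210 minutes of downtime against the 525600 minutes of a year
awk 'BEGIN { printf "availability: %.2f%%\n", 100 * (1 - 210 / 525600) }'
```

That prints 99.96%, so still comfortably above the 99% target.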

Well, it’s fixed now, because this time I got a bit nervous and pissed off as well. Thanks to [Umlüx], the XIN server is now running a factory-new HP/Compaq 15000rpm 68p LVD/SE SCSI drive, essentially a Seagate Cheetah 15k.3. As I write this, the drive has accumulated only 2.9 hours of power-on time. Pretty nice to find such pristine hardware!

Thanks do however also fly out to [Grindhavoc]German flag and [lommodore]German flag from [Voodooalert]German flag, who also kindly provided a few drives, of which some were quite usable. They’re in store now, for when the current HP drive starts behaving badly.

Now, let’s hope it was just the disk and not a controller or cabling problem on top of that, but it looks like this should be it for now. One less thing to worry about as well. ;)

Jun 02 2016

KERNEL_STACK_INPAGE_ERROR logo

Recently, this server (just to remind you: an ancient quad Pentium Pro machine with SCSI storage and FPM DRAM) experienced a 1½-hour downtime due to a KERNEL_STACK_INPAGE_ERROR bluescreen, stop code 0x00000077. Yeah yeah, I’m still dreaming about running OpenBSD on it, but it’s the same old Windows server system. It bites, but it’s hard to give up and/or migrate certain pieces of software. In any case, what exactly does that error mean? In essence, it means that the operating system’s paged pool memory got corrupted. So, heh?

More precisely, it’s either a DRAM error or a disk error, as not the entire paged pool actually needs to be paged out to disk: the paged pool is swappable to disk, but not necessarily swapped to disk at any given time. So we need to dig a bit deeper. Since this server has 2-bit error correction and 3-bit error reporting capability for its memory, thanks to IBM combining the parity FPM-DRAM with additional ECC chips, we can look for ECC/parity error reports in the server’s system log. Disk errors should also be pretty apparent in the log. And look what we’ve got here (the actual error messages are German even though the log is being displayed on a remote, English system – well, the server itself is running a German OS):

Actually, when grouping the error log by disk events, I get this:

54 disk errors in total - 8 of which were dead sectors

54 disk errors in total – 8 of which were medium errors – dead sectors

8 unrecoverable dead sectors and 46 controller errors, starting from March 2015 and nothing before that date. Now, the actual meaning of a “controller error” isn’t quite clear. With SCSI hardware like this it could be many things, ranging from firmware issues over cabling problems all the way to wrong SCSI bus termination. Judging from the sporadic nature and the limited time window of the errors, I guess it’s really failing electronics in the drive, however. The problems started roughly 10 years after the drive was manufactured; it’s a 68-pin 10,000rpm Seagate Cheetah drive with 36GB capacity, by the way.

So yeah, March 2015. Now you’re gonna say “you fuck, you saw it coming a long time ago!!”, and yeah, what can I say, it’s true. I did. But you know, looking away while whistling some happy tune is just too damn easy sometimes. :roll:

So, what happened exactly? There are no system memory errors at all, and the last error reported before the BSOD was a disk event ID 11, a controller error. Whether there was another URE (unrecoverable read error / dead sector) as well, I can’t say. But this happened right before the machine went down, so I guess it’s pretty clear: The NT kernel tried to read swapped kernel paged pool memory back from disk, and when the disk error corrupted that critical read operation (whether controller error or URE), the kernel space memory got corrupted in the process. In such a case, any kernel has to halt the operating system, as safe operation can no longer be guaranteed.

So, in the next few weeks, I will have to shut the machine down again to replace the drive and restore from a system image to a known good disk. In the meantime I’ll get some properly tested drives and I’m also gonna test the few drives I have in stock myself to find a proper replacement in due time.

Thank god I have that remote KVM and power cycle capabilities, so that even a non-ACPI compliant machine like the server can recover from severe errors like this one, no matter where in the world I am. :) Good thing I spent some cash on an expensive UPS unit with management capabilities and that KVM box…

Dec 20 2015

Taranis RAID-6 logo

And here’s another minor update after [part 4½] of my RAID array progress log. Since I was thinking that weekly RAID verifications would be too much for an array this size (because I thought they’d take too long), I set the Areca controller to scrub and error-check my disks at an interval of four weeks. It’s just a shame that the thing doesn’t feature a proper scheduler with a calendar and configurable starting times; all you can tell it is to “check every n weeks”. In any case, the verify completed this night, running for a total of 29:07:29 (so: 29 hours) across those 12 × 6TB HGST Ultrastar disks, luckily with zero bad blocks detected. It would’ve been a bit early for unrecoverable read errors to occur anyway. ;)

So this amounts to a scrub speed just shy of 550MiB/s, which isn’t blazingly fast for this array, but it’s acceptable I think. The background process priority during this operation was set to “Low (20%)”, and there have been roughly 150GiB of I/O during the disk scrubbing. Most of that I/O was concentrated in one fast Blu-Ray demux, but some video encoders were also running, reading and writing small amounts of data all the time. I guess I can live with that result.
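That speed figure can be roughly reproduced, assuming the verify rate is counted against the array’s net capacity (10 data drives’ worth in RAID-6) rather than the raw capacity of all 12 spindles:

```shell
# Net RAID-6 capacity (12 drives minus 2 for parity, 6 TB each) divided
# by the verify duration of 29:07:29.
net_bytes=$(( 10 * 6 * 1000000000000 ))   # 60 TB net
seconds=$(( 29 * 3600 + 7 * 60 + 29 ))    # 104849 s
echo "$(( net_bytes / seconds / 1048576 )) MiB/s"
```

That works out to 545 MiB/s, which indeed lands just shy of 550MiB/s.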

Ah yeah, I should also show you the missing benchmarks, but before that, here’s a more normal photograph of the final system (where “normal” means “not a nightshot”. It does NOT mean “in a proper colorspace”, cause the light sources were heavily mixed, so the colors suck once again! ;) ):

The Taranis system during daytime

The “Taranis” RAID-6 system during daytime

And here are the missing benchmarks on the finalized array in a normal state. Once again, this is an Areca ARC-1883ix-12 with 12 × HGST Ultrastar 7k6000 6TB SAS disks in RAID-6 at an aligned stripe block size of 64kiB. The controller is equipped with 2GiB of FBM-backed Reg. ECC DDR-III/1866 write-back cache, and each individual drive features 128MiB of write-through cache (I have no UPS unit for this machine, which is why the drive caches themselves aren’t configured for write-back). The controller is configured to read & discard parity data to reduce seeks and is thus tuned for maximum sequential read performance. The benchmarking software was HDTune 2.55 as well as HDTune Pro 5.00:

With those modern Ultrastars instead of the old Seagate Cheetah 15k drives, the only thing that turned out worse is the seek time. Given that it’s 3.5″ 7200rpm platters versus 2.5″ 15000rpm platters, that’s only natural though. Sequential throughput is a different story: At large enough block sizes we get more than 1GiB/s almost consistently, for both reads and writes. Again, I’d have loved to try 4k writes as well, but HDTune Pro would just crash when picking that block size, same as with the Cheetah drives. Anyhow, 4k performance is nice as well. I’d give you some AS SSD numbers, but it fails to even see the array at all.

What I’ve seen in some other reviews holds true here too: the Ultrastars do seem to fluctuate a bit when it comes to performance. We can see that for the 64kiB reads as well as the 512kiB and 1MiB writes. On average though, raw read and write performance is absolutely stellar, just like the ATTO, HDTach and Everest/Aida64 tests have suggested before. That IBM 1.2GHz [PowerPC 476] dual-core chip is truly a monster in comparison to what I’ve seen on older RAID-6 controllers.

I’ve compared this to my old 3ware 9650SE-8LPML (AMCC [PowerPC 405CR] @ 266MHz), to an Adaptec-built ICP Vortex 5085BR (Intel [XScale IOP333] @ 800MHz), both with 8 × 7200rpm SATA disks, and even to a Hewlett Packard MSA2312fc SAN with 12 × 15000rpm SAS Cheetahs (AMD [Turion 64 MT-32] 1800MHz). All of them are simply blown out of the water in every way thinkable: performance, manageability, and, if I were to consider the MSA2312fc as a serious contender as well (it isn’t exactly meant as a simple local block device), stability too. I can’t count how often those freaking management controllers crash on that thing and have to be rebooted via SSH…

So this thing has been up for about 4 weeks now. Still looking good so far…

Summer will be interesting, with some massive heat and all. We’ll see if that’ll trigger the temperature alarms of the HDD bays…

Dec 01 2015

Taranis RAID-6 logo

While there has been quite some trouble with the build of my new storage array, as you can see in the last [part 3½], everything seems to have been resolved now. As far as tests have shown, the instability issues with my drives have indeed been caused by older Y-cables used to feed all eight 4P Molex plugs of my Chieftec 2131SAS drive bays. This was necessary, as all plugs on the Corsair AX1200i power supply had been used up, partly to support the old RAID-6 array’s 8 × SATA power plugs as well.

To fix it, I just ripped out half of the Y-cables, more specifically those connected to the bays which showed trouble, and hooked the affected bays up to a dedicated ATX power supply. The no-name 400W PSU used for this wasn’t stable with zero load on the ATX cable however, so just shorting the green and grey wires on the ATX plug to switch it on didn’t work. That happens with a lot of ATX PSUs, so I hooked another ASUS P6T Deluxe up to it, which stabilized all voltage rails.

After that: a full encryption of the (aligned) GPT partition created on the device, rsync for 3 days, then a full diff for a bit more than 2 days, and yep, everything worked just as planned. All 10.5TiB of my data were synced over to the new array correctly and without any inconsistencies. After that, I ripped out the old array and did the cabling properly, and well – still no problems at all!

Taranis RAID-6 freshly filled with Data from the old Helios 10.9TiB Array

With everything having been copied over, that little blue triangle still has ways to go trying to eat up Taranis!

I do have to apologize for not giving you pictures of the 12 drives though; while completing everything, I was in too much of a rush to get it all done, so no ripping out of disks for photos. :( Besides some additional benchmarks, I can give you a few nightshots of the machine though. This is with my old 3ware 9650SE-8LPML card and all of its drives already removed. Everything has been cleaned one last time, the flash backup module reconnected to the Areca ARC-1883ix-12, the controller’s management interface hooked up to my LAN and made accessible via an SSH tunnel, and all status/error LED headers hooked up in the correct order.

For the first of these images, the error LEDs have been lit manually via Areca’s “identify enclosure” function applied to the whole SAS expander chip on the card:

The drive bays’ power LEDs are truly, insanely bright. The two red error LEDs that each bay has – one for fan failure, one for overheating – are off here. What you can see are the 12 drive bays’ activity and status LEDs as well as the machine’s power LED. The red system SSD LED and the three BD-RW drive LEDs are off. It’s still a nice christmas tree. ;)

The two side intakes, Noctua 120mm fans in this case, filtered by Silverstone ultra-fine dust filters, let some green light through. This wasn’t planned; it’s caused by the green LEDs of the GeForce GTX Titan Black inside. It’s quite dim though. The fans are life savers by the way, as they keep the Areca RAID controller’s dual-core 1.2GHz PowerPC 476 processor at temperatures <=70°C instead of something close to 90°C. The SAS expander chip sits at around 60°C with the board temperature at 38°C, and the flash backup module temperature is at ~40°C. All of this at an ambient testing temperature of 28°C after 4 hours of runtime. So that part’s perfectly fine.

The only problem is the drives, which can still reach temperatures as high as 49-53°C. While the trip temperature of the drives is 85°C, anything approaching 60°C should already be quite unhealthy. We’ll see how well that goes; hopefully it’ll be fine for them. My old 2TiB A7K2000 Ultrastars ran for what is probably a full accumulated year at ~45°C without issues. Hm…

In any case, some more benchmarks:

Taranis RAID-6 running ATTO disk benchmark v2.47

The Taranis RAID-6 running ATTO disk benchmark v2.47, 12 × Ultrastar 7K6000 SAS @ ARC-1883ix-12 in RAID-6, results are kiB/s



In contrast to some really nice theoretical results, practical tests with [dd] and [mkvextract+mkvmerge] show that the transfer rate on the final, encrypted and formatted volume sits somewhere between 500-1000MiB/s for very large sequential transfers with large block sizes, which is what I’m interested in. While the performance loss seems significant when taking the proper partition-to-stripe-width alignment and the multi-threaded, AES-NI-boosted encryption into account, it’s still nothing to be ashamed of at all. In the end, this is faster than the old array by several factors; that one delivered roughly 200-250MiB/s, or rather less towards the end, with severe fragmentation beginning to hurt the file system significantly.
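As a side note on that alignment: the number the partition offset has to be a multiple of is the full stripe width, which for this array works out as follows (RAID-6 costs two drives’ worth of parity per stripe):

```shell
# Full stripe width of a 12-drive RAID-6 with a 64 kiB stripe block:
# data drives per stripe times stripe block size.
drives=12; parity=2; stripe_kib=64
echo "$(( (drives - parity) * stripe_kib )) kiB full stripe width"
```

So the partition should start on a 640 kiB boundary.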

Ah yes, one more thing that might be interesting: Power consumption of the final system! To measure this, I’m gonna rely on the built-in monitoring and management system of my Corsair AX1200i power supply again. But first, a list of the devices hooked up to the PSU:

  • ASUS P6T Deluxe mainboard, X58 Tylersburg chipset
  • 3 × 8 = 24GB DDR-III/1066 CL8 SDRAM (currently for testing, would otherwise be 48GB)
  • Intel Xeon X5690 3.46GHz hexcore processor, not overclocked, idle during testing
  • nVidia GeForce GTX Titan Black, power target at 106%, not overclocked, idle during testing
  • Areca ARC-1883ix-12 controller + ARC-1883-CAP flash backup module
  • Auzentech X-Fi Prelude 7.1
  • 1 × Intel 320 SSD 600GB, idle during testing
  • 3 × LG HL-DT-ST BH16NS40 BD-RW drives, idle during testing
  • 1 × Teac FD-CR8 combo drive (card reader + FDD), idle during testing
  • 12 × Hitachi Global Storage Ultrastar 7K6000 6TB SAS/12Gbps, sequential transfer during testing
  • 4 × Chieftec 2131SAS HDD bays
  • 2 × Noctua NF-A15 140mm fans
  • 2 × Noctua NF-A14 PWM 140mm fans
  • 3 × Noctua NF-F12 PWM 120mm fans
  • 4 × Noctua NF-A8 FLX 80mm fans (in the drive bays)
  • 1 × Noctua NF-A4x10 40mm fan
  • 1 × unspecified 140mm PWM fan in the power supply

Full system load with the new Taranis RAID-6 array

Full system load with the new Taranis RAID-6 array

So we’re still under the 300W mark, which I had originally expected to be cracked, since the old system was in the same ballpark when it comes to power consumption. But then, the old system had an overclocked i7 980X instead of this seriously cool-running Xeon (it has a low VID; it runs cooler even on stock settings).

Now all that’s missing is the adaptation of my old scripts checking the RAID controller and drive status periodically. For this I was originally using 3ware’s tw_cli tool and the SmartMonTools. I’ll continue to use the SmartMonTools of course, as they’ve been adapted to make use of Areca’s API as well, thus being able to fetch S.M.A.R.T. data from all individual drives in the array. The tw_cli part will have to be replaced with Areca’s own command line tool though, including a lot of post-processing with Perl to publish this in a nice HTML form again. When it’s done, the stats will be reachable [here].

Depending on how extremely my laziness and my severe Anime addiction bog me down, this may take a few days. Or weeks. :roll:

Edit: Ah, actually, I was motivated enough to do it, cost me several hours, inflicted quite some pain due to the weirdness of Microsoft Batch, but it’s done, the RAID-6 web status reporting script is back online! More (including the source code) in [part 4½]!

Nov 20 2015

Taranis RAID-6 logo

Yeah, after [part 3] it should be “part 4”, the final stage. However, while I’d love to present my final ~55TiB RAID-6 to you, I cannot do so yet, because there were and probably still are some severe issues with the setup, which I will talk about below. So: since my level of trust in Seagate is rather low, because of the failure rates reported by Backblaze, my own experiences at work, and the experiences of some other administrators I know, their line of enterprise disks was out of the game. Another option would’ve been Hitachi’s helium-filled Ultrastar He8, but since the He6 was reportedly rather disastrous, I don’t really want to trust those drives either.

This helium stuff is just so new and daring that I don’t want to trust it as the very base of a RAID array that’s supposed to last for many, many years just yet.

Ultimately, I decided to get myself 12 insanely expensive Hitachi Ultrastar 7K6000 disks, “The last in Air”, as they call them themselves. That’s a classic air-filled 5-platter, 10-head enterprise disk with a 7200rpm rotational speed and 6TB of capacity. I got the SAS/12Gbps version, which also boasts 128MiB of cache. Mechanically that’s all the same old tech that I’ve already been using with my 8 × 1TB Deskstars and now 8 × 2TB Ultrastars, so it’s something I can trust. However, as I said, there were/are some very serious issues. Maybe you remember this image:

"Helios" RAID-6 array emergency migration

Old array to the left…

So my old RAID-6 based on a 3ware 9650SE-8LPML with 8 × 2TB Ultrastars is sitting on the table, while the new one has been plugged into the Chieftec 2131SAS bays and hooked up to the Areca ARC-1883ix-12. Both RAID systems are thus connected to the same host machine at the same time, making it a total of 20 drives. This is supposed to make data migration using rsync very convenient and easy.

The problem is that I didn’t have enough power connectors for this (12 × SATA for the old array, ODDs and SSD, 8 × 4P Molex for the SAS bays), so I settled for Y-adapters to hook up the new array. Then the trouble started. At first I thought the passive SAS bays were to blame. But as I continued my tests, drives would behave slightly differently as I exchanged and rotated the Y cables. What I observed was some weird “jitter”, where the drive heads were audibly moving around where they shouldn’t have, and sometimes drives would stall for a moment as well.

Ultimately, the array ran into a massive failure during init at about 60%, and 4-5 drives successively failed, collecting tons of recoverable read AND write errors in their S.M.A.R.T. logs. Bleh… At least no unrecoverable ones, but still…

At this point I ripped out half of the Y cables and hooked two of the four bays up to a dedicated power supply (only two, because of a lack of plugs). It seems this greatly changed the behavior of the whole setup, stabilizing it significantly. Of course it’s too early to say anything for sure, because now I’m just at roughly 25% through the second initialization process. But if I’m right, then a few 1€ parts have successfully wrecked a ~8000€ RAID array, now that’s something, eh?

In any case, before getting my Ultrastars I also tried the system with some Seagate Cheetah 15k.6 and 15k.7 drives I managed to borrow at work, 300GB 15000rpm SAS pieces, just for some benchmarks. Since those showed even more severe problems than the Hitachis (probably because they’re more power-hungry?), I went down to 11, then 8 drives. Some of the benches will also show sudden stalls. Yeah. That’s the power issue.

Well, it can still serve as a quick glance at the performance levels one can expect with the Areca ARC-1883ix-12, even in such a state. Let me just say: It is a nice feeling to see a RAID array based on mechanical drives push 1000-1200MiB/s over the bus on average, reading at 64kiB-1MiB block sizes. At least that part is undeniably awesome! Here are a few screenshots for you, RAID stripe block sizes are always 64kiB, read block sizes are 4kiB, 64kiB and 1MiB, write block sizes are 64kiB, 512kiB and 1MiB. For the RAID-6 setup there are also benches during init and in 2-disk degraded mode, software’s just a cheap HDTune 2.55 + HDTune Pro 5.00 for now.

Ah yes, you might be wondering why the CPU usage is so high. Well, these were just quick preliminary tests anyway, so some video transcoders were running in the background at the same time, that’s why. Here we go:

RAID-0, 8 × 15000rpm Cheetahs, reads:

RAID-0, 8 × 15000rpm Cheetahs, writes:

RAID-6, 11 × 15000rpm Cheetahs, reads in normal state:

RAID-6, 11 × 15000rpm Cheetahs, reads during initialization:

RAID-6, 11 × 15000rpm Cheetahs, reads in 2-disk degraded mode:

The performance degradation due to the initialization process is somewhat in line with what’s configured on the controller itself, giving the background process a low 20% priority. The degradation in 2-disk degraded mode is what’s really interesting though. Here we can see that the 1.2GHz dual-core PowerPC RAID engine is seriously powerful. With double parity computation required on the fly, the array still delivers 64kiB transfer rates in excess of 800MiB/s! That’s insane! I was hoping for normal transfer rates over 600MiB/s, but this really makes one’s mouth water!

Of course, all of this is still preliminary; my array still doesn’t work, these aren’t the final drives running through the tests, nor is the controller fully configured yet. Let’s just hope that I can get a grip on the situation soon… because all these problems are seriously pissing me off already, as you may be able to understand, given the price of the hardware and the pressing issue that I’m running out of space on my old array.

Well, let’s hope a real “part 4” can follow soon!

Edit: And finally, [here it is]!

Jun 05 2015

Taranis RAID-6 logoAfter [part 1] we now get to the second part of the Taranis RAID-6 array and its host machine. This time we’ll focus on the Areca controller fan modification (or as I say, “Noctuafication”), the real power supply instead of the dead mockup shown before, plus a modification of it (not the electronics, I won’t touch PSU electronics!) and the new CPU cooler, which has been designed by a company that sits in my home country, Austria. It’s Noctua’s most massive CPU cooler produced to date, the NH-D15. Also, we’ll see some new filters applied to the side part of the case, and we’ll take a look at the cable management, which was a job much nastier than it looks.

Now, let’s get to it and have a look at what was done to the Areca controller:

So as you can see above, the stock heatsink & fan unit was removed. The reason is that it emits a very high-pitched, loud noise, which just doesn’t fit into the new machine, which produces more of a low-pitched “wind” sound. In my old box, which features a total of 19 40×40mm fans, you wouldn’t hear the card, but now it’s becoming a disturbance.

Note that when doing this, the Areca’s fan alarm needs to be disabled. What the controller does, due to the lack of an rpm signal cable, is to estimate the fan’s “speed” by measuring its power consumption. Now the original fan is a 12V DC 0.09A unit, whereas the Noctua only draws 0.06A, thus triggering the controller’s audible alarm. In my case that’s not so troublesome: even if it were to fail (which is highly unlikely for a Noctua in its first 10 years or so), there are still the two 120mm side fans.

Cooling efficiency is slightly lower now, with the temperature of the dual-core 1.2GHz PowerPC 476FP CPU going from ~60°C to ~65°C, but that’s still very much ok. The noise? Pretty much gone!

Now, to the continued build:

So there it is, although not yet with final hardware all around. In any case, even with all that storage goodness sitting in there, the massive Noctua NH-D15 simply steals the show here. That Xeon X5690 will most definitely never encounter any thermal issues! And while the NH-D15 doesn’t come with any S1366 mounting kit, Noctua will send you one NM-I3 for free, if you email them your mainboard or CPU receipt as well as the NH-D15 receipt to prove you own the hardware. Pretty nice!

In total we can see that cooler, the ASUS P6T Deluxe mainboard, the 6GB RAM that are just there for testing, the Areca ARC-1883ix-12, a Creative Soundblaster X-Fi XtremeMusic, and one of my old EVGA GTX580 3GB Classified cards. On the top right of the first shot you can also spot the slightly misaligned Areca flash backup module again.

While all my previous machines were in absolute chaos, I wanted to have this ONE clean build in my life, so there it is. For what’s inside in terms of cables, very little can be seen really. Considering 12 SAS lanes, 4 SATA cables, tons of power cables and extensions, USB+FW cables, fan cables, an FDD cable, 12 LED cathode traces bundled into 4 cables for the RAID status/error LEDs and I don’t know what else. Also, all the internal headers are used up. 4 × USB for the front panel, one for the combo drives’ card reader and one for the Corsair Link USB dongle of the power supply, plus an additional mini-Firewire connector at the rear.

Talking about the cabling, I found it nearly impossible to even close the rear lid of the tower, because the Great Cthulhu was literally sitting back there. It may not look like it, but it took me many hours to get it under some control:

Cable chaos under control

That’s a ton of cables. The thingy in the lower right is a Corsair Link dongle bridging the PSU’s I²C header to USBXPress, so you can monitor the power supply in MS Windows.

Now it can be closed without much force at least! Lots of self-adhesive cable clips and some pads were used here, but that’s just necessary to tie everything down, otherwise it just won’t work at all. Two fan cables and resistors are sitting there unused, as the fans were re-routed to the mainboard headers instead, but everything else you can see here is actually necessary and in use.

Now, let’s talk about the power supply. You may have noticed it already in the pictures above, but this Corsair AX1200i doesn’t look like it should. Indeed, as said, I modified it with an unneeded fan grill I took out of the top of the Lian Li case. The reason is that this way you can’t accidentally drop any screws into the PSU when working on the machine, and that can happen very quickly. If you miss just one, you’re in for one nasty surprise when turning the machine on! Thanks fly out to [CryptonNite]German flag, who gave me that idea. Of course you could just turn the PSU around and let it suck in air from the floor (the Lian Li PC-A79B supports this), but I don’t want to have to tend to the bottom dust filter all the time. So here’s what it looks like:


A modified Corsair Professional Series Platinum AX1200i. Screws are no danger anymore!

With 150W of power at +5V, this unit should also be good enough to drive all that HDD electronics. Many powerful PSUs largely ignore that rail and only deliver a lot at +12V for CPUs, graphics cards etc. Fact is, for hard drives you still need a considerable amount of 5V power! Looking at Seagate’s detailed specifications for some of the newer enterprise drives, you can see a peak current of 1.45A @ 5V in a random write scenario, which means 1.45A × 5V = 7.25W per disk, or 12 × 7.25W = 87W total for 12 drives. That, plus USB requiring +5V and some other stuff. So with 150W I should be good. Exactly the power that my beloved old Tagan PipeRock 1300W PSU also provided on that rail.
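The back-of-the-envelope math above, as a quick sanity check one can run anywhere:

```shell
# +5V budget estimate for 12 drives, using the 1.45A @ 5V peak figure
# quoted by Seagate for random writes (numbers from the text above).
awk 'BEGIN {
  per_disk = 1.45 * 5      # peak watts per disk on the +5V rail
  total    = per_disk * 12 # all 12 drives at once (worst case)
  printf "%.2f W per disk, %.0f W for 12 drives (of a 150 W +5V budget)\n",
         per_disk, total
}'
```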

Now, as for the side panels:

And one more, an idea I got from an old friend of mine, [Umlüx]Austrian Flag. Since I might end up with a low pressure case with more air being blown out rather than sucked in, dust may also enter through every other unobstructed hole, and I can’t have that! So we shut everything out using duct tape and paper inlets (a part of which you have maybe seen on the power supply closeup already):


The white parts are all paper with duct tape behind it. The paper is there so that the sticky side of the tape doesn’t attract dust, which would give the rear a very ugly look otherwise. As you can see, everything is shut tight, even the holes of the controller card. No entry for dust here!

That’s it for now, and probably for a longer time. The next thing is really going to be the disks, and since I’m going for 6TB 4Kn enterprise disks, it’s going to be terribly expensive. And the price is not the only problem.

First, there’s the currently weak Euro, which is starting to hurt disk drive imports, and then there is this crazy storage “tax” (a literal translation would be “blank media compensation”) that we’re getting in October after years of debate about it in my country. The tax is basically supposed to cover the monetary loss of artists due to legal private recordings from radio or TV stations to storage media. The tax will affect every device that features any kind of storage, whether mechanical/magnetic, optical or flash. That means individual disks, SSDs, blank DVDs/BDs, USB pendrives, laptops, desktop machines, cellphones and tablets, pretty much everything. Including enterprise class SAS drives.

Yeah, talk about some crazy and stupid “punish everybody across the board for what only a few do”! Thanks fly out to the Austro Mechana (“AUME”, something like “GEMA” in Germany) and their fat-ass friends for that. Collecting societies… legal, institutionalized, large-scale crime if you ask me.

But that means that I’m between a rock and a hard place. What I need to do is find the sweet spot between the idiot tax and the Euro’s exchange rate, taking the natural price decline into account as well. So it’s going to be very hard to pick the right time to buy those drives. And as I am unwilling to step down to much cheaper 512e consumer – or god forbid shingled magnetic recording – drives with read error rates as high as <1 in 10^14 bits, we’re talking ~6000€ here at current prices. Since it’s 12 drives, even a small price drop will already have a great effect.

We’ll see whether I’ll manage to make a good decision on that front. Also, space on my current array is getting less and less by the week, which is yet another thing I need to keep my eyes on.

Edit: [Part 3 is now ready]!

May 28 2015

Taranis RAID-6 logoToday’s post shall be about storage. My new storage array actually. I wanted to make this post episodic, with multiple small posts that make sort of a build log, but since I’m so damn lazy, I never did that. So by now, I have quite some material piled up, which you’re all getting in one shot here. This is still not finished however, so don’t expect any benchmarks or even disks – yet! Some parts will be published in the near future, in the episodic manner I had actually intended to go for. So…

I’ve been into parity RAID (redundant array of independent/inexpensive disks) since the days of PATA/IDE with the Promise Supertrak SX6000, which I got in the beginning of 2003. At first with six 120GB Western Digital disks in RAID-5 (~558GiB of usable capacity), then upgraded to six 300GB Maxtor MaxLine II disks (~1.4TiB, the first to break the TiB barrier for me). It was very stable, but so horribly slow and fragmented in the end that playback of larger video files (think HDTV; Blu-rays were hitting the market around that time) became impossible, and the space was once again filled up by the end of 2005 anyway.

2006, that was when I got the controller I’m still using today, the 3ware 9650SE-8LPML. Typically, I’d say that each upgrade has to give me double capacity at the very least. Below that I wouldn’t even bother with replacing either disks or a whole subsystem, given the significant costs. The gain has to be large enough to make it worthwhile.

The 3ware had its disks upgraded once too, going from a RAID-6 array consisting of 8×1TB Hitachi Deskstars (~5.45TiB usable) to 8×2TB Hitachi Ultrastars (~10.91TiB usable), which is where I’m sitting at right now. All of this – my whole workstation – is installed in an ancient EYE-2020 server tower from the 90s, which so far has housed everything starting from my old Pentium II 300MHz with a Voodoo² SLI setup all the way up to my current Core i7 980X hexcore with a nVidia SLI subsystem. Talk about some long-lasting hardware right there. So here’s what the “Helios” RAID-6 array and that ugly piece of steel look like today, and please forgive me for not providing any pictures of the actual RAID controller or its battery backup unit, I don’t have any nice photos of them, so I have to point you to some web search regarding the 3ware 9650SE-8LPML, as always, please CTRL+click to enlarge:

As you can see, that makes 16 × 40mm fans. It’s not like server-class super noisy, but it for sure ain’t silent either. It’s quite amazing that the Y.S. Tech fans in there have survived running 24/7 from 2003 to 2015, that’s a whopping 12 years! They are noisier now, and every few weeks one of the bearings would go to saw-blade mode for a brief moment, but what can you expect. None have died so far, so that’s a win in my book for any consumer hardware (which the HDCS was).

Thing is, I have two of those 3ware RAID controllers now, but each one has issues. One wouldn’t properly synchronize on the PCIe bus, negotiating only a single PCIe lane, and that thing is PCIe v1.1 even, which means a 250MiB/s limit in that crippled mode. The second one syncs properly, but has a more pressing issue: whenever there are sharp environmental temperature changes (opening the window for 5 minutes when it’s cool outside is enough), the controller randomly starts dropping drives from the array. It took me a LONG time to figure that out, as you probably can imagine. Must be some bad soldering spots on the board or something, but I couldn’t really identify any.

Plus, capacity is running out again. Now, the latest 3ware firmware would enable me to upgrade this to at least 8 × 6TB, but with 4K video coming up and with my desire to build something very long-lasting, I decided to retire “Helios”. Ah, yes. The name…

Consider me as being childish here, but naming is something very important for me, when it comes to machines and disks or arrays. ;) I had decided to name each array once per controller. For disk upgrades, it simply gets a new number. So there was the IDE one, “Polaris”. Then “Polaris 2”, then “Helios” and “Helios 2”.

The next one shall be called “Taranis”, named after an iconic vessel a player could fly in the game [EVE Online], and its own namesake, an ancient Celtic [god of thunder].

Supposedly, a famous Taranis pilot once said this:

“The taranis is a ship for angry men or people who prefer to deal in absolutes. None of that cissy boy, ‘we danced around a bit, shot some ammo then ran away LOL’, or, ‘I couldn’t break his tank so I left’, crap. It goes like this:

You fly Taranis. A fight starts. Someone dies.”

I flew on the wing of a Taranis pilot for only one single time. A lot of people died that night, including our entire wing! ;)

In any case, I wanted to 1up this a bit. From certain enterprise storage solutions I of course knew the concept of hot-swapping and more importantly error reporting LEDs on the front of a storage enclosure. Since that’s extremely useful, I wanted both for my new array in a DIY way. I also wanted to get rid of the Antec HDCS, which had served me for 12 years now, and ultimately also semi-retire my case, after understanding that it was just too cramped for this. A case that had served me for 17 years, 24/7.

Holy shit. That’s a long time!

So I had to come up with a good solution. The first part was: I needed hot-swap bays that could do error reporting in a way supported by at least some RAID controllers. I found only ONE aftermarket bay that would fully satisfy my requirements. The controller could come later, I would just pick it from a pool of controllers supporting the error LEDs of the cages.

It was the Chieftec SST-2131SAS ([link 1], [link 2]), the oldest of Chieftec’s SAS/SATA bays. It had to be the old one, because the newer TLB and CBP series no longer have any hard disk error reporting capability built in for whatever reason, and on top of that, the older SST series shows much less plastic, just steel and what I think is magnesium alloy, and feels awesome:

So there is no fancy digital I²C bus for error reporting on the bays, just some plain LED connectors. These require the whole system to share a common electrical ground to close the circuit, as we only get cathode pins. I got myself four such bays, which makes for a total of 12 possible drives. As you may already be guessing, I’m going for more than just twice the capacity on this one.

For a fast, well-maintainable controller, I went for the Areca [ARC-1883ix-12], which was released just at the end of 2014. It supports both I²C as well as the old “just an error LED” solution my bays have, pretty nice!

Areca (and I can confirm this first-hand) is very well known for their excellent support, which means a lot of points have to go to them for that. Sure, the Taiwanese Areca guys don’t speak perfect English, but given their technical competence, I can easily overlook that. And then they support a ton of operating systems, including XP x64, even after its [supposed] demise (the system shall run with a mirror of my current XP x64 setup at first, and either some Linux or FreeBSD UNIX later). This thing comes with a dual-core ROC (RAID-on-Chip) running at 1.2GHz, +20% compared to its predecessor. Plus, you get 2GiB of cache, which is Reg. ECC DDR-III/1866. Let’s just show you a few pictures before going into detail:

So there are several things to notice here:

  1. It’s got an always-full-power fan and a big cooler, so it’s not going to run cool. Like, ever.
  2. It requires PCIe power! Why? Because all non-PEG devices sucking more than 35W have to, by PCIe specification. This one eats up to 37.2W (PEG meaning the “PCI Express Graphics” device class, graphics cards get 75W from the slot itself).
  3. It has Ethernet. Why? Because you need no management software. The management software runs completely *ON* the card itself!

The really interesting part of course is the Ethernet plug. In essence, the card runs a complete embedded operating system, including a web server to enable the administrator to manage it in an out-of-band way.

That means that a.) it can be managed on all operating systems even without a driver and b.) it can even be managed when the host operating system has crashed fatally, or when the machine sits in the system BIOS or in DOS. Awesome!

Ok, but then, there is heat. The system mockup build I’m going to show you farther below was still built with the “let’s plug it in the top PCIe x4 slot” idea in mind. That would include my EVGA GeForce GTX580 3GB Classified Ultra SLI system still being there, meaning that the controller would have to sit right above an extremely hot GPU.

By now, I’ve abandoned this idea for a thermally more viable solution, replacing the SLI with a GeForce GTX Titan Black I got for an acceptable price. In the former setup, the controller’s many thermal probes reported temperatures of up to 90°C during testing, and that’s without the GPUs even doing much, so yeah.

But before we get to the mockup system build, there is one more thing, and that’s the write cache backup for the RAID controller in case of power failures. Typically, Lithium-Ion batteries are used for that, but I’m already a bit fed up with my 3ware batteries going belly-up every 2 years. So I wanted to ditch that. There are such battery backup units (“BBUs”) for the Areca, but it may also be combined with a so-called flash backup module (“FBM”). Typically, a BBU would keep the DRAM and its write cache alive on the controller during power outages for maybe 24-48 hours, waiting for the main AC power to return. Then, the controller would flush the cached data to the disks to retain a consistent state.

An FBM does it differently: it uses capacitors instead, plus a small on-board SSD. It keeps the memory alive for just seconds, just long enough to copy the data off the DRAM and onto its local SSD. Then it powers off entirely. The data gets fetched back after any arbitrary amount of downtime upon power-up of the system, and flushed out to the RAID disks. The hope here is that the supercapacitors used by such modules can survive for much longer than the Li-Ion batteries.

There is one additional issue though: Capacity (both in terms of electrical capacitance and SSD capacity) is limited by price and physical dimensions. So the FBM can only cover 2GiB of cache, but not the larger sizes of 4GiB or 8GiB.

That’s where Areca support came into play, readily helping you with any pre-purchase question. I talked to a guy there, and described my workload profile to him, which boils down to highly sequential I/O with relatively few parallel streams (~40% read + ~60% write), and very little random R/W. He told me that based on that use case, more cache doesn’t make sense, as that’d be useful only for highly random I/O profiles with a very high workload and high parallelism. Think busy web servers or mail servers. But for me, 4GiB or the maximum of 8GiB of cache wouldn’t do more than what the stock 2GiB does.

As such, I forgot about the cache upgrade idea and went with the flash backup module instead of a conventional BBU. That FBM is called the ARC-1883-CAP:

So, let’s put all we have for now together, and look at some build pictures:

Let me tell you one thing: yes, the Lian Li PC-A79B is nice, because it’s so manageable. The floors in the HDD cages can even be removed, so that any HDD bay can fit, with no metal noses in the way in the wrong places. It’s deep, long and generally reasonably spacious.

But – there is always a but – when you’re coming from an ancient steel monster like I did, the aluminium just feels like thin paper or maybe tin foil. The EYE-2020 could carry the weight of a whole man standing on top of it. With an aluminium tower, however, you’ll have to be careful not to bend anything when just pulling out the mainboard tray. The HDD cage feels as if you could very easily rip it out entirely with just one hand.

Aluminium is really soft and weak for a case material, so that’s a big minus. But I can have a ton of drives, a much better cooling concept and a much, much, MUCH cleaner setup, hiding a lot of cables from the viewer and leaving room for air to move around. Because that part was already quite terrible in my old EYE.

Please note that the above pictures do not show the actual system as it’s supposed to look like in the end though. The RAID controller already moved one slot downwards, away from the 4 PCIe lanes coming from the ICH10R (“southbridge”), which in turn is connected to the IOH (“northbridge”) only via a 2GiB/s DMI v1 bus. So it went down one slot, onto the PCIe/PEG x16 slot which is connected to the X58 chipset’s IOH directly. This should take care of any potential bandwidth problems: the ICH10R also has to route all my USB 2.0 ports, the LAN ports, all Intel SATA ports (including my system SSD and the BD drives), one Marvell eSATA controller and one Marvell SAS controller to the IOH, and with it ultimately to the CPU & RAM, all via a bus that might’ve become a bit overcrowded when using a lot of those subsystems at once.

Also, this tiny Intel cooler isn’t gonna stay there, it just came “for free” with the second ASUS P6T Deluxe I bought, together with a Core i7 930. Well, as a matter of fact, that board… umm… let’s just say it had a little accident and had to be replaced *again*, but that’s a story for the next episode. ;) A Noctua NH-D15 monster and the free S1366 mounting kit that Noctua sends you if you need one, plus a proper power supply all have already arrived, so there might be a new post soon enough, with even more Noctuafication also being on the way! Well, as soon as I get out of my chair to actually get something done at least. ;)

And for those asking the obvious question “what drives are you gonna buy for this?”, the answer to that (or at least the current plan) is either the 6TB Seagate Enterprise Capacity 3.5 in their 4Kn version, the [ST6000NM0014], or the 6TB Hitachi Ultrastar 7K6000, also in their 4Kn version, that’d be the [HUS726060AL4210]. Given that I want drives with a read error rate of <1 error in 10^15 bits read instead of <1 in 10^14, as it is for consumer drives, those would be my primary drives of choice. Seagate’s cheap [SMR] (shingled magnetic recording) disks are completely unacceptable for me anyway, and from what I’ve heard so far, I can’t really trust Hitachi’s helium technology to be reliable either, so it all boils down to 6TB enterprise class drives with conventional air filling for now. That’s if there aren’t any dramatic changes in the next few months of course.
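To put those error rate specs into perspective, here’s a rough illustration (simplified in that it treats the spec’s upper bound as an actual rate) of how many unrecoverable read errors one would statistically expect when reading a 6TB drive end to end, as happens during a rebuild:

```shell
# Expected unrecoverable read errors per full read of a 6TB drive,
# comparing consumer (1 in 10^14 bits) vs. enterprise (1 in 10^15) specs.
awk 'BEGIN {
  bits = 6.0e12 * 8   # 6TB drive capacity in bits
  printf "consumer   (1e-14): %.2f expected errors per full read\n", bits * 1e-14
  printf "enterprise (1e-15): %.3f expected errors per full read\n", bits * 1e-15
}'
```

With ten such drives’ worth of data to read back during a double-degraded rebuild, the consumer-class figure gets uncomfortably close to “expect at least one error”, which is exactly why the 10^15-class drives are worth the premium here.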

Those disks are all non-encrypting drives by the way, as encryption will likely be handled by the Areca controller’s own AES256 ASIC and/or Truecrypt or Veracrypt.

Ah, I almost forgot, I’m not even done here yet. As I may get a low-air-pressure system in the end, with less air intake than exhaust, potentially sucking dust in everywhere, I’m going to filter or block dust wherever I possibly can. And the one big minus for the Chieftec bays is that they have no dust filters. And the machine sits in an environment with quite a lot of dust, so every hole has to be filtered or blocked, especially those that air gets sucked through directly, like the HDD bays.

For that I got myself a large 1×1 meter stainless steel filter roll off eBay. This filter has a tiny 0.2mm mesh aperture and 0.12mm wire diameter, so it’s very, very fine. I think it was originally meant to filter water rather than air, but that doesn’t mean it can’t do the job. With that, I could get those bays properly modified. I don’t want them to become dust containers eventually, after all.

See here:


Steel filter with 0.2mm mesh aperture, coins for size comparison (10 Austrian shillings and 1 Euro).

I went for steel to have something easy enough to work with, yet still stable. Now, it took me an entire week to get this done properly, and that’s because it’s some really nasty work. First, let’s look at one of the trays that need filtering, so you can see why it’s troublesome:

So as you can see, I had to cut out many tiny pieces, that would then be glued into the tray front from the inside, for function as well as neat looks. This took more than ten man-hours for all 4 bays (12 trays), believe it or not. This is what it looks like:

Now that still leaves the other hexagonal holes in the bay frame, that air may get sucked through and into the bays inside. Naturally, we’ll have to handle them as well:

And here is our final product, I gotta say, it looks reaaal nice! And all you’d have to do every now and then is to go over the front with your vacuum cleaner, and you’re done:


A completed SST-2131, fully filtered by pure steel.

So yeah, that’s it for now, more to follow, including the new power supply, more dust filtering and blocking measures, all bays installed in the tower and so on and so forth…

Edit: [Part 2 is now ready]!

Feb 20 2015

Hard disk logoFor as long as we’ve had hard drives, we’ve been trying to make them larger and larger, just like any other data storage medium. I believe that for mechanical disks, no single parameter is more significant than sheer size. Recently Seagate blessed us with its first SMR (“shingled magnetic recording”) hard drive, a technology that once again brings more capacity, but with a few drawbacks. Drawbacks significant enough to make me want to talk about SMR and its competing technologies today, and also about why SMR is here already, and the others are not.

1.) Steamroller tactics

As I said, we’ve always been trying to increase the storage space of disks. There are usually two ways of achieving this, one of which is particularly challenging. The easy method? Just cram in more platters. And with the platters, more read/write heads. This clearly has its limits of course. Traditionally, the maximum number of platters that could be operated safely next to each other in a regular half-height 3.5″ HDD was 5. Keep in mind that for a 7200rpm drive, we have 120 platter rotations within a single second! Smaller 2.5″ platters as used in certain enterprise disks or the WD Raptor drive can spin even faster, at 10,000rpm or 15,000rpm, which means roughly 167 and 250 rotations per second respectively.
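The rotation figures above, spelled out:

```shell
# Revolutions per second for the spindle speeds mentioned above.
awk 'BEGIN {
  split("7200 10000 15000", rpm, " ")
  for (i = 1; i <= 3; i++)
    printf "%5d rpm = %.1f revolutions per second\n", rpm[i], rpm[i] / 60
}'
```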

Rotational speeds this high mean there’s going to be a lot of air turbulence in there, and air is essential to keep the heads floating at very low altitudes (freaking nanometers, for Christ’s sake!) over the platters. Too much disturbance and you get instabilities and potentially fatal head crashes.


A Hitachi DeskStar 7K2000 2TB, 5-platter, 10-head disk (click to enlarge)

Recently, Hitachi Global Storage dared to replace air-based designs with low-density helium-filled drives, enabling them to pack an unbelievable 7 platters into a normal 3.5″ drive, accessed by 14 heads, all made possible by lower gas turbulence and resistance. This also enabled them to use lower-powered, lighter motors to spin the platters. Seagate struck back by using shingled magnetic recording for inexpensive disks and six platters for enterprise disks – despite the conventional air filling. The reason why Seagate hasn’t introduced SMR to the enterprise markets yet is said drawbacks. But there is no way we can keep packing more and more platters and heads into disks of the same volume – there is only so much space, and 7 platters is insane already.

2.) The easiest way out may not always be a smooth ride…

So what is SMR, and what made it appear on the markets well before its competing technologies, like HAMR (heat-assisted magnetic recording), MAMR (microwave-assisted magnetic recording) and BPMR (bit-patterned media recording)?

Basically, price made SMR happen.

But let’s just dive into the tech first.

As you may have heard, the term “shingled” is derived from roof shingles and the way they actually overlap when being put on a roof. That’s for gabled roofs of course, not flat ones. ;) Regular hard drives have sectors sitting in line on what we call a track. Then, there is a slight gap, and next to the track there is another track and so on. Like rings sitting on the disc. The sectors tend to be of the same physical size (in square area covered), which is why data can be read the fastest on the outermost parts of the platter – more sectors passing by the head for an equal amount of angular movement.

This is considered a waste of space though. Modern read heads can read much narrower tracks than what write heads are able to store safely on the disk’s magnetic film. Individual bits are stored using roughly [20-30 grains of magnetized material] on the disk’s film right now. So, ~20 grains with the same magnetic orientation, spread across a few nanometers. It seems read heads can cope with less though, so the industry (or academia?!) came up with this:

SMR structual comparison

A structural comparison: a normal disk to the left, and shingles to the right. As one can see, the shingled magnetic recording disk allows for packing more data into the same space, despite individual tracks being wider when written.

So, yeah. No gap anymore, no more wasted space, right? The gap was actually there because the edges of the tracks are not too well defined by classic write heads, so reading safely could be tricky when you pack them too close together. The write heads used on SMR disks aren’t any different technologically. They’re just wider, writing fatter tracks, which enables the head to write more well-defined track edges. Now the read head can pick up the narrower tracks even without any gaps between them. Thus, we can pack stuff tighter and still read it back without corruption. But… how do we modify written data?! Each track looks like it’s been partially overwritten by its fat-ass successors. Let’s see what would happen if we attempted to write “the regular way” to part of a filled-up shingled disk surface, as compared to a normal one:

Writing to an SMR surface is a problem

Writing to an SMR surface is a problem. Writes within the structure overwrite adjacent tracks, because write heads are wider than read heads to be able to create strong enough magnetic fields for writing.

On the normal drive (to the left) we just write. No problem. On the SMR disk we write. And… oops. It’s not actually a write, it’s a write+destroy. In the example density above, by writing three sectors in directly sequential order (maybe a small 1.5kiB file), we effectively overwrite six additional sectors “to the right”, because the write head is too wide. For writing 1.5kiB, we potentially corrupt another 3kiB. Maybe even more when using 4kiB sectors instead of 512-byte ones. The effective amount of destroyed data depends on how much overlap there actually is of course – but there will always be data loss!

So how do we rewrite?! Well, we do it like SSDs do in such a case – which is why SSDs also need the TRIM command and/or garbage collection algorithms. First, we need to read all affected data, which is basically everything down-cylinder from the write location (called “downstream” in SMR lingo). The reason is that if we rewrote just the six additional sectors right next to the affected ones, we’d lose the next six, and so on.


We read everything downstream into a cache memory, in as many cycles as there are tracks downstream, then we write everything back in just as many cycles, plus the original write. This is known in a less extreme form from solid-state disks, as a “read-modify-write” cycle:

Classic HDD writes vs. read-modify-write on SMR disks

Classic HDD writes vs. read-modify-write on SMR disks

So, what does that mean? Let’s sum up the operations:

Regular hard drive:

  • Wait for platter to rotate and seek head to first target sector in track
  • Write three sectors in direct succession

SMR hard drive:

  • Wait for platter to rotate and seek head to target track + 1
  • Read three sectors in direct succession, store in cache
  • Wait for platter to rotate and seek head to target track + 2
  • Read three sectors in direct succession, store in cache
  • Wait for platter to rotate and seek head to target track + n
  • Read three sectors in direct succession, store in cache
  • (Repeat until we hit end of medium* or band)
  • Seek head to target track
  • Write original three sectors
  • Wait for platter to rotate and seek head to target track + 1
  • Rewrite three previously stored sectors, recalled from cache
  • Wait for platter to rotate and seek head to target track + 2
  • Rewrite three previously stored sectors, recalled from cache
  • Wait for platter to rotate and seek head to target track + n
  • Rewrite three previously stored sectors, recalled from cache
  • (Repeat until we hit end of medium* or band)
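The operation list above can be condensed into a toy cost model (my own illustration, not actual firmware logic), counting sector transfers and seeks for a write at a given track when everything downstream must be read and rewritten:

```python
def rmw_cost(target_track: int, total_tracks: int, sectors: int = 3):
    """Toy model: a write to `target_track` forces a read and a rewrite
    of every track downstream of it (no banding assumed)."""
    downstream = total_tracks - target_track - 1
    reads = downstream * sectors          # sectors read into cache
    writes = (downstream + 1) * sectors   # original write + rewrites
    seeks = 2 * downstream + 1            # one seek per read/write pass
    return reads, writes, seeks

# A conventional drive needs 1 seek and 3 sector writes. On the toy
# SMR model, writing near the top of a 1000-track surface explodes:
print(rmw_cost(target_track=0, total_tracks=1000))    # (2997, 3000, 1999)
# ...while a write near the bottom stays cheap:
print(rmw_cost(target_track=998, total_tracks=1000))  # (3, 6, 3)
```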

*As you can see, this is crazy. If we write to the topmost track, we have to rewrite all the way downstream – this could be millions of sectors and seeks for a small file, affecting the entire radius of the platter! This is why SMR doesn’t completely do away with track gaps. It’s just that tracks are now grouped into bands of arbitrary size to limit the read-modify-write impact. Let’s have a look at two side-by-side 7-track-wide SMR “bands”, both being written to:

Shingled media - organized into bands

Shingled media – organized into bands 7 tracks wide

From this we can learn two things: Bands can mitigate the severity of the issue. Also, the amount of work depends on where within a band we write. The farther downstream, the smaller the latency hit we’ll have to endure, and the fewer seeks and the less write overhead we’ll have. Bands can’t be too wide, as write performance would deteriorate too much. Bands can’t be too narrow either, or we’d lose too much of the density advantage because more band gaps would use up platter real estate.

Let’s look at an overview regarding band width efficiency:

SMR band width efficiency

SMR band width efficiency[1] (click to enlarge)

I won’t go into all the detail about what this means, but the part about reserved non-shingled storage on the disk is pretty much unusable in today’s scenarios I believe. So, please pay attention to the green lines, where f=0. The number r is simply the track width in nanometers. From the graph we can learn that the sweet spot for the number of tracks per band is maybe around 10 to 25 or so. Beyond that we don’t gain much by saving on band gaps, and below that, the increase in data density isn’t large enough.

This makes me think that Seagate went with a rather “low” band width for their current SMR drive (the [Seagate Archive HDD v2]), as the platter size increase was only +250GB, so from 1TB to 1.25TB for the first SMR generation, and then +333GB with 1.33TB platters in the final generation hitting the market. So they got to an areal density increase factor of just 1.33×, which may correspond to 6 tracks per band, maybe 8 or 10 depending on track width in nanometers (I do not have solid data on the track widths of any modern drives, especially not Seagate’s SMR disks). Some rumors are saying “5 – 10 tracks per band”, which does seem right considering my math.
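A crude way to see why the density gain saturates with band width is a toy pitch model. The track widths and gap below are made-up illustrative values (not Seagate’s real figures), chosen so that a 6-track band lands near the 1.33× factor mentioned above:

```python
def density_gain(tracks_per_band: int,
                 w: float = 64, r: float = 56, g: float = 16) -> float:
    """Toy areal-density gain: conventional tracks need pitch w+g each,
    while a shingled band of B tracks occupies (B-1)*r + w + g.
    w = written (wide) track, r = shingled (readable) pitch, g = band gap,
    all in nm -- purely illustrative values, not measured figures."""
    B = tracks_per_band
    return B * (w + g) / ((B - 1) * r + w + g)

# Gain grows quickly at first, then flattens towards (w+g)/r ~ 1.43x:
for B in (1, 2, 6, 10, 25, 100):
    print(f"{B:>3} tracks/band -> gain {density_gain(B):.2f}x")
```

With these assumed widths the curve flattens out above roughly 10-25 tracks per band, mirroring the sweet spot visible in the graph.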

Probably bad enough, but hey.

As said, SMR disks – like SSDs – do not expose their inner structure to the operating system as-is, as this would require new I/O schedulers, file systems, applications[2] and so on. Instead, they go for a fully firmware-abstracted[3] approach, presenting only a “normal hard drive” to the OS, just like any SSD would. All the nasty stuff happens on the drive itself, implemented 100% in the drive firmware.

File systems also need to be considered. A file system that fragments quickly will scatter larger files all across the disk, potentially across a multitude of bands. Rewriting such a file on a fragmented file system that can’t do [copy-on-write] will likely be painful even with firmware optimizations and cached/delayed writes in place. Exposing SMR to the file system would help a lot, but would also mean a lot of work on the file system developers’ side, and I just don’t see that happening outside of seriously expensive large-scale systems. Current file systems like FAT, exFAT, NTFS, ReFS, EXT2/3/4, XFS, btrfs, ZFS, UFS/FFS and so on simply don’t understand SMR bands. To my knowledge there is no file system that does. It’s likely going to be handled like 512e Advanced Format – all the magic happens below the fake layer the operating system is presented with.

2a.) Pricing

Now, in the beginning I said the main reason for SMR to appear right now is price. Thing is, with SMR you can still use the same platters, and mostly also the same heads as before, with just minor modifications like the wider write head. It’s much less of a radical hardware design challenge, and more of a data packing and organization solution that saves a lot of money. To give you an idea, let’s compare some actual prices from the EU region as of 2015-02-20 (only drives actually in stock), source: [Geizhals]. Drives compared roughly target the same market, or at least share as many properties as possible, like warranty periods, 24/7 qualification, URE ratings, etc.:

Some regular drives:

  • Western Digital Purple 6TB: 251,52€ (Price per GB: 4,19¢)
  • Western Digital Red 6TB: 263,75€ (Price per GB: 4,40¢)
  • Hitachi GST Deskstar NAS 6TB: 273,32€ (Price per GB: 4,56¢)
  • Hitachi GST Ultrastar He8 8TB: 660,95€ (Price per GB: 8,26¢. This is an enterprise helium drive with 5 years warranty and a URE rating of <1 bit read error in 10¹⁵ bits read, so it’s hardly comparable. It’s the only other 8TB drive available though, which is why I’d like to show it here too.)
  • Seagate Surveillance HDD 7200rpm 6TB: 395,14€ (Price per GB: 6,59¢)
  • Seagate NAS HDD 6TB: 394,44€ (Price per GB: 6,57¢)

SMR Drive:

  • Seagate Archive HDD v2 8TB: 259,–€ (Price per GB: 3,24¢)

So as you can see, the price per GB of the shingled disk is simply unmatched here. It has no direct competitors, and it’s much cheaper per GB than the competition’s 6TB disks! Heck, even its absolute price is lower in most cases. Seagate does state that the drive is meant for specific workloads only though. In essence, the optimal way to use it would be to treat it as a WORM (write once, read many) medium, as read performance is not impacted by SMR. But how bad is it really?
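The cent-per-GB figures above can be reproduced using decimal gigabytes (1TB = 1000GB):

```python
# Price per GB in euro cents for the drives listed above.
drives = {
    "WD Purple 6TB":             (251.52, 6000),
    "WD Red 6TB":                (263.75, 6000),
    "HGST Deskstar NAS 6TB":     (273.32, 6000),
    "HGST Ultrastar He8 8TB":    (660.95, 8000),
    "Seagate Archive HDD v2 8TB": (259.00, 8000),
}
for name, (price_eur, capacity_gb) in drives.items():
    cents_per_gb = price_eur / capacity_gb * 100
    print(f"{name:<27} {cents_per_gb:.2f} ct/GB")
```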

2b.) Actual performance numbers

Those are extremely hard to come by at this stage, as there are no real reviews yet. All I could find are some inconclusive tests [here]. Inconclusive, because no re-writes or overwrites were tested. So far all that can be said is that the disk seems to show more aggressive write caching, which does make some sense for grouping data together so that whole bands can be written at once. We’ll have to wait a bit longer for any in-depth analysis though.

If anything comes up, I will add the links here.

For now, let’s continue:

3.) SMR super-charged: 2-dimensional reading (TDMR)

I’ll be brief about this, as the basic ideas of 2D reading are “relatively” simple. Here we’re trying to make the tracks even narrower, up to a level where a single read head might run into issues when it comes to staying on track and maintaining data integrity while reading. The idea is to put two more read heads on the head assembly: one reading at a position shifted slightly towards track n-1, and one reading shifted to the opposite side, towards track n+1. Like this:

Multiple read heads

Multiple read heads for two-dimensional magnetic reading[4]

From the differences between the read heads’ acquired data, a more precise 2D map can be constructed, making it easier to decide what the actual data must be, and what’s just interference from nearby tracks. The end result of course being an even higher density and increased data integrity.

To save money, one could also retain the single read head setup and read the adjacent tracks in additional passes. Naturally, this would be much slower and possibly less safe.

TDMR readback

Two-dimensional magnetic readback[4]

From our current standpoint it is hard to tell how much more density can be gained by TDMR-enhanced SMR, and at what cost exactly. The basic problems of SMR aren’t solved by TDMR at all however, as the data is still organized in shingled bands. For regular drives, TDMR doesn’t make much sense, as the written tracks are more than wide enough for a single read head anyway. This could be considered useful for a second or third generation of SMR technology, if ever released. It may have other uses too however, see 5.) below.

4.) The future #1: Heat/microwave-assisted magnetic recording (HAMR/MAMR)

There are technologies in the making that aim to increase density by employing whole new approaches to reading and writing, without necessarily tampering with the data organization on the platters. HAMR and the lesser-known MAMR are two of them.

I mentioned before that current drives store a single bit by magnetizing about 20-30 individual grains on the actual surface of the disk. This number cannot easily be reduced any further for data integrity reasons, and the grain size can’t be reduced either, because of a magnetic limitation known as the super-paramagnetic wall[4]. In essence, you need a certain amount of energy to change the polarity of a bunch of grains. Due to this effect, the smaller your grains are, the more energy you must concentrate on an ever smaller spot to make a write operation work.

As the electromagnetic energy increases on extremely dense small-grain tracks, the field becomes too strong and will affect nearby tracks, so data might be lost. Also, the heads must shrink together with the tracks, and it becomes harder and harder – and eventually impossible – to even generate fields that strong with downscaled write heads.

The super-paramagnetic wall – or rather the grains themselves – have an interesting property though. The wall shifts with the material’s temperature, because temperature affects the grains’ coercivity. The colder the surface becomes, the more energy is required to write to it. The hotter it becomes, the easier writing will be, with less powerful magnetic fields required.

Given that the wall can be pushed around at will by changing the material’s coercivity, researchers came up with a few solutions based on Lasers and Microwave emitters. So basically: shoot the surface with a Laser or with Microwaves to heat it up with pinpoint accuracy, then write with a small head’s weak field, and profit!
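As a toy illustration of that idea (real media follow more complex laws than this linear approximation, and every number below is made up), writing becomes feasible once heating pushes the coercivity below what a small head’s field can deliver:

```python
def coercivity(temp_k: float, hc0: float = 50.0, t_curie: float = 700.0) -> float:
    """Very simplified: coercivity (kOe) falls roughly linearly toward
    zero at the Curie temperature. hc0 and t_curie are made-up values."""
    return max(0.0, hc0 * (1.0 - temp_k / t_curie))

HEAD_FIELD = 10.0  # kOe a downscaled write head might manage (assumed)

# At room temperature the medium resists the weak field; once the
# Laser/Microwave spot heats it enough, the write goes through:
for t in (300, 450, 600, 650):
    hc = coercivity(t)
    print(f"{t} K: Hc = {hc:5.1f} kOe -> writable: {HEAD_FIELD > hc}")
```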

HAMR head with Laser emitter

HAMR head with Laser emitter[4]

MAMR head with microwave emitter

MAMR head with Microwave emitter[4]

Clearly, several challenges must be mastered to make that happen. First of all, you’d want a disk surface which doesn’t mess with the Microwaves or get messed up by them. Alternatively, the surface must have properties that do not limit the effectiveness of the Laser. In both cases, it needs to house smaller grains. And on top of that, we need highly miniaturized Laser or Microwave emitters – we’re talking about the nanometer level here. Additionally, robustness has to be ensured, which might be an issue with the head permanently shooting Laser beams around or heating itself up with Microwaves.

This is why HAMR/MAMR development is extremely hard and extremely expensive. And not just development; the entire hard drive manufacturing process would need to change considerably, creating additional costs. None of this is true for SMR.

Naturally, a working HAMR/MAMR solution doesn’t have to mean the end of SMR. It may give us a way to keep pushing out large disks without SMR implications for certain professional markets, and even larger SMR-based ones for regular end users. Currently, it seems that HAMR is getting the most attention, and MAMR is likely never going to see the light of day.

Seagate HAMR prototype

Seagate has already demonstrated a working HAMR prototype. MAMR is nowhere to be seen. 20TB is the goal here according to Seagate.

5.) The future #2: BPMR (bit-patterned magnetic recording)

Another approach towards dealing with the density issue is to deposit the magnetic film layer so that the grains sit on nanolithographically created, elevated “islands”. When doing so, the grains show a strong coupling of their exchange energy, which is a quantum-physical effect. That coupling means that grains will follow the magnetic orientation of their neighbors more willingly, and also all stay in the same orientation together. As a result, the energy required for altering the magnetic orientation is proportional to the island’s volume, not the volume of the individual grains representing a single bit. So what we could do is use smaller grains. Or maybe even make the “1 grain per bit” dream a reality by putting a single grain onto each island of a bit-patterned medium.

Bit-patterned media

Bit-patterned medium[5]

So far it’s been said that staggered BPMR would be the easiest to manufacture and might enable manufacturers to pack data even tighter (think: hexagonal pattern alignment), although this might require two-dimensional magnetic reading again – so three read heads to eliminate cross-read interference. Since TDMR wouldn’t imply SMR in this case, as there is no shingling with BPMR, the two could be used together without any issues for the end user.

Staggered bit-patterned surfaces

Staggered bit-patterned surfaces[6]

Servo patterns are also making progress, and they are just as important as the actual data islands for maintaining proper head positioning:

A BPMR servo pattern to the right with non-staggered data tracks/islands to the left

A BPMR servo pattern to the right with non-staggered data tracks/islands to the left[7]

Even with all that goodness, the super-paramagnetic limit will still apply here, albeit in a reduced fashion. A BPMR head could however once more be equipped with a Laser, thus combining BPMR with HAMR technology to further downscale the islands. Throw staggered island tracks and triple read heads for two-dimensional reading into the mix, and the density possibilities grow even further.

Needless to say, this costs a truckload of money. BPMR would require even more drastic changes in how hard drives are manufactured, as we’d need 10 – 20nm nanolithography technology, a different platter composition, and new write and read heads, plus the software – or rather firmware – to properly control all of that.

6.) Conclusion

I don’t believe we’ll really see HAMR anytime soon. On top of that, SMR might not be quite as intermediate a technology as it seems, not merely filling the gap until we can do better and bring the Lasers to the storage battle. When you look at HAMR, it becomes clear that SMR or SMR+TDMR remain feasible in conjunction with it. I believe what we’ll see is an even stronger segmentation of the hard drive markets, where fast writers might be available only in certain expensive enterprise segments, and slower-(re)writing SMR-based media might serve other markets like cold data / cloud storage and data archiving. HAMR would enhance both of them, whether SMR is on board or not. Anything with a considerably higher need for reads could be using SMR for a long time to come.

HAMR or not, any disk featuring SMR will definitely be larger and also feature the implications discussed here.

Then comes BPMR, likely after all the other technologies described here. Now this baby is something else. Shingling tracks is pretty much out of the question here, as grains and the bits they represent can no longer overlap, becoming essentially atomic. If BPMR ever hits the markets – and I’m guessing it will in 10-15 years – the time for SMR to disappear will have come. But that’s so far into the future that calling SMR “intermediate” now might be a bit premature. I’m guessing it may stay for a decade or more after all, even if I consider it quite ugly by design.

People have needed to pay increasing attention to which drives they buy over the past years, and with SMR that will only intensify. Users will need a proper idea of their I/O workloads and of each and every storage technology available to them to make good decisions – so that nobody runs off and buys a bunch of SMR disks to build a random-write-heavy mail server with them.

Good thing I’m building my new RAID-6 soon enough to still get SMR-free drives that I can consider “huge” despite them being non-shingled. ;)

[1] Gibson, G.; Ganger, G. 2011. (Carnegie Mellon University Parallel Data Lab). “Principles of Operation for Shingled Disk Devices“.

[2] Dunn, M.; Feldman, T. 2014. (Seagate, Storage Networking Industry Association). “Shingled Magnetic Recording Models, Standardization and Applications“.

[3] Feldman, T.; Gibson, G. 2013. “Shingled Magnetic Recording“. Usenix ;login: vol. 38 no. 3, 2013-06.

[4] Wood, R. 2010. (Hitachi GST). “Shingled Magnetic Recording and Two-Dimensional Magnetic Recording“. IEEE SCV MagSoc, 2011-10-19.

[5] Toshiba. 2010. “Bit-Patterned Media for High-Density HDDs“. The 21st Magnetic Recording Conference (TMRC) 2010.

[6] A*Star. “Packing in six times more storage density with the help of table salt“.

[7] Original unscaled image is © Kim Lee, Seagate Technology.