A week or so ago, in my weekly status email from my OpenSolaris NAS, I got an unfortunate notice.
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The last sentence is the welcome news that one gets from using cool things like ZFS. But the first sentence strikes fear in the heart of someone who has nearly 3 terabytes of data stored in that drive array. So, it was clearly time to replace the failing drive and get the array back up to full redundant goodness. The simple solution would have been just to replace the aging drive and be done with it. But since I am at about 90% capacity on the array, I figured I would take the opportunity to grow the array: rather than just swapping the single failed 1 terabyte drive for a new one, I would purchase 4 new 1.5 terabyte drives. I know there are 2 terabyte drives out there, but they are significantly more expensive and I didn’t want to go there.
Here is where I started losing the battle that later ensued. I was traveling in Boston at the time and decided to order the drives while I was on the road so that they would arrive soon after I returned and I could get things moving faster. I settled on the Western Digital Caviar Green WD15EARS 1.5TB 64MB Cache SATA 3.0Gb/s 3.5” Internal Hard Drive from Newegg. I liked the low power consumption, the large size, and the cool running. However, I did not do my homework on Western Digital’s new “Advanced Format” feature.
This feature changes the atomic storage unit on the drive from 512 bytes to 4 kilobytes. In the abstract this is good. It means that the drive itself is more efficient and needs less overhead on the platter to store error correction information (ECC) resulting in more storage space. However, it means that if you ever try to write out less than 4 kilobytes of data to the drive, it needs to read the entire 4 kilobytes from the drive platter, “merge” in your new data, recalculate the ECC, and write it back out. Again, doesn’t sound bad. But if your operating system assumes that the disk block size is 512 bytes, it always writes in chunks that are aligned on 512 byte boundaries, not 4 kilobyte boundaries, resulting in TONS of read-modify-write cycles, absolutely killing throughput.
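To make the alignment problem concrete, here is a minimal Python sketch that counts how many 4 kilobyte physical sectors a single write only partially covers, since each of those forces a read-modify-write cycle. The function name `rmw_sectors` and the constants are my own illustrative choices, not anything from the drive firmware or from OpenSolaris, and the model ignores caching and write coalescing that a real drive may do.

```python
PHYSICAL_SECTOR = 4096  # bytes per physical sector on an Advanced Format drive
LOGICAL_SECTOR = 512    # bytes per sector as the operating system assumes

def rmw_sectors(offset, length, physical=PHYSICAL_SECTOR):
    """Count physical sectors that a write covers only partially.

    Each partially covered sector must be read, merged with the new
    data, and written back (hypothetical simplified model).
    """
    if length <= 0:
        return 0
    end = offset + length
    first = offset // physical        # physical sector holding the first byte
    last = (end - 1) // physical      # physical sector holding the last byte
    partial = 0
    if offset % physical != 0:
        partial += 1                  # the write starts mid-sector
    if end % physical != 0:
        # The write ends mid-sector. Skip it if it is the same sector
        # already counted for the unaligned start.
        if last != first or offset % physical == 0:
            partial += 1
    return partial

if __name__ == "__main__":
    # An aligned 4 KB write touches no partial sectors.
    print(rmw_sectors(0, 4096))
    # A single 512-byte write always lands inside one physical sector.
    print(rmw_sectors(512, LOGICAL_SECTOR))
    # A 4 KB write that starts on a 512-byte boundary straddles two sectors.
    print(rmw_sectors(512, 4096))
```

The third case is the painful one: a write that is exactly the right size but starts on a 512 byte boundary turns one clean sector write into two read-modify-write cycles.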
Guess what assumption OpenSolaris makes? If you guessed 4 kilobytes, you are wrong, and don’t know Murphy too well. 512 bytes it is. So, while the drive will work, its performance will absolutely suck. I mean bad. The system was estimating it would take 4.5 DAYS to move about 800 gigabytes of data onto the new drive. And while 800 gigabytes is a lot of data, it should not take that long.
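For a rough sense of what that estimate implies, the arithmetic can be sketched in a few lines of Python. The only inputs are the two figures quoted above (about 800 gigabytes, 4.5 days); the helper name `implied_throughput_mb_s` is made up for illustration, and I am assuming decimal units (1 gigabyte = 1000 megabytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60

def implied_throughput_mb_s(data_gb, days):
    """Average rate in megabytes per second needed to move data_gb in the given days.

    Assumes decimal units: 1 GB = 1000 MB (an assumption, not from the system output).
    """
    return (data_gb * 1000) / (days * SECONDS_PER_DAY)

if __name__ == "__main__":
    rate = implied_throughput_mb_s(800, 4.5)
    print(f"{rate:.2f} MB/s")  # roughly 2 MB/s sustained
```

That works out to roughly 2 megabytes per second averaged over the whole copy, which is why the estimate looked so alarming at first.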
So, it turns out I was wrong: Solaris does just fine with the drive, and it was just my assumption about speed that was off. I will not go into the day-long gnashing of teeth that occurred as I tried different things and tested various drives on the machine. In the end, I convinced myself that the drive was operating acceptably and that it would in fact take 3 days or so to do the resilvering.
But, my efforts also brought out the kindness of the internet and old friends. I posted a frustrated tweet and a good friend of mine from college, Monique, emailed me saying that her husband was an expert in Solaris, ZFS, and hard drives and asking if they could help. Eric and I emailed back and forth for much of the day, me sharing what I had done and thoughts I had about the problem, and he helping me with ideas and questions. I can’t thank them enough and I certainly owe each of them a few beers (or other beverage of choice) next time I see them.
But, in all of this, I realized the Frankenstein machine I have built for this data is, well, a Frankenstein of a machine. And as this has become my central data store at home, I decided to engage in some retail therapy and have purchased the parts for a new, better computer to house the system. This one will have room for 9 drives, if I ever get up to that point, and a new motherboard/CPU/power supply to more comfortably handle the system. It will no longer be shoehorned into a Dell case that was not designed to hold that many drives. So, I am looking forward to building my new system when all the parts arrive later this week (hopefully). If I think of it, I will take pictures and post a montage.