Here’s an excellent way of losing hours of work for an officeful of people.

First, have your main fileserver lose a disk. Since it’s configured optimistically, let there be no hot-spares, and let the SCSI backplane drag six other drives off to neverland.

Second, figure out which is the dead drive. Do this largely under remote control, since remote sysadmins are in charge of the local recovery operation (?!).

Third, pull the dead drive. Since it was part of a RAID5 set, the machine should have been happy to resume working in a non-redundant but functional mode. But something mysterious happens, and instead of doing this and letting others in the office resume their work ….

Fourth, initiate a RAID5 resync over the remaining drives, pretending that the dead drive was never in the array. For those following along at home, this multi-hour operation recomputes parity across the survivors as if the missing drive had never existed, scribbling over exactly the redundancy needed to reconstruct its data and rendering the whole array useless.
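
(For the curious: had this been, say, Linux software RAID, the sane move back at step three would have been to assemble the array degraded and then leave it alone. A sketch only, with an assumed md setup and entirely hypothetical device names, since the story never says what the real hardware looked like:)

    # Hypothetical: RAID5 on /dev/md0 built from sda1..sde1, with sde1 dead.

    # Assemble from the survivors and force it to start in degraded mode:
    mdadm --assemble --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

    # Confirm it is running, degraded but with the data intact:
    mdadm --detail /dev/md0
    cat /proc/mdstat

    # The only resync worth doing: add a replacement drive and let the
    # array rebuild onto it, rather than "resyncing" over the survivors.
    mdadm --add /dev/md0 /dev/sdf1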

Fifth, while this resync is running, avoid checking the result by mounting the filesystem, or even running fsck on it. (This is entirely possible, and would let one see the results of #4.) Instead, wait out the whole resync period, then notice … oh, golly, the data is lost!
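
(Again purely as a sketch, assuming the same hypothetical md array and a garden-variety ext3 filesystem: checking would have cost minutes, not hours. Device and mount-point names are made up:)

    # Read-only sanity checks that touch nothing:
    fsck -n /dev/md0               # answer "no" to everything, report only
    mkdir -p /mnt/check
    mount -o ro /dev/md0 /mnt/check
    ls /mnt/check                  # does anything recognizable still exist?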

Sixth, decide to start over, from tape backups. Add some spare or whatnot drive into the array. Resync it again. Look for tape backups, which might be at a remote site too — I don’t know. The people whose work got disrupted several hours ago might as well go home for the day.

Seventh, under no circumstances use this forced multi-hour outage as an opportunity to improve the redundancy or capacity of the server.

Eighth, let the tape backup robot die during recovery, which is only partly surprising, since a full restore of this system has never been attempted. Oh well, what's another few hours of downtime. Oops: discover that the tape drive won't work without cleaning, and we're out of cleaning cartridges. Another day gone.

Ninth, find out that the sole tape drive that was formerly blocked on cleaning is now blocked on breaking. It’s deader than a doornail, and needs a replacement. Another day gone.

Tenth, after a test restore, find that the new tape drive actually … works. Restore the directories. Lose a bunch of the symbolic links that were also on that filesystem. Restore them individually, as people miss them. Then, after three hours of service (that’s three total in the last six days), the server dies again.

Eleventh, and the hits keep on coming: the SCSI problem that dragged half the drives offline is back, the RAID recovery is botched again, and the full restore starts over. Anyone who optimistically moved valuable files back onto the server during the three-hour uptime is rewarded with BOHICA.

Finally, from all this, collect the lesson that a well-funded enterprise-level tape backup system would have been appropriate – instead of the lesson that low-cost high-redundancy DIY disk-to-disk backups are a good thing. The latter is not “enterprise level” and requires less budget, so it can’t be right.
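
(For what it's worth, the unfashionable alternative amounts to little more than a cron job. A sketch of the hard-linked rsync snapshot trick, with made-up paths and a made-up backup disk; not a product, which is presumably the problem:)

    #!/bin/sh
    # Nightly disk-to-disk snapshot of /home onto a cheap second disk or box
    # mounted at /backup. Unchanged files are hard-linked against the previous
    # snapshot, so each snapshot looks complete but costs only the changed files.
    TODAY=$(date +%Y-%m-%d)
    DEST=/backup/fileserver
    rsync -a --delete --link-dest="$DEST/latest" /home/ "$DEST/$TODAY/"
    ln -sfn "$TODAY" "$DEST/latest"

Restores are then a cp away, and a "test restore" is just looking at the directory.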

Note: this story is in no way representative of an actual event, real or imaginary. Facts may be missing or inaccurate, and assumed not in evidence. No actual sysadmins would have ever done something like this. No electrons were created or destroyed during the events depicted in this entirely fictional story.

PostScript (2008-11): Subsequently, discover that the enterprise-level tape backup system™ has been misconfigured such that it cannot recover from the reboot of an NFS server, and thus fails to actually perform backups … for months … without anyone noticing. That is, until someone needs files restored.