Saturday, August 28, 2010

Adventures in the Linux storage stack

I recently got my Gentoo box back from storage after three months, promptly upgraded all the software on it, and tried to install a shiny new SSD. Here are the problems I've run into so far:

  • Media drive failed to mount
    My media drive is a btrfs RAID-0 across two 500GB drives that I've picked up over the years. One of them wasn't being detected. After ruling out the usual suspects (the drive itself was still healthy according to smartctl, and I had already run btrfs's device-scanning tool, btrfsctl -a [but oh god why does it even need this]), I started looking at more exotic causes.
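
    (For reference, the scan-then-mount dance on a multi-device btrfs looks roughly like this; the device name and mount point are made up, not my actual layout:)

      # tell the kernel about every btrfs member device it can find
      btrfsctl -a
      # then mount by any one member device
      mount -t btrfs /dev/sdb /mnt/media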

    Diagnosis: The problem turned out to be a regression in the md userspace tools (it must be userspace, since I didn't upgrade the kernel): if a device used to belong to an md RAID array but doesn't anymore, md will still pick it up as a RAID member and lock it. The recommended fix is to zero out the first part of the drive, which isn't an option for me because the drive ACTUALLY HAS DATA ON IT. :[
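
    For posterity, checking whether md has grabbed a disk out from under you goes something like this (a sketch only; /dev/sdb and md127 are hypothetical names, and I make no promises about the last step on your hardware):

      # any array md auto-assembled shows up here
      cat /proc/mdstat
      # look for a leftover RAID superblock on the member disk
      mdadm --examine /dev/sdb
      # stopping the stale array should release the disk without zeroing anything
      mdadm --stop /dev/md127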

  • pvmove stalls at 100%
    While using pvmove to transfer my OS drive to the shiny new SSD, it stalled at 100% complete. No data was being written, which I could see thanks to gkrellm. Then my entire system locked up, or at least every process that tried to write data did. After rebooting, I tried to resume the transfer (which pvmove supports), and the system locked up again in the same way.
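
    (Resuming, for reference, is just pvmove with no arguments; it picks up any interrupted moves recorded in the VG metadata. Not that it helped here.)

      # restart any interrupted pvmoves
      pvmove
      # or abandon the move and leave the extents where they started
      pvmove --abort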

    Diagnosis: None. Chalk this one up to LVM being completely fucked. It is a master of walking the thin line between being so useful that it can't be ignored and so broken that it can't be used. (UPDATE: this might have been caused by creating the PV on a whole disk instead of a partition. Not sure why that would be a problem, but it's the only thing I changed, and now it works.)
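
    If the whole-disk PV really was the culprit, the partition-first version goes something like this (a sketch; /dev/sdb, /dev/sda2, and vg0 are hypothetical names, not my actual layout):

      # put a partition table on the SSD and carve out one big partition
      parted -s /dev/sdb mklabel msdos
      parted -s /dev/sdb mkpart primary 1MiB 100%
      # make it a PV and pull it into the volume group
      pvcreate /dev/sdb1
      vgextend vg0 /dev/sdb1
      # migrate everything off the old PV, then drop it from the VG
      pvmove /dev/sda2 /dev/sdb1
      vgreduce vg0 /dev/sda2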

  • Boot environment doesn't see shiny new SSD
    I have a completely customized boot process. It's part of a 70%-successful experiment from about six months back in building an operating system that could survive a hard drive failure. (Harder than you'd think!) Part of that is a heavily customized initrd - basically a micro-OS embedded into the kernel image itself, so that I can do funky stuff during boot. This was all well and good until I found out that while my OS could see the new drive fine, my initrd couldn't. :(
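
    (For the curious: embedding an initramfs into the kernel image is just a kernel config option pointing at a directory or cpio list; the path here is an example, not my actual setup.)

      # .config: bake the initramfs straight into the kernel image
      CONFIG_BLK_DEV_INITRD=y
      CONFIG_INITRAMFS_SOURCE="/usr/src/initramfs"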

    Diagnosis: This one was me getting sloppy. When I created the initrd, I deleted a few thousand miscellaneous device nodes from the static /dev directory (because udev is too complex, so let's just embed dev nodes for every piece of hardware that could possibly be connected -_-), and it turns out this included the one used by the drive I'd just added. I only discovered that after a half hour of "oh shit oh shit did I just transfer my entire OS to a DOA drive?", of course.
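
    Re-creating a missing node by hand is at least easy; SATA/SCSI disks are major 8, with a block of 16 minors per disk (which disk gets which block depends on detection order, so treat these numbers as examples):

      # third SATA/SCSI disk: block device, major 8, minor 32
      mknod /dev/sdc b 8 32
      # its first partition is minor 33
      mknod /dev/sdc1 b 8 33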
