NAS4FREE. Issues with a failed drive in RAIDZ1, formatted wrong drive, then completely failed ZPOOL. Recovering, then reconfiguring a backup plan to USB drives.
So my Christmas present for last year was almost losing 8.6TB of data on my NAS4FREE server. It hosts a lot of photos, ISOs (MS, games, Linux and others) and other files. A big portion of the files are backups of family PCs, which in turn contain lots of photos.
I was about to go out fishing and swimming, but thought to check my server before I did. Firstly, I noticed the zpool was degraded. You can check this in the GUI via Disks, ZFS, Information, or on the command line via “zpool status xx” where xx is the zpool name.
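A minimal sketch of how that check could be scripted, assuming the usual NAME / STATE column layout of “zpool status” output. The function name zpool_problem_devices is made up for illustration; it reads the status text on stdin, so it can be tried without a real pool.

```shell
# List every pool, vdev or disk whose STATE column is not ONLINE.
# Reads `zpool status` output on stdin (hypothetical helper, not part of ZFS).
zpool_problem_devices() {
    awk '
        # The device table starts after the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line marks the end of the table.
        $1 == "errors:"               { in_config = 0 }
        in_config && $2 ~ /^(DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ { print $1, $2 }
    '
}

# On the server (storage is the pool name used later in this post):
#   zpool status storage | zpool_problem_devices
```

An empty result means nothing in the table was flagged; anything printed is worth a look before heading out the door.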
I thought I had a working replacement drive, but it turns out it was actually failing as well. Long story short, this is what happened.
I tried to replace the drive by unplugging the old, plugging in the new and then running the standard “zpool replace poolname olddrive newdrive”. This failed. Other commands I tried, like detach, also failed. I went to format the new drive thinking there might be an issue with it, but it turns out this doofus formatted the wrong drive. Now I thought this shouldn’t be possible if it’s in a working RAIDZ1, but turns out it is…
I spent hours and hours trying different things, including testdisk and other ZFS label recovery software, but none worked. I looked at the partition layout of the two working drives and they were identical. I remembered the layout of the disk I had formatted being very similar to, if not the same as, those two disks. Now at this point (10 hours after the server failed completely), I figured I had nothing to lose. I did have some of the data backed up, but the backups hadn’t been updated for the last few months. An external drive I was using for backups had issues, and then died completely.
I researched how to back up a partition layout and restore it to another disk… http://askubuntu.com/questions/57908/how-can-i-quickly-copy-a-gpt-partition-scheme-from-one-hard-drive-to-another gave me the answer I needed. Basically I had an external USB docking station, and an old laptop with Ubuntu on it. So I backed up the partition table of a working disk, and of the disk I stupidly formatted, and then restored the working disk’s table to the stupid one.
The sgdisk tool comes from the gdisk package:
sudo apt-get install gdisk
Plugged in the working disk.
sudo sgdisk --backup=table-good /dev/sdX
Plugged in the dodgily partitioned disk.
sudo sgdisk --backup=table-bad /dev/sdX
sudo sgdisk --load-backup=table-good /dev/sdX
You need to find out which device name (the X in /dev/sdX) the disk has been given. If I remember rightly, the easiest way to find it is to use the disk utility in Ubuntu. If you only have 1 disk in the machine, the second one you plug in should generally speaking be sdb.
If you don’t have an external dock, or another machine, you can use your desktop computer and an Ubuntu live CD. For the safety of your data, you should shut down your PC each time you remove/add a disk.
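Since this whole saga started with the wrong drive being formatted, a small guard before any write seems worth sketching. The helper below is hypothetical (confirm_device is not part of sgdisk or Ubuntu); it simply makes you retype the device path before the destructive step is allowed to run. BACKUPFILE in the usage comment is a placeholder for whatever backup file name you used.

```shell
# Ask the operator to retype the device path before anything is written to it.
# Succeeds only if the line read from stdin matches the path exactly.
confirm_device() {
    dev="$1"
    printf 'About to write to %s. Type the device path to confirm: ' "$dev" >&2
    read -r answer || return 1
    [ -n "$dev" ] && [ "$answer" = "$dev" ]
}

# Possible use on the Ubuntu laptop (sdX stands for the real device):
#   lsblk -o NAME,SIZE,MODEL,SERIAL          # check size/model/serial first
#   confirm_device /dev/sdX && sudo sgdisk --load-backup=BACKUPFILE /dev/sdX
```

Comparing the model and serial number from lsblk against the label on the physical drive is a cheap second check alongside the disk utility.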
Anyway, did the above, plugged it back in and booted. SUCCESS! The RAIDZ1 pool was now in a degraded state instead of failed/unavailable. It then took almost a full day to ‘resilver’ this disk. As you’d expect, it was Boxing Day, so no computer stores were open. Once I got the chance I bought a new drive, and a replacement USB drive. In the meantime I started researching GEOM vs GPTID. I still haven’t figured this out yet, so my zpool status shows some disks as gptid/xxxxx and others as ada0p1 etc.
Once I had the new drive, I replaced the disk successfully by offlining the disk that failed, removing it, then replacing it.
Commands from memory are (from http://askubuntu.com/questions/305830/replacing-a-dead-disk-in-a-zpool):
“zpool offline storage xxx” where xxx is the disk that’s showing unavailable in “zpool status storage” and where storage is my zpool name.
I then used “zpool replace storage xxx aaa” where xxx is the offlined disk and aaa is the new disk as shown in the GUI under Disks, Management.
This took about a day and a half to resilver after which the “zpool status storage” command showed online 😀
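Rather than rerunning the status command by hand for a day and a half, the wait could be scripted. This is only a sketch: resilver_running is a made-up helper name, and it assumes the status output contains the phrase “resilver in progress” while the rebuild is running.

```shell
# Succeeds while the status text on stdin reports an active resilver.
resilver_running() {
    grep -q "resilver in progress"
}

# On the server, after the offline/replace commands above:
#   while zpool status storage | resilver_running; do sleep 600; done
#   zpool status storage      # final check once the loop exits
```

The ten-minute sleep is arbitrary; the loop just exits once the phrase disappears from the status output.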
Now to backups. I was using rsync with USB drives formatted in UFS. The downside is that it’s slow and CPU intensive, so I did lots of research to find a ‘better way’. What I settled on is backing up to USB drives formatted with ZFS. One thing to note: if you want to access files on these USB drives, you will need a NAS4FREE (or other BSD-based) server that can import ZFS disks.
Awesome scripts here we come thanks to Fritz! http://forums.nas4free.org/viewtopic.php?t=2197 or direct link https://github.com/fritz-hh/scripts_NAS4Free/
I now use the backupData.sh script. Basic use: I placed the .sh file on my array, made it executable with “chmod u+x backupData.sh”, and ran “./backupData.sh storage/Important usb4tb”. storage/Important is a dataset under my storage zpool. usb4tb is my 4TB USB drive, which I set up as a single-disk zpool using “zpool create usb4tb da3”, where da3 is my USB drive.
I’ve also set up SMART monitoring on each disk. Remember that each time you do a “clear and import disks” you have to activate SMART on all of the drives again. This is where I went wrong previously.
I have now set up another .sh script which runs multiple ./backupData.sh commands in a row to back up multiple datasets to multiple USB drives. Now as you have probably figured, datasets are critical when it comes to this. A few years back I changed from using a single zpool with all my data to using individual datasets within the zpool to separate data types, i.e. I have photos, backups, Important etc. as separate datasets. I try to keep each of these datasets smaller than the USB drive it goes to, so the backup doesn’t fail.
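A wrapper along these lines might look like the sketch below. It is not the actual script from this setup: run_backups and the BACKUP_SCRIPT variable are invented for illustration, and the only thing assumed about backupData.sh is the “dataset destination-pool” argument order shown earlier.

```shell
#!/bin/sh
# Run backupData.sh once per "dataset destination-pool" pair, in order.
# BACKUP_SCRIPT can be overridden (e.g. with echo) for a dry run.
BACKUP_SCRIPT="${BACKUP_SCRIPT:-./backupData.sh}"

run_backups() {
    failed=0
    while read -r dataset pool; do
        # Skip blank lines and comment lines in the job list.
        case "$dataset" in ''|'#'*) continue ;; esac
        echo "Backing up $dataset to $pool"
        # </dev/null stops the backup script from eating the job list.
        if ! "$BACKUP_SCRIPT" "$dataset" "$pool" </dev/null; then
            echo "Backup of $dataset to $pool failed" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example job list (dataset names from this post, one USB pool):
#   run_backups <<EOF
#   storage/Important usb4tb
#   storage/photos    usb4tb
#   EOF
```

One failed job does not stop the rest; the function just returns non-zero at the end so a cron mail or log check can pick it up.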
Just a note with this; this uses snapshots, so be sure to have some auto snapshots happening. I use a recursive snapshot on storage so it snapshots all the datasets.
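For anyone taking the recursive snapshot by hand or from cron instead of the GUI, a rough sketch follows. The snapshot_name helper and the “auto-” naming scheme are assumptions for illustration only; check what naming backupData.sh expects before relying on it.

```shell
# Build a snapshot name like storage@auto-20240101-0300.
# $1 = pool or dataset, $2 = optional timestamp (defaults to now).
snapshot_name() {
    printf '%s@auto-%s\n' "$1" "${2:-$(date +%Y%m%d-%H%M)}"
}

# On the server: -r snapshots storage and every dataset beneath it.
#   zfs snapshot -r "$(snapshot_name storage)"
#   zfs list -t snapshot      # confirm the snapshots exist
```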