How to tell which physical drives are failing in a degraded Proxmox ZFS pool
(Or any pool, really; doesn’t need to be Proxmox…)
Just recently, a server I commissioned 5 years ago began emailing me SMART drive errors. The RAIDZ-1 pool of four WD RED 3TB drives had a couple drives going bad at once — bad news indeed.
In a case like this, it’s important to ensure your backups are current (or make a backup in a hurry, if you haven’t already done so!). With your data safe even if you make a wrong move and blow up your zpool, replacing drives in a ZFS pool is supposed to be quite easy. But in a pool that is already degraded, pulling a good drive by mistake will probably take down the whole storage pool. So how do you know which physical drive is failing? How do you match each physical drive to its identifier in the ZFS pool?
1. View verbose zpool status
First, view the status of your ZFS pool with the “zpool status -v” command. Here’s what my output looked like. I know, not good! (Note: if you have more than one pool and only want to display the status of one, specify the pool name, e.g. “zpool status -v dpool”.)
root@pve:~# zpool status -v dpool
  pool: dpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 1012K in 0 days 08:31:31 with 0 errors on Sun Oct 11 08:55:32 2020
config:

        NAME                                                     STATE     READ WRITE CKSUM
        dpool                                                    DEGRADED     0     0     0
          raidz1-0                                               DEGRADED    70     0     0
            wwn-0x50014ee2b667195e                               DEGRADED    75     0     0  too many errors
            wwn-0x50014ee2b6662089                               ONLINE       0     0     0
            wwn-0x50014ee2b666d820                               FAULTED     24     0     0  too many errors
            wwn-0x50014ee2b665e775                               ONLINE       0     0     0
        logs
          ata-Samsung_SSD_860_EVO_250GB_S3YHNX0M117855W-part1    ONLINE       0     0     0
        cache
          sdb2                                                   ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        dpool/vms/vm-100-disk-0:<0x1>
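On a pool with many devices, a small filter can pull just the unhealthy ones out of that output. This is only a minimal sketch: the function name failing_vdevs is made up for illustration, and it assumes the column layout shown above (name, state, then the three error counters).

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): reads `zpool status` text on stdin
# and prints "<device> <state>" for every leaf device that is not ONLINE.
# Assumption: rows for the pool itself and for raidz/mirror/spare/replacing
# groups are skipped, so only individual devices are reported.
failing_vdevs() {
    pool_name="$1"
    awk -v pool="$pool_name" '
        # Device rows have at least 5 fields: NAME STATE READ WRITE CKSUM
        NF >= 5 &&
        $2 ~ /^(DEGRADED|FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ &&
        $1 != pool &&
        $1 !~ /^(raidz|mirror|spare|replacing)/ {
            print $1, $2
        }'
}

# Usage (as root, on the host that owns the pool):
#   zpool status -v dpool | failing_vdevs dpool
```

Run against the output above, this would list the two wwn- devices marked DEGRADED and FAULTED, which are the ones to track down physically.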
Note the identifiers ZFS is using for the four pool drives, each beginning with “wwn-”. Popping open the cover of my Proxmox host and viewing the drives gets me the serial number of each.
2. Correlate physical drives to their identifiers in the zpool
So without shutting down the server and pulling drives to view their big label stickers, how do we find out which of the drives are failing? Which drive serial numbers correspond to each WWN identifier in the zpool? Turns out it’s easy with smartctl. Let’s find out which disk is DEGRADED in the pool above:
root@pve:~# smartctl -a /dev/disk/by-id/wwn-0x50014ee2b667195e | grep Serial
Serial Number:    WD-WCC4N3VR2V07
All right, a quick compare with the photo of the drives reveals that’s the bottom drive. Next, which one is FAULTED?
root@pve:~# smartctl -a /dev/disk/by-id/wwn-0x50014ee2b666d820 | grep Serial
Serial Number:    WD-WCC4N4HUCHLP
OK, so that’s the second drive down from the top. Now we know exactly which drives are causing the issue and need to be replaced. Actually, after 5 years of 24/7 service, hmmmm… likely they all ought to be replaced! In my case I took a backup, destroyed the pool altogether, and started over with a mirror of two 4TB drives. (I am coming to prefer mirrors over RAID-Z arrays.) For the curious, I opted for Seagate IronWolf drives this time; I’ll see how they treat me.
Obviously there are other methods to arrive at the same place. You can simply run “smartctl -a” on each drive in turn, looking for the serial number and WWN of each drive among the information that is printed to screen:
smartctl -a /dev/sda
and so forth. Whatever best suits your style!
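If you would rather not run the command by hand for every drive, the lookup can be wrapped in a loop. The sketch below is one way to do it, under a few assumptions: the helper names serial_from_smartctl and list_wwn_serials are invented here, smartctl is installed, the script runs as root, and the drives appear as wwn-* links under /dev/disk/by-id. It uses “smartctl -i” (identity info only) rather than “-a” to keep the output short.

```shell
#!/bin/sh
# Hypothetical helper: pull the serial number out of smartctl output on stdin.
# ATA drives print "Serial Number:", SCSI/SAS drives print "Serial number:".
serial_from_smartctl() {
    awk -F': *' '/^Serial [Nn]umber:/ { print $2 }'
}

# Hypothetical helper: print "<wwn-id> <serial>" for each whole-disk wwn-* link.
# The directory argument is optional and defaults to /dev/disk/by-id.
list_wwn_serials() {
    by_id_dir="${1:-/dev/disk/by-id}"
    for link in "$by_id_dir"/wwn-*; do
        [ -e "$link" ] || continue                    # no matching links
        case "$link" in *-part*) continue ;; esac     # skip partition links
        serial=$(smartctl -i "$link" | serial_from_smartctl)
        printf '%s %s\n' "${link##*/}" "$serial"
    done
}

# Usage (as root):
#   list_wwn_serials
```

Each output line pairs a pool identifier with the serial printed on the drive label, so the whole chassis can be mapped in one pass.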
Of course, physically pulling the drive and inspecting the top sticker will also reveal the serial number, WWN identifier, and so forth. But if all you can see is the serial number, and you want to find out which drive is failing without shutting down the server to pull drives, this method will get you there. And if your pool happens to list its drives by their /dev/sdX identifiers, you’ll for sure need help to figure out which physical devices are specified in the pool config (e.g. “smartctl -a /dev/sdc | grep WWN” or “smartctl -a /dev/sdd | grep Serial”, etc.).
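For pools that mix naming styles (like the sdb2 cache device in my output), it can also help to see which /dev/sdX node each wwn- link points to. The entries under /dev/disk/by-id are symlinks, so resolving them gives the mapping directly. This is a rough sketch: map_wwn_to_dev is a made-up name, and it assumes a “readlink” that supports -f, as GNU coreutils on Debian-based systems such as Proxmox does.

```shell
#!/bin/sh
# Hypothetical helper: print "<kernel device> <by-id name>" for each wwn-* link
# by resolving the symlink. The directory argument is optional and defaults
# to /dev/disk/by-id.
map_wwn_to_dev() {
    by_id_dir="${1:-/dev/disk/by-id}"
    for link in "$by_id_dir"/wwn-*; do
        [ -e "$link" ] || continue     # no matching links, or a dangling one
        printf '%s %s\n' "$(readlink -f "$link")" "${link##*/}"
    done
}

# Usage:
#   map_wwn_to_dev | sort
```

Sorting the result groups each disk with its partitions, which makes it easy to read off which wwn- identifier belongs to, say, /dev/sdc.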