Replace Failed Storage Pool HDD

Replacing a failed hard disk drive in Windows Server Storage Spaces Storage Pool.

In my day to day jobby, I manage multiple Hyper-V clusters, and these clusters use shared storage. Most of the storage is a JBOD shelf with dual SAS connections to the hypervisor nodes. Because the shelf is JBOD, Windows is the brains for the storage. Enter Windows Storage Spaces. This is basically a Software RAID that you configure a pool of disks, and then you layer a VHD over that pool. After a volume is created on the VHD, it can be used by the hosts. These volumes can be added to Cluster Shared Volumes which allows the volume to move between host nodes, creating redundancy. All the VMs hosted in the cluster, live on the CSV, making them Highly Available.

As hardware always does, physical disks in the pool fail. Most of the Microsoft docs on fixing this issue are quite cumbersome to follow, and even more so, they tend to deal with Storage Spaces Direct, which is not the same technology. So I thought I would make up a quick little walk through on replacing a failed HDD in a storage pool. I have to do this about once a month, so this is just as much for me, as it is for you.

In my configuration, I run Windows Server Core exclusively for Hypervisors, so we will tackling this with PowerShell, tbh, not really even sure how to do this with a GUI.

Identifying a failed HDD

Identifying a failed HDD is pretty simple, it can be done with a simple PowerShell command:

Get-PhysicalDisk | ? HealthStatus -ne "Healthy"

This will show you any disks that have failed, I typically use a 3-way mirror on my Storage Pools, so if that number is greater then 2, then well, might as well just start over. Anyway, if the Usage property of the failed disk is not marked as Retired, you will need to mark it as such:

Get-PhysicalDisk | ? HealthStatus -ne "Healthy" | Set-PhysicalDisk -Usage Retired

So, this next part tends to vary between JBOD models, but to actually identify the failed HDD for physical replacement, there are a few things you can do, first off, you can actually trigger the slots LED to indicate the failed drive:

# This command assumes you already completed the previous command and 'Retired' the failed disks.
Get-PhysicalDisk -Usage Retired | Enable-PhysicalDiskIdentification

You can also get the physical SlotNumber:

Get-PhysicalDisk -Usage Retired | select SlotNumber

However, this number is 0 indexed so when looking at the JBOD shelf, be sure to start counting at 0.

Replacing the HDD

After you have identified the failed HDD and marked it as retired, you can now physically replace the drive. In my experience, after the new drive is inserted and running Get-PhysicalDisk the new drive always hangs in a Starting Operational Status. And because of this you cannot add it to the storage pool (notice the CanPool Property will be False). This can be cleared up by Resetting the disk:

Get-PhysicalDisk | ? OperationalStatus -eq "Starting" | Reset-PhysicalDisk

After this is complete, running the Get-PhysicalDisk will show the disk is Healthy and the CanPool property will be True. Now, you need to identify the Storage Pool’s Friendly Name, this can be found by running the Get-StoragePool command. Once you have the Friendly Name, you can add the new disk to the pool:

Add-PhysicalDisk -StoragePoolFriendlyName MyStoragePool -PhysicalDisks ( Get-PhysicalDisk -CanPool:$true )

Now that the drive is replaced, go ahead and repair the virtual disk:

Get-VirtualDisk | ? HealthStatus -ne "Healthy" | Repair-VirtualDisk

This will take some time depending on how large the virtual disk is, resiliency settings, and hardware, but there will be a progress bar. Or you can add the -AsJob switch to the Repair-VirtualDisk command and let it run in the background. You can fetch the status with Get-StorageJob.

Be sure to disable the LED identification:

# I always just run this on against all the disks, it won't hurt anything for the ones where its not enabled.
Get-PhysicalDisk | Disable-PhysicalDiskIdentification

After the repair is complete, the last step is to remove the failed drive from the pool. Remember, the repair has to fully complete before removing the failed drive from the pool, else data loss will occur.

Remove-PhysicalDisk -StoragePoolFriendlyName MyStoragePool -PhysicalDisks ( Get-PhysicalDisk -Usage Retired )

And thats it, good to go! Cheers.

Tags: powershell  windows  server 
Written on August 5, 2019