
Recovering From a RAID Disk Failure

These are the steps I followed to recover from a RAID drive failure. Please be careful if you choose to use them: if you specify the wrong devices, you could cause total failure of your RAID and lose all of your data.

You may need to install sgdisk; I did this by installing the "gdisk" package.

In my case, the second RAID drive failed very soon after the first. Both were SSDs. Since SSDs with similar usage tend to wear out at about the same time, RAID algorithms may have to change for SSDs to prevent simultaneous drive failures and total RAID failure.

* Capture "mdadm --detail /dev/md1" to backup
* Capture "/proc/mdstat" to backup
* Capture "mount" to backup
** /dev/md1 is /
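
The three captures above can be scripted so nothing is forgotten. This is only a sketch: the function name capture_raid_state and the example directory /root/raid-backup are my own placeholders, not anything mdadm provides. Ideally the backup directory is somewhere off the array.

```shell
# Sketch: save the array state before touching anything.
# capture_raid_state is a hypothetical helper name; arguments are
# the md device and a directory to write the captures into.
capture_raid_state() {
    array=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    # One file per command. A failing command is reported on stderr
    # but does not stop the remaining captures.
    mdadm --detail "$array" > "$backup_dir/mdadm-detail.txt" 2>&1 ||
        echo "mdadm --detail failed" >&2
    cat /proc/mdstat > "$backup_dir/mdstat.txt" 2>&1 ||
        echo "could not read /proc/mdstat" >&2
    mount > "$backup_dir/mount.txt" 2>&1 ||
        echo "mount failed" >&2
}

# Example (as root), assuming /dev/md1 as in this article:
#   capture_raid_state /dev/md1 /root/raid-backup
```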

* Get list of serial numbers to backup
** hdparm -I /dev/sdX | grep "Serial Number:" (repeat for each drive; /dev/sdX is a placeholder)
* Stop the failing drive, if it can be identified. (Record the serial number.)
** mdadm --manage /dev/md1 --fail /dev/sdb2
** mdadm --manage /dev/md1 --remove /dev/sdb2
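
A rough sketch of the serial-number and fail/remove steps above, written so the destructive part can be rehearsed first. The helper names (extract_serial, run, retire_member) and the DRY_RUN variable are hypothetical, made up for this example. With DRY_RUN left at its default of 1 the mdadm commands are only printed.

```shell
# Sketch only: helper names are hypothetical, not part of mdadm or hdparm.

# Print the serial number found in `hdparm -I` output read from stdin.
extract_serial() {
    sed -n -e '/Serial Number:/{s/^[[:space:]]*Serial Number:[[:space:]]*//;s/[[:space:]]*$//;p;q;}'
}

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mark a member as failed, then remove it from the array.
retire_member() {
    array=$1
    member=$2
    run mdadm --manage "$array" --fail "$member" &&
        run mdadm --manage "$array" --remove "$member"
}

# Example (as root), using the device names from this article:
#   hdparm -I /dev/sdb | extract_serial
#   DRY_RUN=0 retire_member /dev/md1 /dev/sdb2
```

Recording the serial number before removal matters because device names like /dev/sdb can change after a reboot, while the serial printed on the drive label does not.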

* Remove the failed drive
* Pull out the failing drive and replace it with a new one

* Comment out the /boot-backup entry in /etc/fstab

* Add the new drive to the array
** Partition it the same:
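*** Doesn't work: sfdisk -d /dev/sda | sfdisk /dev/sdb (presumably because these disks use GPT, which older versions of sfdisk do not handle)
*** sgdisk -R /dev/sdb /dev/sda (copies the partition table from /dev/sda to /dev/sdb; the target comes first, so double-check the order)
*** sgdisk -G /dev/sdb (gives the copy new random GUIDs so the two disks don't share identifiers)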
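
Since getting the sgdisk argument order wrong would overwrite the good drive, a small guard can help. The sketch below is an illustration under my own assumptions: the helper names (disk_in_use, run, replicate_table) and DRY_RUN are hypothetical, and the mdstat check assumes sdX-style device names. It refuses to write to a disk that still appears as an array member, and by default only prints the sgdisk commands.

```shell
# Sketch only: helper names are hypothetical.

# Return 0 if any partition of disk $1 (e.g. "sdb") is listed as an
# array member in the mdstat-format file $2.
disk_in_use() {
    grep -Eq "(^|[[:space:]])$1[0-9]*\[[0-9]+\]" "$2"
}

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy the partition table from the good disk to the new disk, then
# randomize the GUIDs on the new disk.
replicate_table() {
    good=$1
    new=$2
    mdstat=${3:-/proc/mdstat}
    if [ "$good" = "$new" ]; then
        echo "source and target are the same disk" >&2
        return 1
    fi
    if disk_in_use "${new#/dev/}" "$mdstat"; then
        echo "$new is still part of an array, refusing" >&2
        return 1
    fi
    # sgdisk -R takes the TARGET first, then the source.
    run sgdisk -R "$new" "$good" &&
        run sgdisk -G "$new"
}

# Example (as root): DRY_RUN=0 replicate_table /dev/sda /dev/sdb
```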

** Add it to the RAID
*** mdadm --manage /dev/md1 --add /dev/sdb2

** Monitor restore progress
*** cat /proc/mdstat
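
Rather than re-running cat by hand, the rebuild percentage can be pulled out of /proc/mdstat. This is a minimal sketch; rebuild_progress is a hypothetical helper name, and it assumes the usual "recovery = NN.N%" (or "resync = NN.N%") progress line that mdstat shows during a rebuild.

```shell
# Sketch only: rebuild_progress is a hypothetical helper name.

# Print the rebuild percentage (e.g. "12.6%") for each array that is
# currently recovering or resyncing. Prints nothing when idle.
# $1 is the file to read, defaulting to /proc/mdstat.
rebuild_progress() {
    sed -nE 's/.*(recovery|resync) *= *([0-9.]+%).*/\2/p' "${1:-/proc/mdstat}"
}

# Example: poll once a minute until the rebuild has finished.
#   while grep -Eq 'recovery|resync' /proc/mdstat; do
#       rebuild_progress
#       sleep 60
#   done
```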

* Copy /boot from the backup to the new drive
** Find the boot partition
** Format the partition on the new drive
*** mkfs.ext4 /dev/sdb3
** Copy contents over
*** mount /dev/sdb3 /path-to-mounted-boot-partition
*** cp -a /path-to-boot-backup/* /path-to-mounted-boot-partition
* Copy the MBR to the new drive
** Included in sgdisk -R above (sgdisk copies the 'protective' MBR and the GPT)
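
The copy step for /boot can be followed by a quick verification. The sketch below is one possible way to do it, with a hypothetical helper name (copy_tree). It copies with "source/." instead of "source/*" so that top-level dotfiles are included too, then compares the two trees, ignoring the lost+found directory that a fresh ext4 filesystem contains.

```shell
# Sketch only: copy_tree is a hypothetical helper name.

# Copy the contents of directory $1 into directory $2, preserving
# ownership, permissions and timestamps, then verify the copy.
# Returns non-zero if the copy fails or the trees differ.
copy_tree() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "both arguments must be existing directories" >&2
        return 1
    fi
    # "$src/." copies everything inside src, including dotfiles.
    cp -a "$src/." "$dst/" || return 1
    # Compare recursively, skipping the new filesystem's lost+found.
    diff -r -x lost+found "$src" "$dst"
}

# Example (as root), with the placeholder paths used in this article:
#   copy_tree /path-to-boot-backup /path-to-mounted-boot-partition
```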

Last modified on 16 Sep 2012 by AO

Copyright © 2016 Andrew Oliver