One morning I was notified that a disk had failed. No big deal, this happens
now and then. I called Dell and next day I had a replacement disk. While
rebuilding, the replacement disk failed, and in the meantime another disk had
also failed. Dell’s support wisely suggested that I not simply replace
the failed disks, as the array may have been punctured. Apparently, and as I
understand it, disks are only reported as failed when they have sufficiently
many bad blocks, and if you’re unlucky you can lose data if 3 corresponding
blocks on different disks become bad within a short time, so that the RAID
controller does not have a chance to detect the failures, recalculate the data
from the parity, and store it somewhere else. So even though only two drives
flashed red, data might have been lost.
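
To illustrate the parity idea, here is a minimal Python sketch. It is
deliberately simplified to a single XOR parity block per stripe, which
survives the loss of only one block; the array described above evidently
tolerates two bad blocks per stripe, which requires a second, differently
computed parity. The function name xor_blocks and the sample data are made up
for the example and have nothing to do with the controller's firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks on three disks, parity on a fourth disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The block on disk 1 goes bad: XOR of the survivors and the parity restores it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# If a second block of the same stripe goes bad before the first has been
# rebuilt, single parity only yields the XOR of the two missing blocks,
# not the blocks themselves. That stripe is then lost.
```

The same reasoning carries over to double parity with one more block: once
more blocks of a stripe are bad than there are parity blocks, the stripe
cannot be reconstructed.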
Having almost used up the capacity, we decided to order another storage
enclosure, copy the files from the old one to the new one, and then get the old
one into a trustworthy state and use it to extend the total capacity. Normally
I’d have copied/moved the files at block level (e.g. using dd or pvmove), but
suspecting bad blocks, I went for a file-level copy because then I’d know which
files contained the bad blocks. I browsed the net for other people’s experiences
with copying many files and quickly decided that cp would do the job nicely.
Knowing that preserving the hardlinks would require bookkeeping of which files
had already been copied, I also ordered 8 GB more RAM for the server and
configured more swap space.
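
The bookkeeping can be sketched in a few lines of Python. This is not how cp
is implemented, only an illustration of the principle: every file with more
than one link is remembered by its (device, inode) pair, and later names for
the same inode are recreated with link() instead of being copied again. The
function name copy_tree_preserving_hardlinks is made up for the example, and
the sketch assumes a tree of regular files and symlinks to files; it skips
symlinked directories and does no error handling.

```python
import os
import shutil


def copy_tree_preserving_hardlinks(src_root, dst_root):
    """Copy src_root to dst_root, recreating hardlinks instead of duplicating data."""
    seen = {}  # (st_dev, st_ino) -> destination path of the first copy
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            st = os.lstat(src)
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    # Same inode seen before: link to the earlier copy.
                    os.link(seen[key], dst)
                    continue
                seen[key] = dst
            shutil.copy2(src, dst, follow_symlinks=False)
    return seen
```

The table gets one entry per multiply-linked inode and can only be discarded
when the whole copy is done, so its size grows with the number of hardlinked
files in the tree. That is the memory I was trying to plan for.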
When the new hardware had arrived, I started the copying, and at first it
proceeded nicely at around 300-400 MB/s as measured with iotop. After a while
the speed decreased considerably, because most of the time was spent creating
hardlinks, and it takes time to ensure that the filesystem is always in a
consistent state. We use XFS, and we were probably suffering from not having
disabled write barriers, which can be done when the RAID controller has a write
cache with a trustworthy battery backup. As expected, the memory usage of the cp
command increased steadily and was soon in the gigabytes.