Lee,
As I noted, we are on Lustre 1.8 (for another three weeks).
We were able to recover the data. Here is what we did.
This was an old Lustre filesystem based on Sun x4500s/x4540s, built with only software
RAID5+spare, following the Tokyo Tech paper (old, old; see a theme?).
We had a drive fail and a second drive throw read errors during the rebuild. We were able
to recover the data though.
Because the FS had been down for a long time and this was only one of 37 OSTs, we
followed this section of the 1.8 manual:
23.3.5 Identifying a Missing OST
We deactivated the osc for that OST on each client. This let the FS keep moving, only
affecting files with stripes on that OST.
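
For anyone repeating this, here is a minimal sketch of finding the osc device number
for one OST on a client, assuming the usual `lctl dl` column layout (device number,
state, type, name). The helper name osc_devnos and the OST name testfs-OST0004 are
made up for illustration; they are not from our setup.

```shell
# Print the device numbers of osc devices that point at one OST.
# Reads `lctl dl`-style output on stdin; $1 is the OST name, e.g. testfs-OST0004
# (a placeholder name, substitute your own fsname and OST index).
osc_devnos() {
  awk -v ost="$1" '$3 == "osc" && index($4, ost "-osc") == 1 { print $1 }'
}

# Hypothetical usage on each client (needs root and a mounted Lustre client):
#   for dev in $(lctl dl | osc_devnos testfs-OST0004); do
#     lctl --device "$dev" deactivate
#   done
```

The same loop with `activate` in place of `deactivate` brings the OST back later.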
Recovering that OST was not a Lustre-specific thing. We used ddrescue to image the
last drive kicked out onto another drive; it turned out we had only two unreadable
sectors. We were able to assemble the array with the new drive, and fsck -p didn't
even complain, which I had expected it to. The rebuild from there went fine, and we
reactivated the osc connection on every host:
lctl --device <deviceid> activate
And all is well again.
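
The drive-level steps above can be sketched roughly as follows. This is an outline
under assumptions, not the exact commands we ran: the function name recovery_plan,
the map file name rescue.map, and every device path are placeholders, and the
function only prints the commands so they can be reviewed before anything is run.

```shell
# Print (do not run) the recovery steps for one software RAID array.
#   $1 = failing member (last drive kicked out)
#   $2 = replacement drive to image onto
#   $3 = md device to assemble
#   remaining args = members to assemble from (replacement plus surviving drives)
recovery_plan() {
  failing=$1; replacement=$2; md=$3
  shift 3
  # Image the failing drive; -f allows writing to a block device, -n skips the
  # slow scraping pass, and the map file records which areas were unreadable.
  echo "ddrescue -f -n $failing $replacement rescue.map"
  # Assemble the array from the imaged copy and the surviving members.
  echo "mdadm --assemble --force $md $*"
  # Check the filesystem before trusting it.
  echo "fsck -p $md"
}

# Hypothetical usage (all device names are placeholders):
#   recovery_plan /dev/sdc /dev/sdx /dev/md10 /dev/sdx /dev/sda /dev/sdb
```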
I took a lot from this page:
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID
Specifically, I did a lot of testing with the device-mapper snapshot devices to check
whether a given solution worked, without ever writing to the borked array until a test
had worked on the overlays.
Great trick btw.
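
A minimal sketch of the overlay trick, assuming the standard device-mapper snapshot
target: writes go to a sparse copy-on-write file and the real drive is only ever read.
The helper name snapshot_table, the overlay size, and all paths are placeholders for
illustration.

```shell
# Build the device-mapper table line for a snapshot overlay.
#   $1 = size of the origin in 512-byte sectors (e.g. from `blockdev --getsz`)
#   $2 = origin device, which is never written to
#   $3 = copy-on-write device that absorbs every write
# "P 8" asks for a persistent snapshot with 8-sector chunks.
snapshot_table() {
  echo "0 $1 snapshot $2 $3 P 8"
}

# Hypothetical usage for one array member (needs root; paths are placeholders):
#   truncate -s 4G /tmp/overlay-sdb              # sparse file for the COW data
#   loop=$(losetup -f --show /tmp/overlay-sdb)   # attach it to a loop device
#   snapshot_table "$(blockdev --getsz /dev/sdb)" /dev/sdb "$loop" \
#     | dmsetup create overlay-sdb
#   # then assemble and test against /dev/mapper/overlay-sdb, not /dev/sdb
```

If a test goes badly, removing the overlay with `dmsetup remove` and deleting the
sparse file puts you back where you started.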
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp(a)umich.edu
(734)936-1985
On Aug 4, 2014, at 1:16 PM, Lee, Brett <brett.lee(a)intel.com> wrote:
Hi Brock,
The Lustre manual covers the different (temporarily failed, permanently failed)
scenarios.
https://wiki.hpdd.intel.com/display/PUB/Documentation
Chapter 14 - Lustre Maintenance in the 2.x manual
14.1 illustrates a mount option that may be what you're looking for:
"mount -o exclude=testfs-OST0000 ..."
Brett Lee
Solutions Architect, High Performance Data Division
> -----Original Message-----
> From: HPDD-discuss [mailto:hpdd-discuss-bounces@lists.01.org] On Behalf
> Of Brock Palen
> Sent: Monday, August 04, 2014 10:53 AM
> To: hpdd-discuss(a)lists.01.org
> Subject: [HPDD-discuss] removing dead OST,
>
> We just lost an OST to a failure in a legacy Lustre 1.8 filesystem.
>
> How can one go about bringing the filesystem up without this OST?
>
> Thanks,
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> brockp(a)umich.edu
> (734)936-1985
>
>