Hi all,
One of our OSSes crashed this week at our site. As good practice, we ran e2fsck
on each OST before mounting it. e2fsck did not report any errors, so we mounted
all the OSTs. After that, some users complained that they could not access some
of their files, or that they got a read error when they tried to open them.
We decided to stop Lustre and re-run e2fsck on all OSTs. On the OSS where we
had run e2fsck previously, there were many errors that had not been detected on
the first run:
--------------------------------------------------------------------
Pass 1: Checking inodes, blocks, and sizes
HTREE directory inode 25133057 has an invalid root node.
Clear HTree index? yes
Inode 25133057, i_size is 1540096, should be 1634304. Fix? yes
Inode 25133057, i_blocks is 3016, should be 16. Fix? yes
HTREE directory inode 25133058 has an invalid root node.
Clear HTree index? yes
Inode 25133058, i_size is 1540096, should be 1634304. Fix? yes
Inode 25133058, i_blocks is 3016, should be 16. Fix? yes
HTREE directory inode 25133059 has an invalid root node.
.....
---------------------------------------------------------------------
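For anyone who wants to tally which inodes were touched, a minimal Python sketch that parses output of this shape could look like the following. It only recognizes the three message types shown in the excerpt above, and the names `summarize`, `HTREE_RE` and `FIELD_RE` are hypothetical, not part of e2fsprogs or Lustre.

```python
import re

# Sample lines copied from the e2fsck excerpt above (prompts omitted).
SAMPLE = """\
HTREE directory inode 25133057 has an invalid root node.
Inode 25133057, i_size is 1540096, should be 1634304. Fix? yes
Inode 25133057, i_blocks is 3016, should be 16. Fix? yes
HTREE directory inode 25133058 has an invalid root node.
Inode 25133058, i_size is 1540096, should be 1634304. Fix? yes
Inode 25133058, i_blocks is 3016, should be 16. Fix? yes
HTREE directory inode 25133059 has an invalid root node.
"""

# Assumption: these patterns match only the messages seen in the excerpt.
HTREE_RE = re.compile(r"^HTREE directory inode (\d+) has an invalid root node\.")
FIELD_RE = re.compile(r"^Inode (\d+), (i_size|i_blocks) is (\d+), should be (\d+)\.")


def summarize(log_text):
    """Map each inode number to the problems e2fsck reported for it."""
    report = {}
    for line in log_text.splitlines():
        m = HTREE_RE.match(line)
        if m:
            report.setdefault(int(m.group(1)), {})["htree"] = True
            continue
        m = FIELD_RE.match(line)
        if m:
            inode, field, old, new = m.groups()
            # Store (value found on disk, value e2fsck expected).
            report.setdefault(int(inode), {})[field] = (int(old), int(new))
    return report


if __name__ == "__main__":
    for inode, info in sorted(summarize(SAMPLE).items()):
        print(inode, info)
```

Running this over the full e2fsck log (read from a file instead of `SAMPLE`) would give a list of inode numbers that can be compared against what ended up in lost+found.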
After the filesystem check, we recovered the Lustre OST objects in lost+found
with ll_recover_lost_found_objs. We remounted all the OSTs, but a large number
of files are still inaccessible.
After some web searching, I found that the latest version of e2fsprogs
(http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42.9) has fixed a
large number of bugs in e2fsck to correctly handle 64-bit file systems:
"Fixed a large number of bugs in resize2fs, e2fsck, debugfs, and libext2fs to
correctly handle bigalloc and 64-bit file systems. There were many corner cases
that had not been noticed in previous uses of these file systems, since they
are not as common. Some of the bugs could cause file system corruption or data
loss, so users of 64-bit or bigalloc file systems are strongly urged to upgrade
to e2fsprogs 1.42.9."
Our OST size is 32TB (10 disks x 4TB in RAID6), so the 64bit filesystem feature
was enabled when we created the OST.
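To spell out the arithmetic behind the 32TB and 16TB figures, here is a small Python sketch. It assumes a 4 KiB block size (not stated here, but the usual default) and uses hypothetical helper names; with 32-bit block numbers, a file system can address at most 2^32 blocks.

```python
TB = 10**12   # decimal terabyte, as disk vendors count
TiB = 2**40   # binary tebibyte


def raid6_usable_bytes(n_disks, disk_bytes):
    """RAID6 spends two disks' worth of space on parity."""
    return (n_disks - 2) * disk_bytes


def max_fs_bytes_32bit(block_size=4096):
    """Largest file system addressable with 32-bit block numbers.

    block_size=4096 is an assumption, not a value taken from our setup.
    """
    return 2**32 * block_size


ost = raid6_usable_bytes(10, 4 * TB)
limit = max_fs_bytes_32bit()

print(ost // TB, "TB usable per OST")         # 32 TB usable per OST
print(limit // TiB, "TiB limit without 64bit")  # 16 TiB limit without 64bit
print("needs 64bit feature:", ost > limit)    # needs 64bit feature: True
```

So any OST built this way is above the limit for 32-bit block numbers, which is why the 64bit feature is in use.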
Knowing that the current version of e2fsprogs available for Lustre is
1.42.7.wc2, is it possible that the e2fsck shipped for Lustre caused the file
system corruption or data loss because our OST size is > 16TB?
Is there a way to retrieve the inaccessible files?
Our Lustre version is 2.1.6. We were previously on 2.1.3. Each OST is 71% used
(20TB of 32TB).
Regards,
Minh