On 2013/09/10 11:29 AM, "Edward Walter" <ewalter(a)cs.cmu.edu> wrote:
We're running into something we haven't seen before: our OSTs (not
our MDT) seem to be out of inodes, and we're seeing the following errors
in the OST logs:
> LustreError: 12393:0:(filter.c:3953:filter_precreate()) create failed rc = -28
> LustreError: 12393:0:(filter.c:3953:filter_precreate()) Skipped 3 previous similar messages
Here's the space count for this filesystem:
> [root@warp e2fsprogs]# lfs df -i /lustre
> UUID                   Inodes     IUsed      IFree IUse% Mounted on
> data-MDT0000_UUID   159907840  44093755  115814085  28%  /lustre[MDT:0]
> data-OST0000_UUID     7629312   7629312          0 100%  /lustre[OST:0]
> data-OST0001_UUID     7629312   7629312          0 100%  /lustre[OST:1]
> [snip]
> data-OST0016_UUID     7629312   7629312          0 100%  /lustre[OST:22]
>
> filesystem summary: 159907840  44093755  115814085  28%  /lustre
> [root@warp e2fsprogs]# lfs df -h /lustre
> UUID               bytes    Used  Available Use% Mounted on
> data-MDT0000_UUID 228.7G    8.3G     205.2G   4% /lustre[MDT:0]
> data-OST0000_UUID   1.8T  646.2G       1.1T  37% /lustre[OST:0]
> [snip]
> data-OST0016_UUID   1.8T  643.7G       1.1T  36% /lustre[OST:22]
>
> filesystem summary:  41.8T  14.7T      25.0T  37% /lustre
We use a default stripe count of 4, so the numbers seem to add up:
4 x 44,093,755 MDT inodes / 23 OSTs = 7,668,479 inodes per OST
which is in the ballpark of what the OSTs are actually reporting.
Does this make sense or am I misunderstanding what I'm seeing?
You understand correctly. Your filesystem is configured in what I would
consider a sub-optimal manner, namely that you have a default stripe count
of 4, but your average file size is only 14.7 TB / 44M inodes =
358kB/inode.
That means that every inode has 4 OST objects allocated (that need to be
accessed for every "ls" or other stat()), but only the first one has any
data in it (assuming the default 1MB stripe size) and the other three are
completely unused. This isn't really a usage scenario that we expected.
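[Editor's note: the arithmetic in this exchange can be sanity-checked with a small illustrative shell calculation; the variable names are mine, and the values are copied from the "lfs df" output above.]

```shell
#!/bin/bash
# Sanity check of the numbers in this thread (illustrative only;
# values taken from the "lfs df" output quoted above).

mdt_inodes_used=44093755   # files created on the MDT
stripe_count=4             # default stripe count
num_osts=23                # OST indices 0x0000..0x0016 = 23 OSTs

# Expected objects (and hence inodes) consumed per OST:
echo $(( stripe_count * mdt_inodes_used / num_osts ))   # -> 7668479

# Average file size in KiB: 14.7 TiB used / files created
tib=1099511627776
echo $(( 147 * tib / 10 / mdt_inodes_used / 1024 ))     # -> 357 (~358 kB)
```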
This filesystem was originally deployed with Lustre 1.8.4. We used
similar formatting options and the same stripe count (4) then. We consumed
91,269,360 inodes on the MDT (we ran it out of inodes) with that
filesystem without running out of inodes on the OSTs. Has something
changed with how the inode count is chosen for the OSTs since then?
Yes, the default mkfs parameters for the underlying OST and MDT were
changed in 1.8.7 via
http://review.whamcloud.com/959 in "increase default
inode ratio for OST/MDT" so that there weren't so many unused objects on
the OSTs. The huge overprovisioning of inodes on the OST had a significant
negative impact on mkfs and e2fsck performance. The current default for
OSTs over 1TB is one inode per 256kB of OST size, which was considered a
fairly safe margin for Lustre users (OSTs under 1TB assume a 64kB average
file size).
With a default stripe count of 4, you would need to have an average file
size of 4 * 256kB = 1MB.
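[Editor's note: as a rough illustration (my arithmetic, not from the original message), one inode per 256kB on a ~1.8 TiB OST works out to roughly 7.5M inodes, which lines up with the 7,629,312 reported by "lfs df -i" above; the exact count depends on the true partition size and mke2fs rounding.]

```shell
#!/bin/bash
# Approximate inode count for a ~1.8 TiB OST formatted with the
# post-1.8.7 default of one inode per 256kB (illustrative only).
tib=1099511627776
ost_bytes=$(( 18 * tib / 10 ))         # ~1.8 TiB
echo $(( ost_bytes / (256 * 1024) ))   # -> 7549747, close to the 7629312 reported
```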
We reformatted everything as part of the upgrade from 1.8.4. Here's
how we formatted things for Lustre 2.1.4:
> for i in {b,c,e,f,g} ; do
>     mkfs.lustre --reformat --fsname data --param ost.quota_type=ug \
>         --mgsnode=mdt-3-40.ib@o2ib,mdt-3-40.coma-ib@o2ib1 \
>         --ost /dev/sd${i}1
> done
And this is what we used for 1.8.4:
> mkfs.lustre --fsname=data --ost --reformat --mgsnode=172.16.1.3@o2ib \
>     --mountfsoptions="stripe=256,errors=remount-ro,extents,mballoc" \
>     --mkfsoptions="-E stride=32" /dev/sdd1
As you can see, neither of these explicitly changes the inode count on
the OSTs (but the 1.8.4 defaults seemed to support our usage patterns).
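[Editor's note: for completeness, the bytes-per-inode ratio an existing OST was formatted with can be confirmed from the ext4 superblock, and overridden at format time by passing mke2fs's -i option through --mkfsoptions. The device names below are hypothetical, and these are example invocations, not commands from the original thread.]

```shell
# Inspect inode and block counts of an existing OST's backing device:
dumpe2fs -h /dev/sdb1 | grep -E 'Inode count|Block count'

# At format time, request a smaller bytes-per-inode ratio explicitly,
# e.g. one inode per 128kB, via mke2fs's -i option:
mkfs.lustre --reformat --fsname data --ost \
    --mgsnode=mdt-3-40.ib@o2ib \
    --mkfsoptions="-i 131072" /dev/sdb1
```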
Any pointers or suggestions as to what changed or what we missed would
be greatly appreciated. My suspicion is that we'll need to move our
data and reformat our OSTs to fix this though. Is that our only option
here?
Since the extra stripes for the small files are not helping you at all,
there IS something that you can do to get out of this predicament. The
tools to do this are better in Lustre 2.4 ("lfs find" allows checking the
stripe count directly, and "lfs_migrate" allows explicitly specifying the
stripe count of the file), but it should also be possible to do it with
the Lustre 2.1 versions:
# set the global default striping on the filesystem to 1 stripe
lfs setstripe -c 1 /lustre
# find small files (under 4MB) that still have more than one stripe
# and feed them to lfs_migrate to be rewritten with the new default
lfs find /lustre -type f -size -4M -print | while read F; do
    [ $(lfs getstripe -c "$F") -gt 1 ] && echo "$F"
done | lfs_migrate
Since you are 100% out of inodes on the OSTs, you will need to delete
some old files or temporarily move some files (maybe a couple hundred)
out of Lustre to at least get some space to work with. Once lfs_migrate
starts moving files from 4 stripes to the default 1 stripe, you will
free up about 3/4 of all the inodes on the OSTs.
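[Editor's note: a rough illustration of that recovery, using the counts from earlier in the thread and assuming every file currently has 4 stripes: each migrated file drops from 4 OST objects to 1, freeing 3.]

```shell
#!/bin/bash
# Objects (and hence OST inodes) freed per OST if all 4-stripe files
# are restriped to 1 stripe (illustrative arithmetic only).
mdt_inodes_used=44093755
num_osts=23
echo $(( 3 * mdt_inodes_used / num_osts ))   # -> 5751359 objects freed per OST
```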
Note that lfs_migrate in 2.1 is NOT safe to run on files that are being
modified by other applications, but is fine to run on files only being
read.
You may want to test on a subset of your files first, to determine how long
this will take (i.e. specify some subdirectory for "lfs find"). You can
use "lfs getstripe {file}" to check the before/after striping on the file.
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division