Thanks, Andreas. If it's as simple as unmounting the OST and just
remounting it as type ldiskfs, that didn't work. As root, and referencing
the software RAID device /dev/md127, here's what I got:
mount -t ldiskfs /dev/md127 /ostoss_mount/
mount: wrong fs type, bad option, bad superblock on /dev/md127,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
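For what it's worth, that mount error usually means either the ldiskfs kernel module isn't loaded or the device genuinely isn't ldiskfs. A rough sketch of the checks I'd run first (assuming root on the OSS; the device path /dev/md127 is from above, and mounting read-only first is just a precaution on a suspect OST):

```
# Is the ldiskfs module loaded? Try loading it if not.
lsmod | grep ldiskfs || modprobe ldiskfs

# See what the kernel actually complained about
dmesg | tail -20

# Retry the mount, read-only first
mount -t ldiskfs -o ro /dev/md127 /ostoss_mount/
```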
Thanks,
Mike
On Sat, Oct 12, 2013 at 3:49 AM, Dilger, Andreas
<andreas.dilger(a)intel.com> wrote:
On 2013-10-11, at 12:32, "Michael Bloom"
<michael.bloom(a)trd2inc.com> wrote:
I've modified my iozone testing for now to just "-i0", i.e.,
write/rewrite. Here's the command:
iozone -e -M -m -r 1m -s 1g -0 -+n -+A 2 -+u -C -t -P 0 -+d -F
$TESTDIR/iozone-file_1GB.a_001.$$ 1>$LOG.10gb.write_00 &
where $TESTDIR points to a directory in my Lustre FS whose striping config
is shown below (the files are created fresh on each test run, so they
inherit the striping config from the directory):
lfs getstripe iozone-file_1GB.a_005.28844
iozone-file_1GB.a_005.28844
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 617201 0x96af1 0
0 616584 0x96888 0
The file size isn't quite the 1GB I was looking for either:
-rw-r-----. 1 root root 471859200 Oct 11 01:33 iozone-file_1GB.a_005.28844
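As a back-of-the-envelope check on that size (byte counts taken from the ls and lfs getstripe output above):

```shell
# The file is exactly 450 MiB, not the 1 GiB requested with -s 1g
echo $((471859200 / 1024 / 1024))    # 450 (MiB)
echo $((1024 * 1024 * 1024))         # 1073741824 bytes = the expected 1 GiB

# With lmm_stripe_count=2 and lmm_stripe_size=1 MiB, a 450 MiB file
# puts about 225 one-MiB stripes on each of the two OSTs
echo $((471859200 / 1048576 / 2))    # 225
```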
The only reason we're running pre-release stuff is that's what we
downloaded. We could upgrade, but it didn't seem necessary to chase the
releases. Of course, if there are bugs hampering our efforts, we'd
upgrade. It just hasn't seemed necessary yet.
What I meant was that you are running bleeding-edge code, not that it was
too old. 2.4.1 is the most recent maintenance release, though 2.5.0 (the
next maintenance release) is a few weeks away.
I ran some more experiments after I posted my question. When I saw that
osc_cached_mb's associated busy_cnt was also pegged at some high number, I
noticed that max_dirty_mb was set to the 32MB default (as is most/all of
our config). Thinking our write rate far exceeded what a 32MB buffer could
absorb for async writes, I bumped osc_cached_mb up to 2GB(!), and I no
longer saw busy_cnt pegged either. However, the test's throughput was
still in the 50MB/sec range.
I would suspect either your OST or the network is broken. However, you
wrote that you got 5GB/s for reads. Was that from the OST, or might it
have been from the client cache?
You should unmount your broken OST and mount it with "-t ldiskfs" instead
of "-t lustre" and then run iozone against it directly. If that works, you
need to run LNET selftest on the network (this is described in the Lustre
user manual) to see if there is a network problem.
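The lnet_selftest workflow from the manual looks roughly like the following. This is a sketch only: the NIDs 10.0.0.2@o2ib and 10.0.0.5@o2ib are placeholders for your client and OSS, and the exact lst options should be double-checked against the manual for your version:

```
# placeholder NIDs -- substitute your own client and OSS NIDs
export LST_SESSION=$$
lst new_session rw_test
lst add_group clients 10.0.0.2@o2ib
lst add_group servers 10.0.0.5@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers    # watch the bandwidth numbers here
lst end_session
```

If the bulk write bandwidth here is also stuck around 50MB/s, the problem is in the network rather than the OST back end.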
Then I noticed these in the single client's /var/log/messages:
kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached())
dirty 0 - dirty_max 2147483648
too big???
That is a direct result of your changing osc_cached_mb and max_dirty_mb to
2GB, which is per OST, and too high. We probably should fix that limit and
message, since it isn't outrageously high anymore, but in the meantime you
need to make it a bit smaller ...
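For reference, the dirty_max value in that log line is exactly the 2GB setting expressed in bytes, which is what trips the > 2GB check (the lctl parameter name in the comment is a suggestion to verify against your version, not something run here):

```shell
# 2048 MB in bytes -- matches dirty_max 2147483648 in the LustreError line
echo $((2048 * 1024 * 1024))   # 2147483648

# Halving it stays safely under the limit
echo $((1024 * 1024 * 1024))   # 1073741824
# e.g. something like: lctl set_param osc.*.max_dirty_mb=1024
```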
Oct 11 01:21:07 hadoop23 kernel: LustreError:
28855:0:(osc_request.c:854:osc_announce_cached()) Skipped 21370 previous
similar messages.
I see that osc_announce_cached logs this message when there are too many
dirty pages. Effectively, the underlying issue is still there: either the
OSS is out to lunch and not committing these writes, or the client's dirty
pages aren't being drained by the OSS.
Lustre clients send write RPCs as soon as 1MB of dirty data, so it would
only be stuck on the client if the OST isn't accepting it fast enough.
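To put rough numbers on that, using the 1MB RPC size mentioned above and the stripe_count of 2 from the earlier lfs getstripe output:

```shell
# A full 1 GiB file at 1 MiB per write RPC
echo $(( (1024*1024*1024) / (1024*1024) ))   # 1024 write RPCs total

# Split across stripe_count=2, each OST sees about half of them
echo $(( 1024 / 2 ))                         # 512 RPCs per OST
```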
Saw this on the associated OSS:
kernel: LustreError: 20316:0:(ofd_grant.c:607:ofd_grant())
lustrewt-OST0000: client f4cc2e8f-022a-a927-e9d1-b8540d7ad1a9/ffff880464abf800
requesting > 2GB grant 2147483648.
I don't know why the grant is so big!
Same problem. Keep the client tunables below 2GB.
Is there a knob on the OSS equivalent to the client's osc_cached_mb, i.e.,
one that defines the size of the OSS's buffer for receiving the client's
traffic?
Thanks,
Michael
On Fri, Oct 11, 2013 at 2:00 AM, Dilger, Andreas
<andreas.dilger(a)intel.com> wrote:
On 2013-10-10, at 12:16, "Michael Bloom"
<michael.bloom(a)trd2inc.com> wrote:
> I'm looking for some help to understand why write throughput in my
Lustre IB cluster is only about 50-80 MB/sec, while read performance is 5-6
GB/sec.
We've gotten multi-GB/s with 2.4, so 50 MB/s is definitely not expected.
It isn't really possible to make any sensible guesses about your
performance problem without knowing what kind of writes you are doing.
It is also possible that your underlying storage is having problems. Did
you try mounting it locally and running iozone directly?
> I'm running 2.4.1-RC2-PRISTINE on my MDT and 2 OSS's. Also using 2.4.92
on my 16 clients, 3 of which run iozone write throughput tests.
Is there any reason to be running the pre-release code on the clients?
Testing is great, and the 2.5 client should have some improved performance
over 2.4, but that isn't necessarily production ready yet.
Cheers, Andreas
> I noticed a few threads in the 2.4.1 RC1 and RC2 timeframe discussing
low write performance. I noticed that curr_dirty_bytes starts off at 0 at
the start of the test as one would expect. As the test proceeds, one OSS's
curr_dirty_bytes stays pegged at some huge number, implying it didn't see a
commit. The other OSS's curr_dirty_bytes varies during the test as iozone
writes data that gets committed. What can I look at to see why the commit
isn't happening?
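> (If it helps to watch this live, a minimal polling sketch, assuming lctl
> is in PATH and that the parameter name matches your version -- the
> cur_dirty_bytes name here is an assumption based on the stats quoted
> above:
>
> ```
> # poll the per-OSC dirty-byte counters every 5 seconds during the test
> while true; do
>     lctl get_param osc.*.cur_dirty_bytes
>     sleep 5
> done
> ```
>
> A counter that climbs and never falls on one OSS points at that OSS not
> draining writes.)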
>
> Thanks in advance,
> Michael
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss@lists.01.org
> https://lists.01.org/mailman/listinfo/hpdd-discuss