On 2013-10-11, at 12:32, "Michael Bloom"
<michael.bloom@trd2inc.com> wrote:
I've modified my iozone testing for now to just "-i0", i.e., write/rewrite.
Here's the command:
iozone -e -M -m -r 1m -s 1g -i 0 -+n -+A 2 -+u -C -t -P 0 -+d -F
$TESTDIR/iozone-file_1GB.a_001.$$ 1>$LOG.10gb.write_00 &
where $TESTDIR points to a directory in my Lustre FS whose striping config is as follows (the
files are created new on each test run, so they inherit the striping config from the directory):
lfs getstripe iozone-file_1GB.a_005.28844
iozone-file_1GB.a_005.28844
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 617201 0x96af1 0
0 616584 0x96888 0
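For what it's worth, the two fields in that layout that matter for write throughput can be pulled out mechanically; a minimal sketch (the heredoc-style variable just replays the lmm_* values from the getstripe output above):

```shell
# Replay the lmm_* values from the getstripe output above and extract
# the stripe count and stripe size, the two knobs that govern how the
# file's writes are spread across OSTs.
layout='lmm_stripe_count:  2
lmm_stripe_size:   1048576'
count=$(printf '%s\n' "$layout" | awk '/lmm_stripe_count/ {print $2}')
size=$(printf '%s\n' "$layout" | awk '/lmm_stripe_size/ {print $2}')
# 1048576 bytes = 1 MiB per stripe, interleaved across 2 OSTs
echo "stripes=$count stripe_size_mib=$((size / 1048576))"
```

A layout like this would presumably have been set on the directory with something along the lines of `lfs setstripe -c 2 -S 1M $TESTDIR` (hypothetical invocation; check `lfs setstripe --help` on your version).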
The file size isn't quite the 1GB I was looking for either:
-rw-r-----. 1 root root 471859200 Oct 11 01:33 iozone-file_1GB.a_005.28844
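Indeed, the size in that ls -l line works out well short of 1 GiB; a quick sanity check (nothing Lustre-specific, just arithmetic):

```shell
# 471859200 bytes (the ls -l size above) expressed in MiB, and how far
# short it falls of the 1 GiB (1073741824 bytes) a "-s 1g" run implies.
bytes=471859200
echo "size_mib=$((bytes / 1048576))"                     # exactly 450 MiB
echo "missing_mib=$(((1073741824 - bytes) / 1048576))"   # 574 MiB short
```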
The only reason we're running pre-release code is that's what we downloaded. We
could upgrade, but it didn't seem necessary to chase releases. Of course, if there
are bugs hampering our efforts we'd upgrade; it just hasn't seemed necessary yet.
What I meant was that you are running bleeding-edge code, not that it was too old. 2.4.1
is the most recent maintenance release, though 2.5.0 (the next release) is a few weeks
away.
I ran some more experiments after I posted my question. When I saw that
osc_cached_mb's associated busy_cnt was also pegged at some high number, I also
noticed that max_dirty_mb was set to the 32MB default (as is most, if not all, of our
config). Thinking our write rate far exceeded what 32MB could efficiently buffer for
the async writes, I bumped osc_cached_mb up to 2GB(!), and busy_cnt was no longer
pegged. However, the test's throughput was still in the 50MB/sec range.
I would suspect either your OST is broken or the network. However, you wrote you got 5GB/s
for reads. Was that from the OST, or might it have been from the client cache?
You should unmount your broken OST and mount it with "-t ldiskfs" instead of
"-t lustre" and then run iozone against it directly. If that works, you need to
run LNET selftest on the network (this is described in the Lustre user manual) to see if
there is a network problem.
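A sketch of that local test, assuming the OST sits on /dev/sdb and is mounted at /mnt/ost0 (both names hypothetical; substitute your own device and mount point):

```shell
# On the OSS: take the OST out of service and exercise the backing
# storage directly, bypassing Lustre entirely.
umount /mnt/ost0                      # stop serving the OST
mount -t ldiskfs /dev/sdb /mnt/ost0   # remount as plain ldiskfs

# Same write-only iozone workload as on the client, but local:
iozone -e -i 0 -r 1m -s 1g -f /mnt/ost0/iozone-local-test
rm /mnt/ost0/iozone-local-test

umount /mnt/ost0
mount -t lustre /dev/sdb /mnt/ost0    # put the OST back into service
```

If the local run gets disk-rate throughput, the disk is fine and the next suspect is LNET, which is where the selftest comes in.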
Then I noticed these in the single client's /var/log/messages:
kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached()) dirty 0 - dirty_max
2147483648
too big???
That is a direct result of your changing osc_cached_mb and max_dirty_mb to 2GB, which is
per OST, and too high. We probably should fix that limit and message, since it isn't
outrageously high anymore, but in the meantime you need to make it a bit smaller ...
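For example, to bring the tunable back under the limit on every OSC at once (256 is an arbitrary illustration here, not a recommended value):

```shell
# On each client: cap the per-OST dirty-data limit at something well
# below the 2GB boundary the error message is complaining about.
lctl set_param osc.*.max_dirty_mb=256

# Verify the new setting took effect on all OSCs:
lctl get_param osc.*.max_dirty_mb
```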
Oct 11 01:21:07 hadoop23 kernel: LustreError:
28855:0:(osc_request.c:854:osc_announce_cached()) Skipped 21370 previous similar
messages.
I see that osc_announce_cached logs this message when there are too many dirty pages.
Effectively, the underlying issue is still there: either the OSS is out to lunch and not
committing these writes back to the client, or the dirty pages aren't being drained to the OSS.
Lustre clients send write RPCs as soon as 1MB of dirty data accumulates, so it would only be
stuck on the client if the OST isn't accepting the writes fast enough.
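One way to see which side is stuck is to watch the client-side OSC counters while the test runs (parameter names as in 2.4; exact paths may differ slightly between versions):

```shell
# How much dirty data each OSC is holding, and how much write grant the
# OST has extended to this client. If cur_dirty_bytes sits at its cap
# while cur_grant_bytes stays flat, the OST is not draining the writes.
lctl get_param osc.*.cur_dirty_bytes osc.*.cur_grant_bytes

# Distribution of write RPC sizes; a healthy client should be sending
# mostly full 1MB (256-page) RPCs.
lctl get_param osc.*.rpc_stats
```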
Saw this on the associated OSS:
kernel: LustreError: 20316:0:(ofd_grant.c:607:ofd_grant()) lustrewt-OST0000: client
f4cc2e8f-022a-a927-e9d
1-b8540d7ad1a9/ffff880464abf800 requesting > 2GB grant
2147483648.
I don't know why the grant is so big!
Same problem. Keep the client tunables below 2GB.
Is there a knob on the OSS that's equivalent to the client's osc_cached_mb? That
is, is there a knob that defines the size of the OSS's buffer for receiving the
client's traffic?
Thanks,
Michael
On Fri, Oct 11, 2013 at 2:00 AM, Dilger, Andreas
<andreas.dilger@intel.com> wrote:
On 2013-10-10, at 12:16, "Michael Bloom"
<michael.bloom@trd2inc.com> wrote:
I'm looking for some help to understand why write throughput in
my Lustre IB cluster is only about 50-80 MB/sec, while read performance is 5-6 GB/sec.
We've gotten multi-GB/s with 2.4, so 50 MB/s is definitely not expected.
It isn't really possible to make any sensible guesses about your performance problem
without knowing what kind of writes you are doing.
It is also possible that your underlying storage is having problems. Did you try mounting
it locally and running iozone directly?
I'm running 2.4.1-RC2-PRISTINE on my MDT and 2 OSS's. Also
using 2.4.92 on my 16 clients, 3 of which run iozone write throughput tests.
Is there any reason to be running the pre-release code on the clients? Testing is great,
and the 2.5 client should have some improved performance over 2.4, but that isn't
necessarily production ready yet.
Cheers, Andreas
I noticed a few threads in the 2.4.1 RC1 and RC2 timeframe discussing
low write performance. I noticed that curr_dirty_bytes starts off at 0 at the beginning of the
test, as one would expect. As the test proceeds, one OSS's curr_dirty_bytes stays
pegged at some huge number, implying it didn't see a commit. The other OSS's
curr_dirty_bytes varies during the test as iozone writes data that gets committed. What
can I look at to see why the commit isn't happening?
Thanks in advance,
Michael
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss