On 2013-10-11, at 12:32, "Michael Bloom" <michael.bloom@trd2inc.com> wrote:

What I meant was that you are running bleeding-edge code, not that it is too old. 2.4.1 is the most recent maintenance release, though 2.5.0 (the next maintenance release) is a few weeks away.
I've modified my iozone testing for now to just "-i0", i.e., write/rewrite. Here's the command:
iozone -e -M -m -r 1m -s 1g -0 -+n -+A 2 -+u -C -t -P 0 -+d -F $TESTDIR/iozone-file_1GB.a_001.$$ 1>$LOG.10gb.write_00 &
where $TESTDIR points to a directory in my Lustre FS whose striping config is as follows (the files are created anew on each test run, so they inherit the striping config from the directory):
lfs getstripe iozone-file_1GB.a_005.28844
iozone-file_1GB.a_005.28844
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 1
	obdidx		objid		objid		group
	     1		617201		0x96af1		    0
	     0		616584		0x96888		    0
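For reference, a layout like the one above can be applied to the test directory with lfs setstripe, so new iozone files inherit it (a sketch; $TESTDIR is the test directory from the iozone command above):

```shell
# Stripe new files in this directory across 2 OSTs with a 1 MiB stripe
# size, matching the lmm_stripe_count / lmm_stripe_size shown above.
lfs setstripe -c 2 -S 1m $TESTDIR
# Files created afterwards inherit this layout:
touch $TESTDIR/newfile
lfs getstripe $TESTDIR/newfile
```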
The file size isn't quite the 1GB I was looking for either:
-rw-r-----. 1 root root 471859200 Oct 11 01:33 iozone-file_1GB.a_005.28844
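For what it's worth, converting that size confirms the shortfall:

```shell
# 471859200 bytes / 1024 / 1024 = 450 MiB, i.e. the file is 450 MB,
# not the 1 GiB that "-s 1g" asked iozone for.
echo $((471859200 / 1024 / 1024))
# → 450
```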
Only reason we're running pre-release stuff is that's what we downloaded. We could upgrade, but it didn't seem necessary to chase the releases. Of course, if there are bugs hampering our efforts we'd upgrade; it just hasn't seemed necessary yet.
I would suspect either your OST is broken or the network. However, you wrote you got 5GB/s for reads. Was that from the OST, or might it have been from the client cache?
I ran some more experiments after I posted my question. When I saw that osc_cached_mb's associated busy_cnt was also pegged at some high number, I also noticed that max_dirty_mb was set to the 32MB default (as are most/all of our settings). Thinking our write rate far exceeded the ability of 32MB to efficiently buffer the async writes, I bumped osc_cached_mb up to 2GB(!), after which I no longer saw busy_cnt pegged. However, the test's throughput was still in the 50MB/sec range.
You should unmount your broken OST and mount it with "-t ldiskfs" instead of "-t lustre" and then run iozone against it directly. If that works, you need to run LNET selftest on the network (this is described in the Lustre user manual) to see if there is a network problem.
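As a sketch, those two checks might look like this (device, mount point, and NIDs are all placeholders for your environment; run the first part on the OSS with the OST out of Lustre service):

```shell
# On the OSS: take the OST out of service and mount its backing
# filesystem directly, bypassing Lustre entirely.
umount /mnt/ost0                        # mount point is illustrative
mount -t ldiskfs /dev/sdX /mnt/ost0     # /dev/sdX = the OST block device
iozone -e -i 0 -r 1m -s 1g -f /mnt/ost0/iozone.tmp

# If the local write rate is fine, test the network with LNET selftest
# (see the LNET self-test chapter of the Lustre manual):
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session write_test
lst add_group clients 192.168.1.10@o2ib    # client NID, illustrative
lst add_group servers 192.168.1.20@o2ib    # OSS NID, illustrative
lst add_batch bulk_write
lst add_test --batch bulk_write --from clients --to servers \
    brw write size=1M
lst run bulk_write
lst stat clients servers    # watch the bulk write bandwidth
lst end_session
```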
Then I noticed these in the single client's /var/log/messages:

kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached()) dirty 0 - dirty_max 2147483648

too big???
That is a direct result of your changing osc_cached_mb and max_dirty_mb to 2GB, which is per OST, and too high. We probably should fix that limit and message, since it isn't outrageously high anymore, but in the meantime you need to make it a bit smaller ...
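A sketch of dialing those back (the 256 MB and 4096 MB values are just illustrative choices well under the 2 GB limit, and I'm assuming the tunables mentioned above correspond to these lctl parameter names on a 2.4 client):

```shell
# Per-OSC cap on dirty (unwritten) client data; keep it well under 2 GB.
lctl set_param osc.*.max_dirty_mb=256
# Client-side page cache limit lives under llite, per filesystem:
lctl set_param llite.*.max_cached_mb=4096
# Verify what took effect:
lctl get_param osc.*.max_dirty_mb llite.*.max_cached_mb
```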
Lustre clients send write RPCs as soon as 1MB of dirty data, so it would only be stuck on the client if the OST isn't accepting it fast enough.
Oct 11 01:21:07 hadoop23 kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached()) Skipped 21370 previous similar messages.
I see that osc_announce_cached logs this message when there are too many dirty pages. Effectively, the underlying issue is still there: either the OSS is out to lunch and not committing these writes back to the client, or the client's dirty pages simply aren't being drained by the OSS.
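One way to see which it is (a sketch, assuming the standard osc counters on a 2.4 client; the email's curr_dirty_bytes appears as cur_dirty_bytes here) is to watch the per-OSC dirty and grant counters while the test runs:

```shell
# If cur_dirty_bytes stays pegged on one OSC while the test runs, that
# OST is not draining the client's cache; cur_grant_bytes shows how much
# write grant the OST has extended to this client.
watch -n 2 'lctl get_param osc.*.cur_dirty_bytes osc.*.cur_grant_bytes'
```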
Saw this on the associated OSS:

kernel: LustreError: 20316:0:(ofd_grant.c:607:ofd_grant()) lustrewt-OST0000: client f4cc2e8f-022a-a927-e9d1-b8540d7ad1a9/ffff880464abf800 requesting > 2GB grant 2147483648.

I don't know why the grant is so big!

Same problem. Keep the client tunables below 2GB.
Is there a knob on the OSS that's equivalent to the client's osc_cached_mb? That is, is there a knob that defines the size of the OSS's buffer for receiving the client's traffic?
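I haven't found a single equivalent knob, but for poking at the server side, the OSS does export per-OST grant and thread counters (a sketch; parameter names assumed from 2.x obdfilter and may vary between versions):

```shell
# On the OSS: how much write grant the server has promised to clients,
# and how much dirty/pending client data it still expects to receive.
lctl get_param obdfilter.*.tot_granted
lctl get_param obdfilter.*.tot_dirty
lctl get_param obdfilter.*.tot_pending
# Number of OST I/O service threads handling incoming bulk writes:
lctl get_param ost.OSS.ost_io.threads_started
```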
Thanks,
Michael
On Fri, Oct 11, 2013 at 2:00 AM, Dilger, Andreas <andreas.dilger@intel.com> wrote:
On 2013-10-10, at 12:16, "Michael Bloom" <michael.bloom@trd2inc.com> wrote:
> I'm looking for some help to understand why write throughput in my Lustre IB cluster is only about 50-80 MB/sec, while read performance is 5-6 GB/sec.
We've gotten multi-GB/s with 2.4, so 50 MB/s is definitely not expected.
It isn't really possible to make any sensible guesses about your performance problem without knowing what kind of writes you are doing.
It is also possible that your underlying storage is having problems. Did you try mounting it locally and running iozone directly?
> I'm running 2.4.1-RC2-PRISTINE on my MDT and 2 OSS's. Also using 2.4.92 on my 16 clients, 3 of which run iozone write throughput tests.
Is there any reason to be running the pre-release code on the clients? Testing is great, and the 2.5 client should have some improved performance over 2.4, but that isn't necessarily production ready yet.
Cheers, Andreas
> I noticed a few threads in the 2.4.1 RC1 and RC2 timeframe discussing low write performance. I noticed that curr_dirty_bytes starts off at 0 at the start of the test, as one would expect. As the test proceeds, one OSS's curr_dirty_bytes stays pegged at some huge number, implying it didn't see a commit. The other OSS's curr_dirty_bytes varies during the test as iozone writes data that gets committed. What can I look at to see why the commit isn't happening?
>
> Thanks in advance,
> Michael
> _______________________________________________
> HPDD-discuss mailing list
> https://lists.01.org/mailman/listinfo/hpdd-discuss