Still trying to isolate this issue. Ran an lnet_selftest script that uses
lst brw write to send packets from a client to one of my two OSSs. Abysmal
write performance. Here's a snippet:
[LNet Rates of servers]
[R] Avg: 10983 RPC/s Min: 10983 RPC/s Max: 10983 RPC/s
[W] Avg: 10984 RPC/s Min: 10984 RPC/s Max: 10984 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 5492.14 MB/s Min: 5492.14 MB/s Max: 5492.14 MB/s
[W] Avg: 0.84 MB/s Min: 0.84 MB/s Max: 0.84 MB/s
[LNet Rates of servers]
[R] Avg: 11017 RPC/s Min: 11017 RPC/s Max: 11017 RPC/s
[W] Avg: 11017 RPC/s Min: 11017 RPC/s Max: 11017 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 5508.94 MB/s Min: 5508.94 MB/s Max: 5508.94 MB/s
[W] Avg: 0.84 MB/s Min: 0.84 MB/s Max: 0.84 MB/s
[LNet Rates of servers]
[R] Avg: 11010 RPC/s Min: 11010 RPC/s Max: 11010 RPC/s
[W] Avg: 11010 RPC/s Min: 11010 RPC/s Max: 11010 RPC/s
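For completeness, the lst session driving this was along these lines (a sketch from memory; the NIDs, timeout, and transfer size here are placeholders rather than the exact script):

```shell
# Minimal lnet_selftest session: bulk write from one client to one OSS.
# NIDs below are placeholders for this cluster.
export LST_SESSION=$$
lst new_session --timeout 100 brw_write
lst add_group clients 10.100.100.24@o2ib     # client NID (placeholder)
lst add_group servers 10.100.100.17@o2ib     # OSS NID (placeholder)
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat servers       # emits the [LNet Rates]/[Bandwidth] blocks above
sleep 30
lst stop bulk
lst end_session
```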
After test:
lctl --net o2ib conn_list
10.100.100.17@o2ib mtu 4096
10.100.100.18@o2ib mtu 4096
10.100.100.19@o2ib mtu 4096
10.100.100.24@o2ib mtu 4096
10.100.100.24@o2ib mtu 4096
But the IB MTU is 65520. Why are the path MTUs so small compared to the
IB MTU? If LNet is only sending 4k at a time over IB, or if IB writes are
delayed until multiple packets fill up the IB MTU, that could delay the
writes. Looking into how I can increase the path MTUs.
Thanks,
Michael
On Sat, Oct 12, 2013 at 5:49 PM, Michael Bloom <michael.bloom(a)trd2inc.com> wrote:
Should clarify what I just said. I could remount the OST with -t ldiskfs,
but the client wouldn't accept it.
# mount -t ldiskfs 10.100.100.17@o2ib:/lustrewt /mnt/lustre/
mount: unknown filesystem type 'ldiskfs'
# mount -t lustre 10.100.100.17@o2ib:/lustrewt /mnt/lustre
On Sat, Oct 12, 2013 at 3:49 AM, Dilger, Andreas <andreas.dilger(a)intel.com> wrote:
> On 2013-10-11, at 12:32, "Michael Bloom" <michael.bloom(a)trd2inc.com> wrote:
>
> I've modified my iozone testing for now to just "-i0", i.e.,
> write/rewrite. Here's the command:
> iozone -e -M -m -r 1m -s 1g -0 -+n -+A 2 -+u -C -t -P 0 -+d -F
> $TESTDIR/iozone-file_1GB.a_001.$$ 1>$LOG.10gb.write_00 &
>
> where $TESTDIR points to a directory in my Lustre FS whose striping
> config is as follows (the files are created new on each test run, so they
> inherit the striping config from the directory):
>
> lfs getstripe iozone-file_1GB.a_005.28844
> iozone-file_1GB.a_005.28844
> lmm_stripe_count: 2
> lmm_stripe_size: 1048576
> lmm_pattern: 1
> lmm_layout_gen: 0
> lmm_stripe_offset: 1
> obdidx objid objid group
> 1 617201 0x96af1 0
> 0 616584 0x96888 0
> The file size isn't quite the 1GB I was looking for either:
> -rw-r-----. 1 root root 471859200 Oct 11 01:33 iozone-file_1GB.a_005.28844
>
>
> Only reason we're running pre-release stuff is that's what we downloaded.
> We could upgrade but it didn't seem necessary to chase the releases. Of
> course, if there are bugs hampering our efforts then of course we'd
> upgrade. It just hasn't seemed necessary yet.
>
> What I meant was that you are running bleeding edge code, not that it was
> too old. 2.4.1 is the most recent maintenance release, though 2.5.0 (the
> next release) is a few weeks away.
>
> I ran some more experiments after I posted my question. When I saw that
> osc_cached_mb's associated busy_cnt was also pegged at some high number, I
> also noticed that max_dirty_mb was set to the 32MB default (as are most/all
> of our config). Thinking our write rate far exceeded the ability of 32MB
> to efficiently buffer the async writes, I bumped osc_cached_mb up to
> 2GB(!), and I no longer saw busy_cnt pegged
> either. However, the test's throughput was still in the 50MB/sec range.
>
> I would suspect either your OST is broken or the network. However, you
> wrote you got 5GB/s for reads. Was that from the OST, or might it have been
> from the client cache?
>
> You should unmount your broken OST and mount it with "-t ldiskfs" instead
> of "-t lustre" and then run iozone against it directly. If that works, you
> need to run LNET selftest on the network (this is described in the Lustre
> user manual) to see if there is a network problem.
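If I follow, that would look something like this, run on the OSS itself rather than from a client (sketch; /dev/sdb and the mount point are placeholders for whatever actually backs OST0000):

```shell
# On the OSS: take the OST out of service and test the backing device directly.
umount /mnt/ost0                        # wherever the OST is mounted now
mount -t ldiskfs /dev/sdb /mnt/ost0     # ldiskfs mount of the OST block device
iozone -e -i 0 -r 1m -s 1g -f /mnt/ost0/iozone.tmp
rm -f /mnt/ost0/iozone.tmp
umount /mnt/ost0
mount -t lustre /dev/sdb /mnt/ost0      # put the OST back into service
```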
>
> Then I noticed these in the single client's /var/log/messages:
> kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached())
> dirty 0 - dirty_max 2147483648
> too big???
>
> That is a direct result of your changing osc_cached_mb and max_dirty_mb
> to 2GB, which is per OST, and too high. We probably should fix that limit
> and message, since it isn't outrageously high anymore, but in the meantime
> you need to make it a bit smaller ...
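For the record, here's roughly how I'll dial the client back down (values are guesses rather than tuned recommendations; parameter names as I see them on 2.4):

```shell
# Per-OST dirty limit back under the grant ceiling; 2147483648 bytes is
# exactly 2048 MB, which is what trips the "> 2GB" check below.
lctl set_param osc.*.max_dirty_mb=256
# Client-wide Lustre page cache cap (llite layer)
lctl set_param llite.*.max_cached_mb=4096
```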
>
> Oct 11 01:21:07 hadoop23 kernel: LustreError:
> 28855:0:(osc_request.c:854:osc_announce_cached()) Skipped 21370 previous
> similar messages.
>
> I see that osc_announce_cached logs this message when there are too many
> dirty pages. Effectively, the underlying issue is still there: Either the
> OSS is out to lunch, not committing these writes back to the client, or
> the writes simply aren't being drained by the OSS.
>
> Lustre clients send write RPCs as soon as 1MB of dirty data, so it would
> only be stuck on the client if the OST isn't accepting it fast enough.
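Here's what I'm checking on the client to see where the RPCs pile up (sketch; parameter names per 2.4):

```shell
# Histogram of write RPC sizes and pages-per-RPC -- should be dominated by 1MB
lctl get_param osc.*.rpc_stats
# Current dirty bytes and allowed RPC concurrency per OSC
lctl get_param osc.*.cur_dirty_bytes osc.*.max_rpcs_in_flight
```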
>
> Saw this on the associated OSS:
> kernel: LustreError: 20316:0:(ofd_grant.c:607:ofd_grant())
> lustrewt-OST0000: client f4cc2e8f-022a-a927-e9d
> 1-b8540d7ad1a9/ffff880464abf800 requesting > 2GB grant 2147483648.
> I don't know why the grant is so big!
>
> Same problem. Keep the client tunables below 2GB.
>
> Is there a knob on the OSS that's equivalent to the client's
> osc_cached_mb? That is, is there a knob that defines the size of the OSS's
> buffer for receiving the client's traffic?
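Meanwhile, these are the OSS-side stats I've found to poke at so far (treat the parameter names as a sketch from the 2.4 docs):

```shell
# Per-OST I/O size and latency histograms -- fragmented brw sizes here would
# point at the RPC path rather than the disks.
lctl get_param obdfilter.*.brw_stats
# OSS I/O service thread cap and server-side write cache toggle
lctl get_param ost.OSS.ost_io.threads_max
lctl get_param obdfilter.*.writethrough_cache_enable
```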
>
> Thanks,
> Michael
>
>
> On Fri, Oct 11, 2013 at 2:00 AM, Dilger, Andreas <andreas.dilger@intel.com> wrote:
> On 2013-10-10, at 12:16, "Michael Bloom" <michael.bloom(a)trd2inc.com> wrote:
>
> > I'm looking for some help to understand why write throughput in my
> > Lustre IB cluster is only about 50-80 MB/sec, while read performance is
> > 5-6 GB/sec.
>
> We've gotten multi-GB/s with 2.4, so 50 MB/s is definitely not expected.
>
> It isn't really possible to make any sensible guesses about your
> performance problem without knowing what kind of writes you are doing.
>
> It is also possible that your underlying storage is having problems. Did
> you try mounting it locally and running iozone directly?
>
> > I'm running 2.4.1-RC2-PRISTINE on my MDT and 2 OSSs. Also using
> > 2.4.92 on my 16 clients, 3 of which run iozone write throughput tests.
>
> Is there any reason to be running the pre-release code on the clients?
> Testing is great, and the 2.5 client should have some improved performance
> over 2.4, but that isn't necessarily production ready yet.
>
> Cheers, Andreas
>
> > I noticed a few threads in the 2.4.1 RC1 and RC2 timeframe discussing
> > low write performance. I noticed that curr_dirty_bytes starts off at 0
> > at the start of the test, as one would expect. As the test proceeds, one
> > OSS's curr_dirty_bytes stays pegged at some huge number, implying it
> > didn't see a commit. The other OSS's curr_dirty_bytes varies during the
> > test as iozone writes data that gets committed. What can I look at to
> > see why the commit isn't happening?
> >
> > Thanks in advance,
> > Michael
> > _______________________________________________
> > HPDD-discuss mailing list
> > HPDD-discuss@lists.01.org
> > https://lists.01.org/mailman/listinfo/hpdd-discuss
>
>