Thanks, Andreas.
I've modified my iozone testing for now to just "-i0", i.e., write/rewrite. Here's the command:
iozone -e -M -m -r 1m -s 1g -i 0 -+n -+A 2 -+u -C -t -P 0 -+d -F $TESTDIR/iozone-file_1GB.a_001.$$ 1>$LOG.10gb.write_00 &
where $TESTDIR points to a directory in my Lustre FS with the following striping config (the files are created anew on each test run, so they inherit the striping config from the directory):
lfs getstripe iozone-file_1GB.a_005.28844
iozone-file_1GB.a_005.28844
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 1
	obdidx	 objid	  objid	group
	     1	617201	0x96af1	    0
	     0	616584	0x96888	    0
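For reference, a layout like the one above (stripe count 2, 1 MiB stripe size) is the kind of thing set on the directory with lfs setstripe; a minimal sketch, assuming $TESTDIR is the test directory (the commands are illustrative, not taken from the original post):

```shell
# Set a default layout on the directory: 2 stripes, 1 MiB stripe size.
# New files created inside will inherit this layout.
lfs setstripe -c 2 -S 1m "$TESTDIR"

# Show the directory's default layout that new files will inherit.
lfs getstripe -d "$TESTDIR"
```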
The file size isn't quite the 1GB I was looking for either:
-rw-r-----. 1 root root 471859200 Oct 11 01:33 iozone-file_1GB.a_005.28844
The only reason we're running pre-release software is that it's what we downloaded. We could upgrade, but it didn't seem necessary to chase releases. Of course, if there are bugs hampering our efforts we'd upgrade; it just hasn't seemed necessary yet.
I ran some more experiments after I posted my question. When I saw that osc_cached_mb's associated busy_cnt was pegged at a high number, I also noticed that max_dirty_mb was set to the 32MB default (as are most/all of our settings). Thinking our write rate far exceeded the ability of 32MB to efficiently buffer the async writes, I bumped it up to 2GB(!), after which I no longer saw busy_cnt pegged. However, the test's throughput was still in the 50MB/sec range. Then I noticed these in the single client's /var/log/messages:
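For anyone following along, the client-side tunables involved can be read and adjusted with lctl; a sketch, assuming the standard osc parameter names (the 2048 value mirrors my experiment, not a recommendation):

```shell
# Per-OSC cached-page stats, including the busy_cnt mentioned above
lctl get_param osc.*.osc_cached_mb

# Per-OSC dirty-write limit (defaults to 32 MB) and current dirty bytes
lctl get_param osc.*.max_dirty_mb
lctl get_param osc.*.cur_dirty_bytes

# Raise the dirty-write limit (value in MB), as in the experiment above
lctl set_param osc.*.max_dirty_mb=2048
```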
kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached()) dirty 0 - dirty_max 2147483648 too big???
Oct 11 01:21:07 hadoop23 kernel: LustreError: 28855:0:(osc_request.c:854:osc_announce_cached()) Skipped 21370 previous similar messages.
I see that osc_announce_cached() logs this message when there are too many dirty pages. Effectively, the underlying issue is still there: either the OSS is out to lunch and not acknowledging these writes back to the client, or the dirty pages simply aren't being drained to the OSS fast enough.
Saw this on the associated OSS:
kernel: LustreError: 20316:0:(ofd_grant.c:607:ofd_grant()) lustrewt-OST0000: client f4cc2e8f-022a-a927-e9d1-b8540d7ad1a9/ffff880464abf800 requesting > 2GB grant 2147483648
I don't know why the grant is so big!
Is there a knob on the OSS that's equivalent to the client's osc_cached_mb? That is, is there a knob that defines the size of the OSS's buffer for receiving the client's traffic?
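Not an answer, but while chasing this down it helped me that both sides expose grant counters; a hedged sketch, assuming the standard parameter names (OST names will differ per setup):

```shell
# Client side: how much grant each OSC currently holds from its OST
lctl get_param osc.*.cur_grant_bytes

# OSS side: totals the OST has granted out, plus pending and dirty amounts
lctl get_param obdfilter.*.tot_granted
lctl get_param obdfilter.*.tot_pending
lctl get_param obdfilter.*.tot_dirty
```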
Thanks,
Michael