John,
A couple of questions:
1) Is the client dual-socket?
2) Is /proc/sys/vm/zone_reclaim_mode set to "1" on the client? If so, does the
client behavior change when this is set to "0"?
It might be a long shot, but I am wondering if there are NUMA issues involved that are
causing the memory to be unevenly used (and hence some data gets pushed out of cache
sooner than others because the OS is trying to use memory on a specific socket).
You could also try experimenting with the amount of data that the Lustre client
will cache. You can cap this by running
"lctl set_param llite.*.max_cached_mb=X". By default, I believe it will limit
itself to 3/4 of the RAM, so I wouldn't expect your Lustre client to cache more
than 48 GB at any given time.
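
For reference, the two checks suggested above can be sketched in a few lines of
Python. This is only an illustration and not part of the original exchange: the
helper names `read_zone_reclaim_mode` and `default_cache_cap_gb` are made up
here, and the 3/4-of-RAM default is the figure quoted above as a belief, not a
verified constant.

```python
from pathlib import Path


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the integer value of zone_reclaim_mode, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def default_cache_cap_gb(ram_gb, fraction=0.75):
    """Cache cap in GB, assuming the default is `fraction` of RAM (unverified)."""
    return ram_gb * fraction


if __name__ == "__main__":
    print("zone_reclaim_mode:", read_zone_reclaim_mode())
    print("assumed cache cap for 64 GB of RAM:", default_cache_cap_gb(64), "GB")
```

On the 64 GB client discussed here, the second helper gives 48 GB, which
matches the figure above.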
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
Patrick,
I forgot to mention in the initial email that iozone is single process, single thread.
I don't think OSS caching is part of the issue. Perhaps I misunderstand how
Lustre works, but as I understand it, all of the I/O that gets to the OSS must
go through the OSCs. We just aren't seeing the I/O come out of the OSCs,
because the data is found in the buffer cache.
I'll also point out that the total data read over LNET matches the sum of the
data read by the OSCs. I didn't include that plot as I had too many already.
Thanks again to you and Andreas on the /proc/sys/lnet stuff.
John
On Jan 19, 2015, at 9:59 AM, Patrick Farrell <paf(a)cray.com> wrote:
> John,
>
> This is a bit of a tangent to your questions (which make for very
> interesting reading), but is IOZone single threaded, and have you considered
> OSS caching as well?
>
> Cray observed a situation a bit like this, but with OSS caching, not client
> caching. (I'm omitting any explanation of how we ruled out client caching
> here; you'll have to trust me that this effect depended on OSS memory.)
>
> The case was a multi-threaded write, then read, of data sets slightly larger
> than available memory (i.e., data written/read per OSS slightly exceeded
> memory per OSS). We'd expect something like a full LRU 'end around', where
> all the data is evicted from cache without being used.
>
> We used some debugging code to track the progress of each thread within the
> file it was reading. In fact, what we saw was that, due to apparent raciness
> on the client, some threads would get slightly ahead of others, then
> suddenly leap ahead, dramatically pulling away from the other threads and
> finishing quickly. It appeared as though those threads got far enough ahead
> that they reached the portion of their data still in cache on the OSS (since
> as they read in data, they evicted data from *all* the files/threads, not
> just their own).
>
> The last few threads, which never 'jumped', appeared, by their progress
> rate, to be forced to read everything from disk.
>
> That may or may not be relevant here.
>
> As an interesting aside, we noticed that the large performance improvements
> in 2.6 significantly reduced the raciness of progress between multiple
> threads on the same node, which, bizarrely, reduced performance in the
> pathological case described above (write-then-read of slightly more than
> available memory). The progress of the threads was much more orderly and
> even between the different threads, which resulted in something closer to
> the full 'end-around' cache flush on the read operations. (In general, 2.6
> is much faster outside of this pathological case.)
>
> - Patrick
>
> On 01/19/2015 07:59 AM, John Bauer wrote:
>> Andreas
>>
>> Thank you for the reply. This investigation started with the observation
>> of slow backwards reads of a file by an MSC.Nastran run doing a Lanczos
>> eigenvalue solve (see image below). I point that out so it is known that I
>> am not investigating an academic run of iozone. It is far simpler to work
>> with iozone than MSC.Nastran.
>>
>> If you care to read a bit more to see the observed behavior of Lustre,
>> please read on.
>>
>> The following image depicts the access of the file over time by the iozone
>> run. What is quite odd is that when the second backward read of the file
>> begins, the reading of the file is at its fastest (steep slope) during this
>> backwards read. This is at a time when all of the end of the file should
>> have been LRU'd out of the system buffers by the previous backwards read.
>> The rate then slows down through the meat of the file and then starts
>> getting faster again toward the end of the second backwards read.
>>
>> I have run this job many times and the behavior, as depicted in the first
>> image, is always the same. The slopes vary some, but there is always this
>> serpentine look to it. It is not the same OSTs every time. If I run this
>> with iozone using 256K requests, the slopes for the backwards reads get
>> much lower. To me, it seems as though something is wrong with the LRU
>> mechanism. Note in the last image, when iozone is using 256K requests, that
>> this behavior starts during the forward reads of the file. It is not just a
>> backward read phenomenon. It happens every time when reading backwards, but
>> only occasionally during the forward reads.
>>
>> John
>>
>> [six image attachments: <mime-attachment.png>]
>>
>> On 1/18/2015 10:29 PM, Dilger, Andreas wrote:
>>> On Jan 18, 2015, at 17:19, John Bauer <bauerj@iodoctors.com> wrote:
>>>
>>> I have been observing what I would think is unexpected behavior. I will
>>> try to keep this short, and start with the question.
>>>
>>> Should it be expected, when sequentially reading a striped file multiple
>>> times, that the data from some OSTs remains in the system cache while
>>> data from others does not?
>>>
>>> This isn't something that I'm aware of myself, nor something I'd
>>> necessarily expect. That said, this isn't actually a bad thing.
>>>
>>> File is 80GB in size.
>>> System has 64GB of memory.
>>> File is striped 16 way, 1MB stripe size. Application is iozone.
>>> File is written forwards twice, then read forwards twice, then read
>>> backwards twice.
>>>
>>> There is 80GB / 16 stripes = 5GB of data per stripe. If the pages were
>>> handled in strict LRU order, then one would expect the two forward reads
>>> to blow out the cache, and result in 10GB of data read per stripe. Then,
>>> the first backwards read would access most of the data from cache, maybe
>>> 60GB taking into account the OS, so 80GB - 60GB = 20GB read on the first
>>> pass (1.25GB/stripe), and another full 5GB for the second backward read.
>>> That gives 16.25GB/stripe in the expected LRU case.
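
The strict-LRU estimate above can be checked with a small simulation. This is a
sketch, not a model of the real Lustre or Linux page cache: it assumes a single
global LRU, 1 GB blocks, 60 GB of usable cache (the figure guessed at above),
and that the two write passes warm the cache the same way a forward read would.
The function name `simulate_passes` is hypothetical.

```python
from collections import OrderedDict

FILE_GB = 80      # file size from the thread
CACHE_GB = 60     # assumed usable cache ("maybe 60GB taking into account the OS")
STRIPES = 16      # stripe count from the thread


def simulate_passes(file_blocks, cache_blocks, directions):
    """Return the number of cache misses in each pass under strict LRU."""
    cache = OrderedDict()  # keys are block numbers, least recently used first
    misses_per_pass = []
    for direction in directions:
        blocks = range(file_blocks)
        if direction == "backward":
            blocks = reversed(blocks)
        misses = 0
        for block in blocks:
            if block in cache:
                cache.move_to_end(block)       # hit: now most recently used
            else:
                misses += 1
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
        misses_per_pass.append(misses)
    return misses_per_pass


# Two writes (treated as forward passes), two forward reads, two backward reads.
passes = ["forward"] * 4 + ["backward"] * 2
read_misses = simulate_passes(FILE_GB, CACHE_GB, passes)[2:]
per_stripe_gb = sum(read_misses) / STRIPES

print(read_misses)     # GB fetched in each of the four read passes
print(per_stripe_gb)   # GB fetched per stripe over all read passes
```

Under these assumptions the four read passes miss 80, 80, 20 and 80 GB, or
16.25 GB per stripe, which agrees with the estimate above.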
>>>
>>> That you got 16-17GB read on many OSCs is expected. For the OSCs that had
>>> less read, I checked that the cached reads, sum(16GB - actual read), come
>>> to 45GB or so, so they don't exceed the amount that could have been cached.
>>>
>>> I don't know why this might have happened, but there could be several
>>> causes. If one of the LDLM locks was cancelled due to memory pressure, it
>>> would have allowed some data to stay in cache for the first backward read,
>>> and by being accessed more than once it wouldn't fall off the LRU for the
>>> second backward read.
>>>
>>> Cheers, Andreas
>>>
>>> Application request size is 1MB.
>>> Run on the swan cluster at Cray, Inc.
>>> lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
>>>
>>> The file is large enough to oversubscribe the system's memory. I would
>>> expect that each OST would see uniform activity. But that is far from the
>>> case. Here is the amount of data read by each OST during the entire iozone
>>> job; it ranges from 10G to 17G.
>>> <afcjgffc.png>
>>>
>>> When I look at how much data the OSTs have read versus time, some have no
>>> activity during the entire 2nd backwards read. The OSTs that have the low
>>> amount of data read also have very high application data delivery rates
>>> during these same periods, indicating the data is in the system cache.
>>> Is this to be expected?
>>>
>>> Thanks
>>>
>>> John
>>>
>>>
>>>
>>> _______________________________________________
>>> HPDD-discuss mailing list
>>> HPDD-discuss@lists.01.org
>>> https://lists.01.org/mailman/listinfo/hpdd-discuss