Andreas
Thank you for the reply. This investigation started with the
observation of slow backwards reads of a file by an MSC.Nastran run doing
a Lanczos eigenvalue solve (see image below). I point that out so it
is known that I am not investigating an academic run of iozone.
It is simply far simpler to work with iozone than with MSC.Nastran.
If you care to read a bit more to see the observed behavior of Lustre,
please read on.
The following image depicts the access of the file over time, by the
iozone run. What is quite odd is that when the second backward read of
the file begins,
the reading of the file is at its fastest (steep slope) during this
backwards read. This is at a time when all of the end of the file
should have been LRU'd out of the system buffers by the previous
backwards read. The rate then slows down through the meat of the file
and then starts getting faster again toward the end of the second
backwards read.
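To make the expectation above concrete, here is a minimal sketch of a strict LRU page cache replaying the read passes. The 1GB block size and the 60GB of usable cache are assumptions made only for illustration, and `resident_after` is a hypothetical helper, not anything from Lustre or the kernel. It is a toy model of the expectation, not of what the client actually does.

```python
from collections import OrderedDict


def resident_after(passes, file_blocks=80, cache_blocks=60):
    """Replay read passes through a strict LRU cache; return cached block numbers.

    Assumptions (illustrative only): 1GB blocks, an 80GB file, 60GB of cache.
    """
    cache = OrderedDict()  # oldest entry first, most recently used last
    for direction in passes:
        if direction == "fwd":
            order = range(file_blocks)
        else:
            order = range(file_blocks - 1, -1, -1)
        for block in order:
            if block in cache:
                cache.move_to_end(block)       # hit: mark as most recently used
            else:
                if len(cache) >= cache_blocks:
                    cache.popitem(last=False)  # miss: evict least recently used
                cache[block] = True
    return sorted(cache)


# State of the cache at the moment the second backward read begins.
resident = resident_after(["fwd", "fwd", "bwd"])
print(resident[0], resident[-1])  # lowest and highest cached block
```

Under these assumptions only the first 60GB of the file is resident when the second backward read starts, so the end of the file, where that read begins, would have to come from the OSTs.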
I have run this job many times and the behavior, as depicted in the
first image, is always the same. The slopes vary some, but there is
always this
serpentine look to it. It is not the same OSTs every time. If I run
this with iozone using 256K requests, the slopes for the backwards reads
get much lower.
To me, it seems as though something is wrong with the LRU mechanism.
Note in the last image, when iozone is using 256k requests, that this
behavior starts during the
forward reads of the file. It is not just a backward read phenomenon.
It happens every time when reading backwards, but only occasionally during
the forward reads.
John
On 1/18/2015 10:29 PM, Dilger, Andreas wrote:
On Jan 18, 2015, at 17:19, John Bauer
<bauerj@iodoctors.com> wrote:
I have been observing what I would think is unexpected behavior. I will try to keep this
short, and start with the question.
Should it be expected, when sequentially reading a striped file multiple times, that the
data from some OSTs remains in the system cache
while the data from others does not?
This isn't something that I'm aware of myself, nor something I'd necessarily
expect. That said, this isn't actually a bad thing.
File is 80GB in size.
System has 64GB of memory.
File is striped 16 way, 1MB stripe size. Application is iozone.
File is written forwards twice, then read forwards twice, then read backwards twice.
There is 80GB / 16 stripes = 5GB of data per stripe. If the pages were handled in strict
LRU order, then one would expect the two forward reads to blow out the cache, and result
in 10GB of data read per stripe. Then, the first backwards read would access most of the
data from cache, maybe 60GB taking into account the OS, so 80GB - 60GB = 20GB read on the
first pass (1.25GB/stripe), and another full 5GB for the second backward read. That gives
16.25GB/stripe in the expected LRU case.
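The arithmetic above can be checked with a small strict-LRU simulation. This is only a sketch under stated assumptions: 1GB blocks, 60GB of usable cache (the "maybe 60GB" estimate above), reads spread evenly over the 16 stripes, and an empty cache at the start of the first read. `misses_per_pass` is a hypothetical helper written for this illustration and does not model Lustre or the kernel page cache.

```python
from collections import OrderedDict

FILE_GB, CACHE_GB, STRIPES = 80, 60, 16  # values taken from the discussion above


def misses_per_pass(passes, file_gb=FILE_GB, cache_gb=CACHE_GB):
    """Return the GB that must be fetched from the OSTs in each read pass."""
    cache = OrderedDict()  # oldest entry first, most recently used last
    result = []
    for direction in passes:
        blocks = range(file_gb) if direction == "fwd" else reversed(range(file_gb))
        misses = 0
        for block in blocks:
            if block in cache:
                cache.move_to_end(block)       # cache hit, refresh LRU position
            else:
                misses += 1                    # cache miss, 1GB read from an OST
                if len(cache) >= cache_gb:
                    cache.popitem(last=False)  # evict the least recently used block
                cache[block] = True
        result.append(misses)
    return result


misses = misses_per_pass(["fwd", "fwd", "bwd", "bwd"])
per_stripe = sum(misses) / STRIPES
print(misses, per_stripe)  # [80, 80, 20, 80] 16.25
```

The per-pass figures match the estimate: both forward reads and the second backward read miss entirely, and only the first backward read is served mostly from cache.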
That you got 16-17GB read on many OSCs is expected. For the OSCs that read less, I
checked that the cached reads, sum(16GB - actual read), come to 45GB or so, so they don't
exceed the amount that could have been cached.
I don't know why this might have happened, but there could be several causes. If one
of the LDLM locks was cancelled due to memory pressure, it would have allowed some data to
stay in cache for the first backward read, and by being accessed more than once it
wouldn't fall off the LRU for the second backward read.
Cheers, Andreas
Application request size is 1MB.
Run on the swan cluster at Cray, Inc.
lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
The file is large enough to oversubscribe the system's memory. I would expect that
each OST would see uniform activity.
But that is far from the case. Here is the amount of data read by each OST during the
entire iozone job; it ranges from 10G to 17G.
<afcjgffc.png>
When I look at how much data the OSTs have read versus time, some have no activity
during the entire 2nd backwards read.
The OSTs that have the low amount of data read also have very high application data
delivery rates during these same periods, indicating the data is in the system cache.
Is this to be expected?
Thanks
John
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss