Sent this once already, but in reply to the wrong message...
Martin might know about that short read thing, since his site has a nice wiki page on it:
https://wickie.hlrs.de/platforms/index.php/Lustre_short_read
Technically, Lustre is allowed to return fewer bytes than requested, as it says on that
page. But it shouldn't really happen in practice, which is why LU-6389 is a bug: with it,
short reads can happen fairly often.
So perhaps Gaussian does not retry short reads? If memory serves, it's closed source,
so you can't check - but perhaps you could ask the vendor?
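For illustration, here is a minimal Python sketch of what "retrying short reads" means. The helper name read_fully is hypothetical, and this is not taken from Gaussian or Lustre; it only shows the usual pattern of calling read again until the requested byte count has arrived or end of file is reached.

```python
import os


def read_fully(fd, count):
    """Read up to `count` bytes from `fd`, retrying after short reads.

    Returns fewer than `count` bytes only if end of file is reached.
    `read_fully` is a hypothetical helper name used for illustration.
    """
    chunks = []
    remaining = count
    while remaining > 0:
        # os.read may legally return fewer bytes than requested.
        chunk = os.read(fd, remaining)
        if not chunk:
            # An empty result means end of file, so stop retrying.
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

An application that instead treats the first short result as fatal would report something like "Read 1 instead of 8" even though the rest of the data is available on the next call.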
________________________________________
From: HPDD-discuss [hpdd-discuss-bounces(a)lists.01.org] on behalf of Patrick Farrell
[paf(a)cray.com]
Sent: Friday, September 04, 2015 4:03 PM
To: Kumar, Amit; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
What's your Lustre version?
If the file looks OK when you look at it yourself (i.e., no gaps), you might be running
into this bug:
https://jira.hpdd.intel.com/browse/LU-6389
Lustre 2.5 and newer will sometimes return fewer than expected bytes on a read or write,
without giving an error.
- Patrick
________________________________________
From: Kumar, Amit [ahkumar(a)mail.smu.edu]
Sent: Friday, September 04, 2015 2:51 PM
To: Martin Hecht; Patrick Farrell; hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Patrick & Martin,
Thank you for your responses. The files are Gau-*.rwf files, which are data files created
by Gaussian, plus a bunch of text or log files. Gaussian does what it calls logically
random access.
The error that we often run into is: Erroneous read. Read 1 instead of 8
I can't find any Lustre/LNet errors in dmesg or syslog. I have seen
"Error -2 syncing data on lock cancel" here and there, but those also tend to appear
on other clients, so I don't think they are significant.
We are mounting Lustre over IB, not sure if that has any significance.
I am looking at running another test with additional debugging enabled; if you have any
hints here, that would be helpful.
Best Regards,
Amit
-----Original Message-----
From: Martin Hecht [mailto:hecht@hlrs.de]
Sent: Friday, September 04, 2015 11:15 AM
To: Patrick Farrell; Kumar, Amit; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
Maybe it's a flaky network connection, or the network over which Lustre is
mounted gets flooded by Gaussian's own communication? Are there any
Lustre or LNet messages in dmesg or syslog which could be related to
such an event?
On 09/04/2015 06:08 PM, Patrick Farrell wrote:
> There's no concept or expectation of staleness in Lustre. It's likely
> something else is going on.
>
> What do the actual files look like?
> ________________________________________
> From: HPDD-discuss [hpdd-discuss-bounces(a)lists.01.org] on behalf of
> Kumar, Amit [ahkumar(a)mail.smu.edu]
> Sent: Friday, September 04, 2015 10:32 AM
> To: hpdd-discuss(a)lists.01.org
> Subject: [HPDD-discuss] Stale File is it possible?
>
> Dear All,
>
> We are running a lot of Gaussian jobs (I am sure everyone knows this, but for
> clarity, Gaussian is a computational chemistry package) that run for a month or
> two on our diskless cluster-compute nodes. We have noticed that these jobs
> write to open files today, and the next write to the same file may come one
> to three weeks later.
> Theoretically this should not be a problem. But we are running into a
> problem where Gaussian complains about: Erroneous read on file. Could the
> Lustre client or server evict any preserved information after some time
> because of inactivity? Is there a timeout setting on the client that I could
> tweak so that the application keeps its access to the files and they do not
> behave as if they were stale? I am just trying to understand; any help here
> is greatly appreciated.
>
> Best regards,
> Amit
>
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss