Patrick & Martin,
Thank you for your responses. The files are Gau-*.rwf files, which are data files
created by Gaussian, plus a bunch of text or log files. Gaussian accesses them in what
they call a logically random-access manner.
The error that we run into often is: Erroneous read. Read 1 instead of 8
I can't find any lustre/lnet errors in dmesg or the syslogs. I have seen
"Error -2 syncing data on lock cancel" here and there, but it also shows up
on other clients, so I don't think it is significant.
We are mounting Lustre over IB, not sure if that has any significance.
I am looking at running another test with additional debugging enabled; if you have any
hints here, that would be helpful.
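For that test, a small check like the sketch below could be run on a client, outside
Gaussian, against one of the files. It is only a sketch: read_exact is a made-up helper
name, and it assumes the counts in the Gaussian message are bytes, which I have not
confirmed.

```python
import os

def read_exact(path, offset, length):
    """Read `length` bytes at `offset` and report whether the read came back short.

    Hypothetical helper for illustration; not part of Gaussian or Lustre.
    Returns (data, short), where short is True if fewer bytes were returned.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.pread reads at an absolute offset without moving the file position.
        data = os.pread(fd, length, offset)
    finally:
        os.close(fd)
    return data, len(data) != length
```

Calling it with a path and an offset returns the data plus a flag that is true when
fewer bytes came back than were requested, which is the pattern the error message
seems to describe.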
Best Regards,
Amit
-----Original Message-----
From: Martin Hecht [mailto:hecht@hlrs.de]
Sent: Friday, September 04, 2015 11:15 AM
To: Patrick Farrell; Kumar, Amit; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
Maybe a flaky network connection, or the network over which Lustre is
mounted gets flooded by Gaussian's own communication? Are there any
lustre or lnet messages in dmesg and syslog which could be related to
such an event?
On 09/04/2015 06:08 PM, Patrick Farrell wrote:
> There's no concept or expectation of staleness in Lustre. It's likely
> something else is going on.
>
> What do the actual files look like?
> ________________________________________
> From: HPDD-discuss [hpdd-discuss-bounces(a)lists.01.org] on behalf of
> Kumar, Amit [ahkumar(a)mail.smu.edu]
> Sent: Friday, September 04, 2015 10:32 AM
> To: hpdd-discuss(a)lists.01.org
> Subject: [HPDD-discuss] Stale File is it possible?
>
> Dear All,
>
> We are running a lot of Gaussian jobs (I am sure everyone knows this, but for
> clarity, it is a computational chemistry package) that run for a month or two
> on our diskless cluster compute nodes. We have noticed that these jobs write
> to open files today, and then the next time they write to the same file will
> be in one to three weeks.
> Theoretically this should not be a problem. But we are running into a
> problem where Gaussian complains about: Erroneous read on file. So I am
> wondering, could the Lustre client/server evict any preserved information
> after some time because of no activity? Is there a timeout setting on the
> client that I could tweak so that the application keeps access to the files
> and does not behave as if they were stale? I am just trying to understand;
> any help here is greatly appreciated.
>
> Best regards,
> Amit
>