Maybe a flaky network connection, or the network over which Lustre is
mounted gets flooded by Gaussian's own communication? Are there any
Lustre or LNet messages in dmesg or syslog that could be related to
such an event?
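A quick way to check is to search the kernel ring buffer and the persistent syslog for Lustre/LNet messages around the time of the read error. This is only a sketch: the syslog path varies by distribution (/var/log/messages on RHEL-family systems, /var/log/syslog on Debian/Ubuntu), and an empty result is itself useful information.

```shell
# Search the kernel ring buffer for Lustre/LNet messages
# (evictions, lost connections, reconnects, timeouts).
dmesg | grep -iE 'lustre|lnet' || true

# Search the persistent syslog as well; path is distribution-specific.
grep -iE 'lustre|lnet|evict' /var/log/messages 2>/dev/null || true
```

The `|| true` keeps the commands from returning a nonzero status when nothing matches, which is convenient when running them from a script.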
On 09/04/2015 06:08 PM, Patrick Farrell wrote:
There's no concept or expectation of staleness in Lustre.
It's likely something else is going on.
What do the actual files look like?
________________________________________
From: HPDD-discuss [hpdd-discuss-bounces(a)lists.01.org] on behalf of Kumar, Amit
[ahkumar(a)mail.smu.edu]
Sent: Friday, September 04, 2015 10:32 AM
To: hpdd-discuss(a)lists.01.org
Subject: [HPDD-discuss] Stale File is it possible?
Dear All,
We are running a lot of Gaussian (I am sure everyone knows this, but for clarity, it is a
computational chemistry package) jobs that run for a month or two on our diskless
cluster compute nodes. We have noticed that these jobs write to their open files today,
and then the next write to the same file may not happen for one to three weeks.
In theory this should not be a problem, but we are running into one: Gaussian complains
with "Erroneous read on file." Could the Lustre client or server evict any state it
preserves for a file after some period of no activity? Is there a timeout setting on the
client that I could tweak to make the application resilient, so that it keeps its access
to the files and does not behave as if they were stale? Just trying to understand; any
help here is greatly appreciated.
Best regards,
Amit