Awesome, you all!! And sorry for the gap in my response. I was out for an extended long weekend
and am catching up on things.
My version of Lustre is 2.4.3. Assuming LU-6389 also affects this version, this already
known bug could solve our mystery.
As Bernd pointed out, we compiled our own from source since we have a license for that. I
will try the read-preload solution and see if we can relieve our users of this problem.
Exciting... thank you for this hope...
Best Regards,
Amit
-----Original Message-----
From: Bernd Schubert [mailto:bernd.schubert@fastmail.fm]
Sent: Monday, September 07, 2015 3:14 PM
To: Patrick Farrell; Kumar, Amit; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
On 09/04/2015 11:09 PM, Patrick Farrell wrote:
> Sent this once already, but in reply to the wrong message...
>
> Martin might know about that short read thing, since his site has a nice wiki
> page on it:
>
> https://wickie.hlrs.de/platforms/index.php/Lustre_short_read
>
> Technically Lustre is allowed to return fewer bytes than requested, as
> it says on that page. But it doesn't normally - LU-6389 is a bug
> where that can happen kind of often. (Again, it's technically allowed
> as that page says... But it shouldn't really happen in practice,
> which is why LU-6389 is a bug.)
>
> So perhaps Gaussian does not retry short reads? If memory serves, it's
> closed source, so you can't check - but perhaps you could ask the vendor?
Depends on the license. At least in the past it was common to get the
Gaussian source code so that you could compile it yourself or modify it (I
did that myself a long time ago).
Short reads should be solvable using LD_PRELOAD, although one might argue
that Gaussian as well as Lustre should be able to handle this better on
their own.
Attached are rather untested read-preload files. Compile them and then run the
Gaussian binary with something like
LD_PRELOAD=<path/to/file>/read-preload.so <binary>
Btw, the wiki code is not ideal: if the very first read already returns -1, it is
going to use ptr - 1, which might be outside the valid address space. The same applies if
there is one successful (short) read followed by several reads that return -1.
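For illustration only, here is a minimal sketch of what such a preload library might look like. It is not the attached code and not the wiki code. The helper name read_retry and the choice to retry only on regular files are assumptions made for this example. It wraps read() via dlsym(RTLD_NEXT, ...), keeps reading until the request is satisfied or EOF is reached, and never advances the buffer pointer when a read returns -1.

```c
#define _GNU_SOURCE             /* needed for RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Signature of read(2); also lets a stand-in be passed for testing. */
typedef ssize_t (*read_fn_t)(int fd, void *buf, size_t count);

/*
 * Hypothetical helper: call do_read until count bytes have arrived,
 * EOF is hit, or an error occurs. The offset only grows by positive
 * return values, so a -1 result can never move the pointer backwards.
 */
static ssize_t read_retry(read_fn_t do_read, int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = do_read(fd, (char *)buf + done, count - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: try again */
            /* Report what was already read; fail only if nothing was. */
            return done > 0 ? (ssize_t)done : -1;
        }
        if (n == 0)
            break;              /* EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/*
 * Interposed read(). With LD_PRELOAD the dynamic linker resolves read()
 * to this function first; the real one is looked up with RTLD_NEXT.
 */
ssize_t read(int fd, void *buf, size_t count)
{
    static read_fn_t real_read;
    struct stat st;

    if (!real_read)
        real_read = (read_fn_t)dlsym(RTLD_NEXT, "read");
    if (!real_read) {
        errno = ENOSYS;
        return -1;
    }

    /* Assumption: retry only on regular files, so that pipes, sockets
     * and terminals keep their normal short-read behaviour. */
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode))
        return real_read(fd, buf, count);

    return read_retry(real_read, fd, buf, count);
}
```

Assuming the sketch is saved as read-preload.c (a file name chosen here for illustration), it could be built with something like gcc -shared -fPIC -o read-preload.so read-preload.c -ldl and then used with the LD_PRELOAD line above. Note that it only covers read(); an application that uses pread() or readv() would need those wrapped in the same way.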
Cheers,
Bernd
--
DataDirect Networks