Amit,
In all the errors that I have seen it always says Read -1 instead of 8
This message is printed by Gaussian after:
if ((istat = read (fd, bufadr, len)) < 0) {
So if read() returned -1, then the read failed, and no bytes were returned at all. It
isn't necessarily a Lustre issue, it could be a bug in Gaussian or a misconfiguration
of some kind.
However, the -1 just tells you the read failed, not the mode of failure. You need to know
the value of errno to find that out.
After this, does Gaussian call perror() and/or log the actual value of errno and its
corresponding string? It should. See man 3 perror and man 3 strerror. Then report back
what it says.
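If the source does not already log errno, something along these lines could be added around the read call. This is only a sketch: read_logged is a hypothetical helper name of mine, not anything in Gaussian, and the message format is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from Gaussian): perform a single read() and,
 * if it fails, log the errno value and its strerror() text to stderr.
 * Returns exactly what read() returned. */
ssize_t read_logged(int fd, void *buf, size_t len)
{
    assert(buf != NULL);

    ssize_t n = read(fd, buf, len);
    if (n < 0) {
        int saved = errno;  /* save it: fprintf() may overwrite errno */
        fprintf(stderr, "read(fd=%d, len=%zu) failed: errno=%d (%s)\n",
                fd, len, saved, strerror(saved));
        errno = saved;      /* restore it for the caller */
    }
    return n;
}
```

The point is to capture errno immediately after the failing call, before any other library call can change it.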
Good luck!
Olaf P. Faaland
Livermore Computing
phone : 925-422-2263
________________________________________
From: HPDD-discuss [hpdd-discuss-bounces(a)lists.01.org] on behalf of Kumar, Amit
[ahkumar(a)mail.smu.edu]
Sent: Thursday, September 10, 2015 1:12 PM
To: Patrick Farrell; Bernd Schubert; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
Dear All,
Bernd: I ran with LD_PRELOAD for read and did not get far; the Gaussian programs just sit
there spinning on read. I think I may have to tweak the read function. I also ran
the reproducer code that Bill Barth attached to the bug report, to see whether I would run
into this issue, and I could not reproduce the bug errors. So I am not sure I am heading in
the right direction. Interestingly, when I ran Bill Barth's reproducer code with the
LD_PRELOAD function, I did see that BUG error.
I looked at the read man pages, and it seems that read returns -1 for the defined errors
and otherwise the number of bytes read.
In all the errors that I have seen it always says Read -1 instead of 8
This message is printed by Gaussian after:
if ((istat = read (fd, bufadr, len)) < 0) {
Any hints to further debug this will be greatly appreciated.
I am trying to recompile the code to see if I can add debug statements.
Best Regards,
Amit
________________________________________
From: Patrick Farrell [paf(a)cray.com]
Sent: Wednesday, September 09, 2015 10:05 AM
To: Kumar, Amit; Bernd Schubert; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Since the layout lock is in Lustre 2.4, I think LU-6389 is a possibility in 2.4. Good
luck.
________________________________________
From: Kumar, Amit [ahkumar(a)mail.smu.edu]
Sent: Wednesday, September 09, 2015 9:27 AM
To: Bernd Schubert; Patrick Farrell; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Awesome you all!! And sorry for the gap in my response. I was out for an extended long
weekend and am catching up on things.
My version of Lustre is 2.4.3. Assuming LU-6389 also affects this version, this already
known bug could solve our mystery.
As Bernd pointed out, we compiled our own from source since we have a license for that. I
will try the read-preload solution and see if we can relieve our users from this.
Exciting... thank you for this hope...
Best Regards,
Amit
-----Original Message-----
From: Bernd Schubert [mailto:bernd.schubert@fastmail.fm]
Sent: Monday, September 07, 2015 3:14 PM
To: Patrick Farrell; Kumar, Amit; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
On 09/04/2015 11:09 PM, Patrick Farrell wrote:
> Sent this once already, but in reply to the wrong message...
>
> Martin might know about that short read thing, since his site has a nice wiki
> page on it:
>
> https://wickie.hlrs.de/platforms/index.php/Lustre_short_read
>
> Technically Lustre is allowed to return fewer bytes than requested, as
> it says on that page. But it doesn't normally - LU-6389 is a bug
> where that can happen kind of often. (Again, it's technically allowed
> as that page says... But it shouldn't really happen in practice,
> which is why LU-6389 is a bug.)
>
> So perhaps Gaussian does not retry short reads? If memory serves, it's
> closed source, so you can't check - but perhaps you could ask the vendor?
Depends on the license. At least in the past it was common to get the
Gaussian source code, so that you could compile it yourself or modify it (I
did that myself a long time ago).
Short reads should be solvable using LD_PRELOAD, although one might argue
that Gaussian as well as Lustre should be able to handle this better on
their own.
Attached are rather untested read-preload files. Compile them and then run the
Gaussian binary with something like
LD_PRELOAD=<path/to/file>/read-preload.so <binary>
Btw, the wiki code is not ideal: if the very first read already returns -1, it is
going to use ptr - 1, which might be outside of the valid address space. The same
applies if one successful (short) read is followed by several reads returning -1.
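To illustrate the point about -1, here is a minimal sketch of a retry loop that only advances the buffer offset by non-negative byte counts. It is not the attached read-preload code and not the wiki code; read_full is a hypothetical name, and retrying on EINTR is my own assumption about what is wanted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until len bytes have arrived,
 * EOF is reached, or an error occurs. The offset is advanced only by
 * positive return values, so a -1 can never move the pointer backwards. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    assert(buf != NULL);

    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: just retry */
            /* Real error: report the bytes already read, if any,
             * otherwise -1 with errno left as set by read(). */
            return done > 0 ? (ssize_t)done : -1;
        }
        if (n == 0)
            break;          /* EOF: return what we have */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Returning the partial count when an error follows a short read mirrors what read() itself would do; a caller that needs all-or-nothing semantics would have to treat any result smaller than len as a failure.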
Cheers,
Bernd
--
DataDirect Networks
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss