Hi Patrick & Martin,
Yes, I am able to read the file with other programs without issues. At this point I have added
additional debugging code to print errno and to list the file behind the file descriptor the
program is working with when this happens, just so I can confirm the culprit. At this point I am
ruling out a short read, as I was not able to hit the bug with the reproducer
code attached to the short-read bug thread.
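For reference, one way such debugging code might map a descriptor back to a file name on a Linux client is to read the /proc/self/fd symlink. This is only a sketch: fd_to_path is a hypothetical helper name, not something from Gaussian or Lustre, and it assumes /proc is mounted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: resolve an open descriptor to the path the kernel
 * has recorded for it. Linux-specific, relies on /proc/self/fd/<n>.
 * Returns 0 on success, -1 on failure. */
int fd_to_path(int fd, char *out, size_t outlen)
{
    char link[64];
    ssize_t n;

    if (outlen == 0)
        return -1;
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    n = readlink(link, out, outlen - 1);
    if (n < 0)
        return -1;
    out[n] = '\0';   /* readlink does not NUL-terminate the result */
    return 0;
}
```

Calling this right after a failed read(), before anything else touches the descriptor, would show which file the job was actually reading from.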
The job is still running; I will update you all when I find out more.
Best Regards,
Amit
________________________________________
From: Martin Hecht [hecht(a)hlrs.de]
Sent: Friday, September 11, 2015 4:38 AM
To: Kumar, Amit
Cc: Patrick Farrell; Faaland, Olaf P.; Bernd Schubert; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
so, if the read fails completely (i.e. no short read), can you read the
file in question by some other program from the same client, e.g. wc -c
file?
On 09/11/2015 03:23 AM, Patrick Farrell wrote:
So just to confirm, if it's saying -1, that pretty much rules out
short reads.
You might check dmesg or the Lustre logs for any errors around that time.
________________________________________
From: Kumar, Amit [ahkumar(a)mail.smu.edu]
Sent: Thursday, September 10, 2015 6:58 PM
To: Faaland, Olaf P.; Patrick Farrell; Bernd Schubert; Martin Hecht;
hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Hi Olaf,
Yes, it does. It has its own gperror, which then calls perror, although it only prints
the function name, and I already know where that is. I will look at perror and strerror
and see if I can update the gperror function to print additional information. Strangely,
this happens only to a particular set of long-running jobs; the rest of the long-running jobs
seem to work fine. So let's see if I can dig into this further. Unfortunately, each run takes
about 7 hours before it fails, so hopefully I will have more info on this in the morning.
Thank you for your input!!!
Regards,
Amit
________________________________________
From: Faaland, Olaf P. [faaland1(a)llnl.gov]
Sent: Thursday, September 10, 2015 3:39 PM
To: Kumar, Amit; Patrick Farrell; Bernd Schubert; Martin Hecht;
hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Amit,
> In all the errors that I have seen it always says Read -1 instead of 8
>
> This message is printed by Gaussian after:
> if ((istat = read (fd, bufadr, len)) < 0) {
So if read() returned -1, then the read failed, and no bytes were returned at all. It
isn't necessarily a Lustre issue, it could be a bug in Gaussian or a misconfiguration
of some kind.
However the -1 just tells you the read failed, not the mode of failure. You need to know
the value of errno to find that out.
After this, does Gaussian call perror() and/or log the actual value of errno and its
corresponding string? It should. See man 3 perror and man 3 strerror. Then report back
what it says.
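A minimal sketch of the kind of logging meant here, assuming a hypothetical wrapper named read_logged (this is not Gaussian's gperror, whose source is not shown in this thread):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper: call read() and, on failure, report errno both
 * as a number and as text. Returns whatever read() returned. */
ssize_t read_logged(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n < 0) {
        int saved = errno;   /* fprintf may change errno, so save it first */
        fprintf(stderr, "read(fd=%d, len=%zu) failed: errno=%d (%s)\n",
                fd, len, saved, strerror(saved));
        errno = saved;       /* restore for the caller */
    }
    return n;
}
```

The important detail is that errno is captured immediately after the failing call, since any later library call may overwrite it.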
Good luck!
Olaf P. Faaland
Livermore Computing
phone : 925-422-2263
________________________________________
From: HPDD-discuss [hpdd-discuss-bounces(a)lists.01.org] on behalf of Kumar, Amit
[ahkumar(a)mail.smu.edu]
Sent: Thursday, September 10, 2015 1:12 PM
To: Patrick Farrell; Bernd Schubert; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Stale File is it possible?
Dear All,
Bernd: I ran with LD_PRELOAD for read and did not get very far; the Gaussian programs just sit
there spinning on read. I think I may have to tweak the read function. I also ran
the reproducer code that Bill Barth attached to the bug report, to see if I would run into
this issue, and I could not reproduce the bug. So I am not sure I am heading in the
right direction. Interestingly, when I ran Bill Barth's reproducer code with the
LD_PRELOAD function, I did get to see that BUG error.
I looked at the read man page, and it seems read returns -1 for the defined errors and the
number of bytes read otherwise.
In all the errors that I have seen it always says Read -1 instead of 8
This message is printed by Gaussian after:
if ((istat = read (fd, bufadr, len)) < 0) {
Any hints to further debug this will be greatly appreciated.
I am trying to recompile the code to see if I can add debug statements.
Best Regards,
Amit
________________________________________
From: Patrick Farrell [paf(a)cray.com]
Sent: Wednesday, September 09, 2015 10:05 AM
To: Kumar, Amit; Bernd Schubert; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Since the layout lock is in Lustre 2.4, I think LU-6389 is a possibility in 2.4. Good
luck.
________________________________________
From: Kumar, Amit [ahkumar(a)mail.smu.edu]
Sent: Wednesday, September 09, 2015 9:27 AM
To: Bernd Schubert; Patrick Farrell; Martin Hecht; hpdd-discuss(a)lists.01.org
Subject: RE: [HPDD-discuss] Stale File is it possible?
Awesome, you all!! And sorry for the gap in my response. I was out for an extended long
weekend and am catching up on things.
My version of Lustre is 2.4.3. Assuming LU-6389 also affects this version, this already
known bug could solve our mystery.
As Bernd pointed out, we compiled our own from source since we have a license for that. I
will try the read-preload solution and see if we can relieve our users of this.
Exciting... thank you for this hope.
Best Regards,
Amit
> -----Original Message-----
> From: Bernd Schubert [mailto:bernd.schubert@fastmail.fm]
> Sent: Monday, September 07, 2015 3:14 PM
> To: Patrick Farrell; Kumar, Amit; Martin Hecht; hpdd-discuss(a)lists.01.org
> Subject: Re: [HPDD-discuss] Stale File is it possible?
>
> On 09/04/2015 11:09 PM, Patrick Farrell wrote:
>> Sent this once already, but in reply to the wrong message...
>>
>> Martin might know about that short read thing, since his site has a nice wiki page on it:
>>
>> https://wickie.hlrs.de/platforms/index.php/Lustre_short_read
>>
>> Technically Lustre is allowed to return fewer bytes than requested, as
>> it says on that page. But it doesn't normally - LU-6389 is a bug
>> where that can happen kind of often. (Again, it's technically allowed
>> as that page says... But it shouldn't really happen in practice,
>> which is why LU-6389 is a bug.)
>>
>> So perhaps Gaussian does not retry short reads? If memory serves, it's
>> closed source, so you can't check - but perhaps you could ask the vendor?
>
> Depends on the license. At least in the past it was common to get the
> Gaussian source code, to be able to compile it yourself or to modify it (I
> did that myself a long time ago).
>
> Short reads should be solvable using LD_PRELOAD, although one might argue
> that Gaussian as well as Lustre should be able to handle this better on
> their own.
> Attached are rather untested read-preload files. Compile them and then run the
> Gaussian binary with something like
>
> LD_PRELOAD=<path/to/file>/read-preload.so <binary>
>
>
> Btw, the wiki code is not ideal: if the very first read returns -1, it is
> going to use ptr - 1, which might be outside the valid address space. The same
> applies if there is one successful (short) read followed by several -1 results.
>
> Cheers,
> Bernd
>
> --
> DataDirect Networks
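
Bernd's remark about the wiki loop (advancing the pointer by a -1 return value) can be illustrated with a retry loop that only moves the buffer pointer after a successful read. This is a minimal sketch under stated assumptions: read_full is a hypothetical name, and this is neither the wiki code nor the read-preload attachment mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until len bytes have arrived,
 * end of file is reached, or a real error occurs.
 * Returns the number of bytes read (less than len only at EOF),
 * or -1 with errno set by the failing read(). */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);

        if (n > 0) {
            done += (size_t)n;   /* advance only by bytes actually read */
        } else if (n == 0) {
            break;               /* end of file */
        } else if (errno == EINTR) {
            continue;            /* interrupted by a signal, retry */
        } else {
            return -1;           /* real error: offset is never moved by -1 */
        }
    }
    return (ssize_t)done;
}
```

One design choice to be aware of: on an error after partial progress this sketch returns -1 and discards the partial count. A caller that needs the partial data would have to return the count through a separate output parameter.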