Thank you, Martin, for your response. I will see if I can reproduce that.
Regards,
Amit
From: Martin Hecht [mailto:hecht@hlrs.de]
Sent: Monday, June 18, 2018 3:00 AM
To: Kumar, Amit <ahkumar(a)mail.smu.edu>; hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] File Corruption even though md5sum are same on the copy
Hi Amit,
This could be short reads that are not handled correctly in your Python application.
A POSIX file system may read fewer bytes than the application requested into the read
buffer, but it must return the number of bytes actually read. The application must check
whether as many bytes as requested were read and, if not, repeat the read for the remainder.
The md5sum command probably handles short reads correctly, or the file is cached when you
look the second time, so the short read no longer happens.
We had some issues with short reads in Lustre as well. There are patches available that
reduce the number of short reads (see e.g. LU-6389, LU-6392), but even with those patches
short reads may occur, and that is perfectly acceptable. Your application (or the libraries
that it uses for I/O) should be fixed as well so that it handles short reads properly.
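For illustration, here is a minimal sketch of the retry loop described above. It assumes the application reads through a raw file descriptor with os.read, which reads at most the requested number of bytes. The helper name read_fully is hypothetical and does not come from your application or from Lustre.

```python
import os

def read_fully(fd, nbytes):
    """Read nbytes from fd, retrying after short reads.

    Returns fewer than nbytes only if end of file is reached first.
    (read_fully is a hypothetical helper name used for illustration.)
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        # os.read may legitimately return fewer bytes than requested.
        chunk = os.read(fd, remaining)
        if not chunk:
            # An empty result means end of file, not a short read.
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A caller would open the file with os.open(path, os.O_RDONLY), call read_fully(fd, n) wherever it currently issues a single read, and compare the length of the result with n to detect a truly truncated file.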
kind regards,
Martin
On 06/15/2018 06:33 PM, Kumar, Amit wrote:
Dear Lustre,
This is not a critical issue, but I am wondering whether this is possible and whether anyone
has noticed anything similar. We have an installation of Lustre 2.7 with ZFS-backed
MDTs/OSTs and have no issues other than this odd behavior with a certain
set of files.
We have run into some files that a Python application rejects with an error, complaining of
corruption. However, when I copy such a file (one the application reports as
corrupt) from the Lustre scratch to a new location in the Lustre scratch and then run
the application on the copy, it runs successfully. This is puzzling and unsettling. Given that
the md5sums of the two files match, the one that runs and the one reported as corrupt, I
wonder how one could explain this.
What other low-level verification can one do to understand the difference in behavior?
Any pointers here would be a great help.
Thank you,
Amit
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss