Cédric,
When I have had similar issues, I start by using “lfs quota” to generate a list of each
user’s usage. There is a flag (which I don’t remember off the top of my head) that will
report usage for a specific OST so I can target only the OST that is filling up. I then
sort the list, and usually the top user is significantly higher than the #2 user. While
this may not guarantee that the top user is the offender, it gives you a more focused
starting point (but in my experience, the top user has always been the one who caused the
problem).
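The steps above can be sketched roughly like this (the mount point, OST UUID, and UID cutoff are all assumptions for illustration, and the per-OST flag is exactly the one I couldn't remember — verify the option names against the lfs man page on your version):

```shell
#!/bin/sh
# Sketch only -- flag names from memory; check `man lfs` before relying on this.
# Assumptions: filesystem mounted at /lustre, the filling OST's UUID as below,
# and real users having UIDs >= 1000.
FS=/lustre
OST=lustre-OST0007_UUID   # hypothetical UUID of the OST that is filling up

for u in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
    # -o should limit the report to one OST; with -q, field 2 is kbytes used
    kb=$(lfs quota -q -u "$u" -o "$OST" "$FS" 2>/dev/null | awk 'NR==1 {print $2}')
    printf '%s %s\n' "$u" "${kb:-0}"
done | sort -k2,2nr | head   # biggest consumer first
```

The final numeric reverse sort is what surfaces the usual pattern of the top user sitting well above #2.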
After that, I do a little manual digging to find the offending file. I look for
interactive processes on the login nodes that might be creating large files (e.g., tar,
hsi). I also look through the user’s scratch directory and check the files/dirs with
the most recent timestamps. This works for me about 90% of the time, and I can often
locate the file within 15-30 minutes. I will also check whether the user is running any
batch jobs, but since the jobs can potentially be large, it is not always easy to
determine all the file handles a job has open.
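A rough sketch of that digging (the scratch path, time window, and process names are assumptions — tune them to when the OST started filling):

```shell
#!/bin/sh
# Sketch only. Assumptions: suspect user's scratch path, a 2-hour window,
# and a grep list of typical interactive writers.
SCRATCH=/lustre/scratch/suspect_user   # hypothetical path

# Files modified in the last 2 hours, biggest first (GNU find's -printf)
find "$SCRATCH" -type f -mmin -120 -printf '%s %p\n' 2>/dev/null \
    | sort -nr | head -20

# On each login node, look for interactive processes creating large files
ps -u suspect_user -o pid,etime,args 2>/dev/null | grep -E 'tar|hsi|cp|dd'
```

Note that a plain `find` like this still stats every file under the directory, so keep it pointed at one user's tree rather than the whole filesystem.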
If those steps are not successful, then I resort to running “lfs find” on that user’s
scratch directory to locate the file. Since I am only targeting a portion of the file
system, there is some hope that the command will complete before the OST fills up.
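Something along these lines (the path and OST index are assumptions, and the option spelling depends on your Lustre version — newer lfs find takes an OST index, older ones take the OST UUID via --obd):

```shell
# Sketch only -- check `lfs find --help` for the exact OST-selection option.
# Assumptions: scratch path and OST index 7. --size prunes the scan: the OST
# jumped ~50% of 12 TB, i.e. ~6 TB of new data, so files over 1 TB are the
# natural first targets.
lfs find /lustre/scratch/suspect_user --ost 7 --size +1T
```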
[I know this doesn’t answer the technical questions that you had, but I thought there
might be some useful info there. If nothing else, it might be relevant to other admins
who have to deal with similar problems.]
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
On Mar 13, 2015, at 5:17 AM, Cédric Dufour - Idiap Research Institute
<cedric.dufour(a)idiap.ch> wrote:
Hello,
Lately, one of our (12TB) OST jumped from ~50% to 100% capacity in a matter of hours.
We switched that OST to INACTIVE before it reached 100% but it kept filling, indicating
an ongoing file write.
At the time it reached 100%, we got an ENOSPC on one of our clients:
helvetix05: 2014-03-11 09:35:06 helvetix05 kernel: [1610930.474849] LustreError:
4431:0:(vvp_io.c:1022:vvp_io_commit_write()) Write page 1572607068 of inode
ffff8800c96277c8 failed -28
We tried to catch the runaway file using 'lfs find', but with ~250 million files on
the filesystem, this is no easy feat.
We also asked the suspected user, but he has no idea what went wrong.
QUESTION:
Can we assume the 1572607068 page figure points to a ~6TB file (1572607068*4096 bytes)?
(this would be consistent with the given OST capacity figures)
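(Sanity-checking that arithmetic, assuming the standard 4 KiB Linux page size:)

```shell
# Page index from the LustreError message times the 4 KiB page size gives
# the byte offset of the failed write -- so the file reaches at least ~6 TB.
bytes=$((1572607068 * 4096))
echo "$bytes"   # 6441398550528 bytes, i.e. ~5.9 TiB
```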
Is there a way to find which file corresponds to the ffff8800c96277c8 inode ?
Is there a way to perform the equivalent of the 'lfs find' directly on the MDS
(e.g. by mounting the underlying ldiskfs) ?
Thanks for your help,
Cédric
--
Cédric Dufour @ Idiap Research Institute
EPFL Engineer
E-mail: cedric.dufour@idiap.ch
Phone: +41 27 721 77 40
Fax: +41 27 721 77 12
Mail: Idiap Research Institute
Case postale 592
Centre du Parc - Rue Marconi 19
1920 Martigny (VS)
Suisse (Switzerland)
Website: http://www.idiap.ch / http://www.idiap.ch/~cdufour
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss