Hello,

Recently, one of our 12TB OSTs jumped from ~50% to 100% capacity within a matter of hours.

We set that OST to INACTIVE before it reached 100%, but it kept filling, indicating an ongoing file write.

When it reached 100%, we got an ENOSPC error on one of our clients:

helvetix05: 2014-03-11 09:35:06 helvetix05 kernel: [1610930.474849] LustreError: 4431:0:(vvp_io.c:1022:vvp_io_commit_write()) Write page 1572607068 of inode ffff8800c96277c8 failed -28
helvetix05: 2014-03-11 09:35:06 helvetix05 kernel: [1610930.692143] LustreError: 4431:0:(vvp_io.c:1022:vvp_io_commit_write()) Write page 1572607068 of inode ffff8800c96277c8 failed -28
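To make the fields of these log lines explicit, here is a minimal Python sketch that pulls out the page index, the inode value and the return code, and maps -28 to its errno name. The regular expression only assumes the "Write page N of inode X failed -RC" wording seen above; the inode value appears to be a client-side kernel address rather than an on-disk inode number, so it is kept as an opaque hex string (treat that reading as an assumption).

```python
import errno
import os
import re

# Matches only the wording visible in the vvp_io_commit_write() lines above.
LOG_RE = re.compile(
    r"Write page (?P<page>\d+) of inode (?P<inode>[0-9a-f]+) failed (?P<rc>-\d+)"
)


def parse_commit_write_error(line):
    """Return (page_index, inode_hex, errno_name, message), or None if no match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    rc = -int(m.group("rc"))  # the kernel reports errors as negative errno values
    return (
        int(m.group("page")),
        m.group("inode"),      # kept as a string; assumed to be a kernel address
        errno.errorcode.get(rc, "UNKNOWN"),
        os.strerror(rc),
    )


line = ("helvetix05: 2014-03-11 09:35:06 helvetix05 kernel: [1610930.474849] "
        "LustreError: 4431:0:(vvp_io.c:1022:vvp_io_commit_write()) "
        "Write page 1572607068 of inode ffff8800c96277c8 failed -28")
print(parse_commit_write_error(line))
```

On Linux, errno 28 is ENOSPC ("No space left on device"), which matches the error reported on the client.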

We tried to track down the runaway file using 'lfs find', but on a filesystem with 250 million files this is no easy feat.
We also asked the suspected user, but he has no idea what went wrong or how.

QUESTIONS:

Can we assume that page index 1572607068 points to a ~6TB file (1572607068 * 4096 bytes)?
(This would be consistent with the OST capacity figures given above.)
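The arithmetic behind this estimate, as a small Python sketch. It assumes a 4096-byte client page size (the x86_64 default; the log does not state it). The result is the byte offset of the page being written, so the file's apparent size was at least roughly this large; whether all of that data sits on this single OST depends on the file's striping.

```python
PAGE_SIZE = 4096  # assumed client page size (x86_64 default), not taken from the log


def page_to_offset(page_index, page_size=PAGE_SIZE):
    """Byte offset of the first byte of the given page within the file."""
    return page_index * page_size


offset = page_to_offset(1572607068)
print(offset)                       # 6441398550528
print(f"{offset / 1e12:.2f} TB")    # 6.44 TB (decimal)
print(f"{offset / 2**40:.2f} TiB")  # 5.86 TiB (binary)
```

Either way the figure is close to half of a 12TB OST, in line with the jump from ~50% to 100%.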

Is there a way to find which file corresponds to inode ffff8800c96277c8?

Is there a way to perform the equivalent of 'lfs find' directly on the MDS (e.g. by mounting the underlying ldiskfs)?

Thanks for your help,

Cédric

--

Cédric Dufour @ Idiap Research Institute
EPFL Engineer

E-mail:  mailto:cedric.dufour@idiap.ch
Phone:   +41 27 721 77 40
Fax:     +41 27 721 77 12
Mail:    Idiap Research Institute
         Case postale 592
         Centre du Parc - Rue Marconi 19
         1920 Martigny (VS)
         Suisse (Switzerland)
Website: http://www.idiap.ch / http://www.idiap.ch/~cdufour