Hello,
Thanks @Andreas for the tip.
Now, on the MDS side, we have:
# cat /proc/fs/lustre/mdt/lustre-1-MDT0000/exports/172.20.18.5@tcp1/open_files | wc -l
39911
While, on the client, we have:
# ifconfig eth0 | fgrep inet
inet addr:172.20.18.5 Bcast:172.23.255.255 Mask:255.248.0.0
# lsof /remote/lustre/1/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bash 28915 dpalaz cwd DIR 3862,716834 4096 144116816069792445 /remote/lustre/1/temp/dpalaz/database/aurora2/word-decode/hmm-wrd
Also, globally, we have trouble reconciling user quotas with what 'du'/'lfs find'
report:
# lfs quota -v -u dpalaz /remote/lustre/1/
Disk quotas for user dpalaz (uid 1437):
Filesystem kbytes quota limit grace files quota limit grace
/remote/lustre/1/ 2884166288 3218079744 3221225472 - 37184 9999900 10000000 -
lustre-1-MDT0000_UUID
3308 - 0 - 37184 - 3782108 -
lustre-1-OST0000_UUID
31523048 - 48289260 - - - - -
lustre-1-OST0001_UUID
32057272 - 48822328 - - - - -
lustre-1-OST0002_UUID
40102808 - 56869588 - - - - -
lustre-1-OST0003_UUID
38586184 - 55351728 - - - - -
lustre-1-OST0004_UUID
32152436 - 48916992 - - - - -
lustre-1-OST0005_UUID
37890432 - 54656832 - - - - -
lustre-1-OST0006_UUID
41217912 - 57985512 - - - - -
lustre-1-OST0007_UUID
36945752 - 53711000 - - - - -
lustre-1-OST0008_UUID
35597944 - 52363128 - - - - -
lustre-1-OST0009_UUID
30781596 - 47545668 - - - - -
lustre-1-OST000a_UUID
2494494448 - 2511263500 - - - - -
lustre-1-OST000b_UUID
32813148 - 49578236 - - - - -
('du' and 'lfs find' report figures around 411GiB and ~580,000 files; the
missing 2.4TB is roughly consistent with the quota figure on lustre-1-OST000a)
What could explain this behavior?
We're running Lustre 2.5.2 server-side and 2.4.? client-side (with a plan to
update the clients to 2.6).
Thanks and best,
Cédric
On 14/03/15 11:20, Dilger, Andreas wrote:
If you know the specific client NID that is doing the IO, and you
have a newer version of Lustre, you can check the open files on the MDS:
lctl get_param mdt.*.exports.{NID}.open_files
Or something similar. This /proc entry might have only landed in 2.6.
Cheers, Andreas
On Mar 13, 2015, at 03:18, Cédric Dufour - Idiap Research Institute
<cedric.dufour@idiap.ch> wrote:
Hello,
Lately, one of our (12TB) OST jumped from ~50% to 100% capacity in a matter of hours.
We switched that OST to INACTIVE before it reached 100% but it kept filling, indicating
an ongoing file write.
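For reference, stopping new object allocation on a single OST is typically done on the MDS via the corresponding osc device; a sketch only (the target names below are assumptions derived from the filesystem name above, and existing objects on the OST remain readable and writable, which matches the "kept filling" behavior):

```shell
# On the MDS: stop allocating new objects on OST000a.
lctl set_param osc.lustre-1-OST000a-osc-MDT0000.active=0
# Re-enable allocation later:
lctl set_param osc.lustre-1-OST000a-osc-MDT0000.active=1
```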
At the time it reached 100%, we got an ENOSPC on one of our clients:
helvetix05: 2014-03-11 09:35:06 helvetix05 kernel: [1610930.474849] LustreError:
4431:0:(vvp_io.c:1022:vvp_io_commit_write()) Write page 1572607068 of inode
ffff8800c96277c8 failed -28
We tried to catch the runaway file using 'lfs find', but with ~250 million files
on the filesystem, this is no easy feat.
We also asked the suspected user, but he has no idea what/how things went wrong.
QUESTION:
Can we assume the 1572607068 page figure points to a ~6TB file (1572607068 * 4096 bytes)?
(this would be consistent with the given OST capacity figures)
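Assuming the standard 4 KiB page size, the arithmetic can be checked quickly:

```shell
# Offset of the failing write: page index * page size (4 KiB pages assumed).
bytes=$((1572607068 * 4096))
echo "$bytes"     # -> 6441398550528
# Express it in decimal TB and binary TiB:
awk -v b="$bytes" 'BEGIN { printf "%.1f TB, %.1f TiB\n", b/1e12, b/2^40 }'
# -> 6.4 TB, 5.9 TiB
```

So roughly 6.4 TB (5.9 TiB), which is indeed in the right ballpark for the OST capacity figures, provided the failing write was near the end of the file (the page index gives an offset, not necessarily the file size).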
Is there a way to find which file corresponds to the ffff8800c96277c8 inode ?
Is there a way to perform the equivalent of the 'lfs find' directly on the MDS
(e.g. by mounting the underlying ldiskfs) ?
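For the last question, one commonly described approach is to mount the MDT device read-only as ldiskfs and search its ROOT/ namespace with plain find; and where a FID is known, 'lfs fid2path' can map it back to a pathname on a client. A sketch only, not verified on this setup; the device name, mount point, and pattern below are placeholders (note also that the kernel message above prints a kernel inode address, not a FID, so fid2path only helps once a FID is obtained from elsewhere):

```shell
# On the MDS -- /dev/sdX is a placeholder for the real MDT block device.
mount -t ldiskfs -o ro /dev/sdX /mnt/mdt-ldiskfs
# The client-visible namespace lives under ROOT/ on the MDT.
find /mnt/mdt-ldiskfs/ROOT -name '<pattern>'
umount /mnt/mdt-ldiskfs

# On a client, map a known FID back to a pathname:
lfs fid2path /remote/lustre/1 '<FID>'
```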
Thanks for your help,
Cédric
--
Cédric Dufour @ Idiap Research Institute
EPFL Engineer
E-mail: mailto:cedric.dufour@idiap.ch
Phone: +41 27 721 77 40
Fax: +41 27 721 77 12
Mail: Idiap Research Institute
Case postale 592
Centre du Parc - Rue Marconi 19
1920 Martigny (VS)
Suisse (Switzerland)
Website:
http://www.idiap.ch /
http://www.idiap.ch/~cdufour
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss