Basics:
Lustre 2.5.3
RHEL 6.x (2.6.32-504)
ZFS 0.6.3
The MDS is configured with a set of zpool mirrors on SAS hard drives, with the
log and cache devices on Hitachi enterprise SSDs. The MDS has 20 Sandy Bridge
cores (plus hyperthreading) and 128 GB of memory. There are 12 OSSes at this
point.
# df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/md0                    20G  3.7G   15G  21% /
tmpfs                       64G     0   64G   0% /dev/shm
/dev/md2                    58G   52M   55G   1% /scratch
lzfs/mgt                    57T  768K   57T   1% /mnt/lustre/local/mgt
lzfs/mdt0                   57T   47G   57T   1% /mnt/lustre/local/mdt0
172.17.210.11@o2ib9:/lzfs  933T   31T  902T   4% /lzfs
  pool: lzfs
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        lzfs                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000cca03b9ad72c  ONLINE       0     0     0
            scsi-35000cca03bb35190  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000cca03b980e58  ONLINE       0     0     0
            scsi-35000cca03bb565bc  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            scsi-35000cca03b9ad6b8  ONLINE       0     0     0
            scsi-35000cca03ba934f0  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            scsi-35000cca03b9ad5b4  ONLINE       0     0     0
            scsi-35000cca03bb34d6c  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            scsi-35000cca03bb29720  ONLINE       0     0     0
            scsi-35000cca03b81e630  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            scsi-35000cca03bb19bb0  ONLINE       0     0     0
            scsi-35000cca03b9798f4  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            scsi-35000cca03bb29b04  ONLINE       0     0     0
            scsi-35000cca03bb36490  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            scsi-35000cca03ba577a8  ONLINE       0     0     0
            scsi-35000cca03baf9cdc  ONLINE       0     0     0
          mirror-8                  ONLINE       0     0     0
            scsi-35000cca03bb19c60  ONLINE       0     0     0
            scsi-35000cca03b9aa69c  ONLINE       0     0     0
          mirror-9                  ONLINE       0     0     0
            scsi-35000cca03bb05878  ONLINE       0     0     0
            scsi-35000cca03b99f7a4  ONLINE       0     0     0
          mirror-10                 ONLINE       0     0     0
            scsi-35000cca03bb15830  ONLINE       0     0     0
            scsi-35000cca03b9147d4  ONLINE       0     0     0
          mirror-11                 ONLINE       0     0     0
            scsi-35000cca03bb34f20  ONLINE       0     0     0
            scsi-35000cca03b99b740  ONLINE       0     0     0
          mirror-12                 ONLINE       0     0     0
            scsi-35000cca03b9f84e8  ONLINE       0     0     0
            scsi-35000cca03b9a8a0c  ONLINE       0     0     0
          mirror-13                 ONLINE       0     0     0
            scsi-35000cca03bb042b0  ONLINE       0     0     0
            scsi-35000cca03b99ec20  ONLINE       0     0     0
          mirror-14                 ONLINE       0     0     0
            scsi-35000cca03baccce4  ONLINE       0     0     0
            scsi-35000cca03b90ced0  ONLINE       0     0     0
          mirror-15                 ONLINE       0     0     0
            scsi-35000cca03ba930f4  ONLINE       0     0     0
            scsi-35000cca03b9a8984  ONLINE       0     0     0
        logs
          scsi-35000a7203009dbcb    ONLINE       0     0     0
        cache
          scsi-35000a7203009ccba    ONLINE       0     0     0
        spares
          scsi-35000cca03bb34360    AVAIL
The file system serves a 300-node HPC cluster that isn't especially I/O-heavy.
We have started seeing a few client disconnects, and it feels like they are
related to high load on the MDS.
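One thing I have been watching is the raw context-switch rate on the MDS. A
minimal sketch, sampling the kernel's "ctxt" counter from /proc/stat (standard
Linux path; the counter is cumulative since boot):

```shell
# Sample the system-wide context-switch counter twice, one second apart;
# the difference is roughly the context switches per second.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
```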
Here is a "typical" output from "top"; everything at the top of the process
list is context switching (the migration/* kernel threads).
Tasks: 1620 total,   7 running, 1613 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  4.1%sy,  0.0%ni, 95.2%id,  0.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132129680k total, 127914680k used,   4215000k free,    56840k buffers
Swap:  61407100k total,     20956k used,  61386144k free,    82024k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
   15 root      RT   0     0    0    0 S 61.0  0.0 408:35.41 migration/3
   63 root      RT   0     0    0    0 S 57.3  0.0 366:21.47 migration/15
   95 root      RT   0     0    0    0 S 55.5  0.0 330:32.20 migration/23
   75 root      RT   0     0    0    0 S 25.9  0.0   6492:20 migration/18
   67 root      RT   0     0    0    0 S 18.5  0.0 307:36.79 migration/16
   79 root      RT   0     0    0    0 S 14.8  0.0 765:44.70 migration/19
   47 root      RT   0     0    0    0 R 12.9  0.0 361:58.77 migration/11
 2983 root      20   0     0    0    0 R  7.4  0.0 124:22.16 kondemand/13
 8790 root      20   0 16248 2328  832 R  7.4  0.0   0:00.06 top
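As an aside, the kondemand/13 thread in that listing belongs to the "ondemand"
cpufreq governor, so I have also been checking whether frequency scaling is in
play. A quick sketch (standard sysfs paths, which may be absent on some
kernels, so missing entries are skipped):

```shell
# Report the cpufreq governor for each core; kondemand/* kernel threads
# only exist when the "ondemand" governor is active somewhere.
checked=0
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -r "$g" ] || continue
    printf '%s: %s\n' "$g" "$(cat "$g")"
    checked=$((checked + 1))
done
echo "governors read: $checked"
```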
We are using "noop" for the I/O scheduler on all the drives.
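For reference, this is roughly how I verify that; the active scheduler is the
bracketed entry in each device's sysfs file (standard paths assumed;
unreadable or missing entries are skipped):

```shell
# List the active I/O scheduler for every block device; the entry shown
# in [brackets] is the one in use (we expect [noop] on all the drives).
devices=0
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}; dev=${dev%/queue/scheduler}
    printf '%s: %s\n' "$dev" "$(cat "$f")"
    devices=$((devices + 1))
done
echo "devices checked: $devices"
```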
I've never seen this type of behavior with a Lustre MDS, and I assume it is
related to ZFS, but I'm at a loss after some googling. Is this "typical" of a
ZFS-backed MDS? Any pointers to configuration or tuning would be appreciated.
Thanks
Tim