Hi,
Thanks everyone for the input.
I do see "Connection to ... was lost" on the client side, but I do not
see any "waiting_locks_callback()" messages.
Below is another instance:
Error on client:
May 7 01:33:18 nodem14 kernel: Lustre:
2519:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for sent delay: [sent 1367908387/real 0] req@ffff880b5d73ac00
x1434028094681259/t0(0)
o101->scratch-OST000a-osc-ffff880c43e21400@192.168.100.4@o2ib:28/4 lens
296/352 e 0 to 1 dl 1367908398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 7 01:33:18 nodem14 kernel: Lustre:
scratch-OST000a-osc-ffff880c43e21400: Connection to scratch-OST000a (at
192.168.100.4@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
May 7 01:33:18 nodem14 kernel: Lustre:
2519:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for sent delay: [sent 1367908387/real 0] req@ffff8809a750b000
x1434028094681262/t0(0)
o101->scratch-OST000b-osc-ffff880c43e21400@192.168.100.4@o2ib:28/4 lens
296/352 e 0 to 1 dl 1367908398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 7 01:33:18 nodem14 kernel: Lustre:
scratch-OST000b-osc-ffff880c43e21400: Connection to scratch-OST000b (at
192.168.100.4@o2ib) was lost; in progress operations using this service
will wait for recovery to complete
May 7 01:33:20 nodem14 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.100.4@o2ib. The ost_connect operation
failed with -16
May 7 01:33:20 nodem14 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.100.4@o2ib. The ost_connect operation
failed with -16
May 7 01:33:43 nodem14 kernel: Lustre:
scratch-OST0009-osc-ffff880c43e21400: Connection restored to
scratch-OST0009 (at 192.168.100.4@o2ib)
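For reference, the "sent" and "dl" fields in the
ptlrpc_expire_one_request() lines look like Unix epoch seconds, so the
timeout window can be read straight from the message. Below is a minimal
Python sketch under that assumption; request_window is a made-up helper
name, not anything shipped with Lustre, and the message is abridged from
the first request in the log above.

```python
import re
from datetime import datetime, timezone

# Hypothetical helper, not part of Lustre: extracts the "sent" and "dl"
# (deadline) fields from a ptlrpc_expire_one_request() message.
# Assumption: both values are Unix epoch seconds.
def request_window(msg):
    sent = int(re.search(r"\[sent (\d+)/real (\d+)\]", msg).group(1))
    deadline = int(re.search(r"\bdl (\d+)\b", msg).group(1))
    sent_utc = datetime.fromtimestamp(sent, tz=timezone.utc)
    # Return the send time in UTC and the seconds allowed before expiry.
    return sent_utc, deadline - sent

# Abridged copy of the first client message above.
msg = ("Request sent has timed out for sent delay: "
       "[sent 1367908387/real 0] req@ffff880b5d73ac00 "
       "x1434028094681259/t0(0) lens 296/352 e 0 to 1 "
       "dl 1367908398 ref 2")

sent_utc, window = request_window(msg)
print(sent_utc.isoformat(), window)
```

For that request this gives a send time of 2013-05-07 06:33:07 UTC and
an 11-second window. That matches the 01:33:18 syslog timestamp if
nodem14 logs in a UTC-5 timezone (an assumption on my part).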
Error on OSS:
May 7 01:33:20 lustre-oss04 kernel: Lustre: scratch-OST000a: Bulk IO
write error with 77b5db75-5d82-1976-0116-5ef24f9febee (at
192.168.102.14@o2ib), client will retry: rc -110
May 7 01:33:43 lustre-oss04 kernel: Lustre: scratch-OST0009: Client
77b5db75-5d82-1976-0116-5ef24f9febee (at 192.168.102.14@o2ib) reconnecting
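Lustre reports kernel return codes as negated Linux errno values, which
covers the -16 from ost_connect on the client and the rc -110 from the
bulk write on the OSS. Here is a small lookup sketch in Python; decode_rc
and the table are illustrative only, and the table is limited to the two
codes that appear in these logs.

```python
# Linux errno values for the two return codes seen in the logs above.
# Hardcoded instead of using Python's errno module so the result does
# not depend on the platform the script runs on.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

def decode_rc(rc):
    # Lustre logs the negated value, so look up the absolute value.
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"rc {rc}: {name} ({text})"

for rc in (-16, -110):
    print(decode_rc(rc))
```

Read this way, the client's reconnect attempts were refused as busy
(-16), while the OSS saw the bulk write time out (-110).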
I agree it is caused, at least partially, by some I/O-intensive
application. I wonder if there is anything we can do on the Lustre side
to alleviate the problem, for example lowering the number of threads.
Thanks
Mi