Hi Mi,
Nice to see you using this forum!
It is not clear why this happened, but it appears that:
1. OST0006 refused the connection to a client
2. The client lost its connection to OST0008
3. The client was not able to reconnect in time to the OST to avoid eviction
4. The client was evicted
5. The client eventually reconnected to OST0008
It looks like a network partition occurred, and Lustre's "recovery"
mechanism kicked in to ensure that data was not lost or corrupted. Do you have
any additional info?
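To line these events up across the OSS and client logs, here is a minimal Python sketch that pulls the refused/lost/evicted/restored messages out of syslog lines. The regular expressions and the `timeline` helper are assumptions written against the message wording quoted in this thread; they are not part of Lustre, and other Lustre versions may word these messages differently.

```python
import re

# Syslog prefix as it appears in this thread: "May 5 20:47:16 host kernel:".
HEADER = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+kernel:")

# Target names such as "scratch-OST0006" (hypothetical pattern, 4 hex digits).
TARGET = r"(\S+-OST[0-9a-fA-F]{4})"

# Event kinds and the message fragments they are matched on (assumed wording).
EVENT_PATTERNS = [
    ("refused", re.compile(TARGET + r": Client \S+ \(at \S+\) refused reconnection")),
    ("lost", re.compile(r"Connection to " + TARGET + r" \(at \S+\) was lost")),
    ("evicted", re.compile(r"This client was evicted by " + TARGET)),
    ("restored", re.compile(r"Connection restored to " + TARGET)),
]


def timeline(lines):
    """Return (timestamp, host, event, target) tuples in input order."""
    events = []
    for line in lines:
        head = HEADER.match(line)
        if not head:
            continue  # not a kernel syslog line in the expected format
        for kind, pattern in EVENT_PATTERNS:
            found = pattern.search(line)
            if found:
                events.append((head.group(1), head.group(2), kind, found.group(1)))
                break
    return events


# Lines copied from the logs in this thread, one syslog entry per string.
SAMPLE = [
    "May 5 20:47:16 lustre-oss03 kernel: Lustre: scratch-OST0006: Client "
    "511ae429-07b7-f9ca-22b6-f0f8839b8029 (at 192.168.102.37@o2ib) refused "
    "reconnection, still busy with 1 active RPCs",
    "May 5 20:47:03 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: "
    "Connection to scratch-OST0008 (at 192.168.100.3@o2ib) was lost; in progress "
    "operations using this service will wait for recovery to complete",
    "May 5 20:49:09 nodem37 kernel: LustreError: 167-0: This client was evicted "
    "by scratch-OST0008; in progress operations using this service will fail.",
    "May 5 20:49:09 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: "
    "Connection restored to scratch-OST0008 (at 192.168.100.3@o2ib)",
]

if __name__ == "__main__":
    for stamp, host, kind, target in timeline(SAMPLE):
        print(stamp, host, kind, target)
```

Feeding it the combined OSS and client logs for the same time window should make it easier to see which target each step refers to.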
--
Brett Lee
Sr. Systems Engineer
Intel High Performance Data Division
-----Original Message-----
From: hpdd-discuss-bounces(a)lists.01.org [mailto:hpdd-discuss-bounces(a)lists.01.org] On Behalf Of Mi Zhou
Sent: Monday, May 06, 2013 8:27 AM
To: hpdd-discuss(a)lists.01.org
Subject: [HPDD-discuss] OST refused connection from client
Hi,
We sometimes see the following error message on the OSSs:
May 5 20:47:16 lustre-oss03 kernel: Lustre: scratch-OST0006: Client 511ae429-07b7-f9ca-22b6-f0f8839b8029 (at 192.168.102.37@o2ib) refused reconnection, still busy with 1 active RPCs
On the client whose connection was refused, the errors are as follows:
May 5 20:47:03 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1367804814/real 0] req@ffff881849d84800 x1433750448809719/t0(0) o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens 296/352 e 0 to 1 dl 1367804823 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 5 20:47:03 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
May 5 20:47:03 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at 192.168.100.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 5 20:47:03 nodem37 kernel: Lustre: Skipped 1 previous similar message
May 5 20:47:04 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1367804815/real 0] req@ffff880a86515400 x1433750448809779/t0(0) o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens 296/352 e 0 to 1 dl 1367804824 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 5 20:47:04 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.3@o2ib. The ost_connect operation failed with -16
May 5 20:47:05 nodem37 kernel: LustreError: Skipped 1 previous similar message
May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.3@o2ib. The ost_connect operation failed with -16
May 5 20:47:28 nodem37 kernel: Lustre: scratch-OST0007-osc-ffff880c3fe37400: Connection restored to scratch-OST0007 (at 192.168.100.3@o2ib)
May 5 20:47:28 nodem37 kernel: Lustre: Skipped 1 previous similar message
May 5 20:49:09 nodem37 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.3@o2ib. The ost_destroy operation failed with -107
May 5 20:49:09 nodem37 kernel: LustreError: Skipped 1 previous similar message
May 5 20:49:09 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at 192.168.100.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 5 20:49:09 nodem37 kernel: Lustre: Skipped 2 previous similar messages
May 5 20:49:09 nodem37 kernel: LustreError: 167-0: This client was evicted by scratch-OST0008; in progress operations using this service will fail.
May 5 20:49:09 nodem37 kernel: LustreError: 2422:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88184061d400 x1433750448823924/t0(0) o4->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
May 5 20:49:09 nodem37 kernel: LustreError: 2422:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 5687 previous similar messages
May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff88128c19b698[2 2 0 1 1 00000000] W(2):[0, 0]@[0x100080000:0xcdb5aed5:0x0] {
May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff880db54ec860: [0 ffff8810e95d6e30 W(2):[0, 0]@[0x201c50c90:0x16927:0x0]]
May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff88169bf71d78: ffff881344ac6240 40120002 0x7293132dc153773c 2 (null) size: 0 mtime: 1367804804 atime: 1367804804 ctime: 1367804804 blocks: 0
May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff88128c19b698
May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
May 5 20:49:09 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection restored to scratch-OST0008 (at 192.168.100.3@o2ib)
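The negative numbers in these messages (-16, -107, -5) follow the usual kernel convention of negated Linux errno values. As a reading aid, here is a small Python sketch that decodes them; the lookup table holds only the three codes seen above with their standard Linux meanings, and the `decode_return_codes` helper and its regular expression are illustrative assumptions, not Lustre tooling.

```python
import re

# Standard Linux errno values for the three codes that appear in the log above.
# Hard-coded so the result does not depend on the platform running the script.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches "failed with -16" and "returned -5" as worded in this thread (assumed).
RC_PATTERN = re.compile(r"(?:failed with|returned) -(\d+)")


def decode_return_codes(line):
    """Return (code, name, description) for each return code found in a log line."""
    decoded = []
    for match in RC_PATTERN.finditer(line):
        number = int(match.group(1))
        name, text = LINUX_ERRNO.get(number, ("UNKNOWN", "not in table"))
        decoded.append((-number, name, text))
    return decoded


if __name__ == "__main__":
    for sample in (
        "The ost_connect operation failed with -16",
        "The ost_destroy operation failed with -107",
        "dlmlock returned -5",
    ):
        print(sample, "->", decode_return_codes(sample))
```

Read this way, -16 (EBUSY) on ost_connect lines up with the "still busy with 1 active RPCs" message on the OSS, -107 (ENOTCONN) appears just before the client is told it was evicted, and -5 (EIO) is what the lock completion returns after the eviction.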
Has anybody seen this? Any advice is appreciated.
Thanks
Mi