Hello,
we see similar messages pretty frequently - about once per week.
I have no solution, but the "osc_ldlm_completion_ast()) dlmlock returned
-5" messages usually also mean that the applications are getting
I/O errors. I strongly believe that the problem is not related to
hardware or network problems and is rather caused by "bad" applications
(e.g. one user reported that he was doing a lot of I/O on several
thousand files). I can definitely see that only a few users, on
different clients, are getting this kind of error.
We are currently running Lustre 2.1.3 on the servers and Lustre 2.3.0
plus patches on the clients. However, I had also seen similar messages
before we migrated, i.e. with Lustre 1.8.7-wc1.
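Side note: the negative codes in these messages (-5 here, -16 and -107 in the quoted logs below) are ordinary kernel errno values, so they can be decoded quickly, e.g. in Python:

```python
import errno
import os

# Decode the negative return codes seen in the Lustre console messages.
# They are standard Linux errno values (sign flipped in the kernel logs).
for code in (5, 16, 107):
    print(-code, errno.errorcode[code], "-", os.strerror(code))
# -5  EIO     - I/O error (what the applications see)
# -16 EBUSY   - device or resource busy (the refused ost_connect)
# -107 ENOTCONN - transport endpoint is not connected (the failed ost_destroy)
```

That is consistent with the symptom you describe: -5 (EIO) reaching the application after the client is evicted.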
Regards,
Roland
On 06.05.2013 22:34, Lee, Brett wrote:
Hi Mi,
Nice to see you using this forum!
The circumstances are not entirely clear, but it appears that:
1. OST0006 refused the connection to a client
2. The client lost its connection to OST0008
3. The client was not able to reconnect in time to the OST to avoid eviction
4. The client was evicted
5. The client eventually reconnected to OST0008
Looks kinda like a network partition occurred, and Lustre's "recovery"
mechanism kicked in to ensure that data was not lost/corrupted. Do you have any
additional info?
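If it helps, here is a rough way to pull that timeline out of a client's /var/log/messages. This is only a sketch: the event patterns, the 15-character syslog timestamp prefix, and the OST-name regex are assumptions about the log format, not anything Lustre-specific.

```python
import re

# Map substrings of Lustre console messages to timeline event labels.
# These patterns are guesses based on the messages quoted in this thread.
EVENTS = {
    r"refused reconnection": "refused",
    r"was lost": "lost",
    r"evicted by": "evicted",
    r"Connection restored": "restored",
}

def timeline(lines):
    """Return (timestamp, OST, event) tuples for matching syslog lines."""
    out = []
    for line in lines:
        for pattern, label in EVENTS.items():
            if re.search(pattern, line):
                ts = line[:15]  # syslog prefix, e.g. "May  5 20:47:03"
                ost = re.search(r"OST[0-9a-f]{4}", line)
                out.append((ts, ost.group(0) if ost else "?", label))
                break
    return out

# Abbreviated lines from the logs quoted below:
sample = [
    "May  5 20:47:03 nodem37 kernel: Lustre: scratch-OST0008-osc: "
    "Connection to scratch-OST0008 was lost; ...",
    "May  5 20:49:09 nodem37 kernel: LustreError: 167-0: "
    "This client was evicted by scratch-OST0008; ...",
    "May  5 20:49:09 nodem37 kernel: Lustre: scratch-OST0008-osc: "
    "Connection restored to scratch-OST0008 ...",
]
for event in timeline(sample):
    print(event)
```

Running that over the full client log (and the matching OSS log) would show whether the lost/evicted/restored sequence lines up with a network event.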
--
Brett Lee
Sr. Systems Engineer
Intel High Performance Data Division
> -----Original Message-----
> From: hpdd-discuss-bounces(a)lists.01.org [mailto:hpdd-discuss-
> bounces(a)lists.01.org] On Behalf Of Mi Zhou
> Sent: Monday, May 06, 2013 8:27 AM
> To: hpdd-discuss(a)lists.01.org
> Subject: [HPDD-discuss] OST refused connection from client
>
> Hi,
>
> We sometimes see the following error message on the OSSs:
>
> May 5 20:47:16 lustre-oss03 kernel: Lustre: scratch-OST0006: Client 511ae429-07b7-f9ca-22b6-f0f8839b8029 (at 192.168.102.37@o2ib) refused reconnection, still busy with 1 active RPCs
>
> And on the client whose connection was refused, the errors are as below:
>
> May 5 20:47:03 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1367804814/real 0] req@ffff881849d84800 x1433750448809719/t0(0) o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens 296/352 e 0 to 1 dl 1367804823 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
> May 5 20:47:03 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
> May 5 20:47:03 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at 192.168.100.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May 5 20:47:03 nodem37 kernel: Lustre: Skipped 1 previous similar message
> May 5 20:47:04 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1367804815/real 0] req@ffff880a86515400 x1433750448809779/t0(0) o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens 296/352 e 0 to 1 dl 1367804824 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
> May 5 20:47:04 nodem37 kernel: Lustre: 2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
> May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.3@o2ib. The ost_connect operation failed with -16
> May 5 20:47:05 nodem37 kernel: LustreError: Skipped 1 previous similar message
> May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.3@o2ib. The ost_connect operation failed with -16
> May 5 20:47:28 nodem37 kernel: Lustre: scratch-OST0007-osc-ffff880c3fe37400: Connection restored to scratch-OST0007 (at 192.168.100.3@o2ib)
> May 5 20:47:28 nodem37 kernel: Lustre: Skipped 1 previous similar message
> May 5 20:49:09 nodem37 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.3@o2ib. The ost_destroy operation failed with -107
> May 5 20:49:09 nodem37 kernel: LustreError: Skipped 1 previous similar message
> May 5 20:49:09 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at 192.168.100.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> May 5 20:49:09 nodem37 kernel: Lustre: Skipped 2 previous similar messages
> May 5 20:49:09 nodem37 kernel: LustreError: 167-0: This client was evicted by scratch-OST0008; in progress operations using this service will fail.
> May 5 20:49:09 nodem37 kernel: LustreError: 2422:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88184061d400 x1433750448823924/t0(0) o4->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
> May 5 20:49:09 nodem37 kernel: LustreError: 2422:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 5687 previous similar messages
> May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff88128c19b698[2 2 0 1 1 00000000] W(2):[0, 0]@[0x100080000:0xcdb5aed5:0x0] {
> May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff880db54ec860: [0 ffff8810e95d6e30 W(2):[0, 0]@[0x201c50c90:0x16927:0x0]]
> May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff88169bf71d78: ffff881344ac6240 40120002 0x7293132dc153773c 2 (null) size: 0 mtime: 1367804804 atime: 1367804804 ctime: 1367804804 blocks: 0
> May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff88128c19b698
> May 5 20:49:09 nodem37 kernel: LustreError: 5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
> May 5 20:49:09 nodem37 kernel: Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection restored to scratch-OST0008 (at 192.168.100.3@o2ib)
>
> Has anybody seen this? Any advice is appreciated.
>
> Thanks
>
> Mi
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
> Consultation Disclaimer: www.stjude.org/consultationdisclaimer
>
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss(a)lists.01.org
> https://lists.01.org/mailman/listinfo/hpdd-discuss