On May 7, 2013, at 3:34 AM, "Laifer, Roland (SCC)"
<roland.laifer@kit.edu> wrote:
Hello,
we see similar messages pretty frequently - about once per week.
I have no solution but the "osc_ldlm_completion_ast()) dlmlock returned
-5" messages usually also indicate that the applications are getting
I/O errors.
Usually this means the client was evicted by the OST. Do you see messages in the syslog
such as:
"Connection to … was lost"
on the client side, or
"waiting_locks_callback()) ### lock callback timer expired after …"
on the OST?
If this is the case, please open a new ticket on
jira.hpdd.intel.com and attach Lustre log messages from
both the client and the OST, each collected with the command `lctl dk logfile'.
Jinshan
I strongly believe that the problem is not related to
hardware or network problems and is rather caused by "bad" applications
(e.g. one user reported that he is doing a lot of I/O with several
thousand files). I definitely see that only a few users are getting
this kind of error, on different clients.
We are currently using Lustre 2.1.3 on servers and Lustre 2.3.0 plus
patches on clients. However, I've also seen similar messages before
we had migrated, i.e. with Lustre 1.8.7-wc1.
Regards,
Roland
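The numeric codes in these messages look like negated Linux errno values, which fits the observation above that "dlmlock returned -5" goes along with application I/O errors. A minimal sketch for decoding them follows. The table is hard-coded with the standard Linux values for the three codes that appear in this thread, and the helper name decode_rc is made up for illustration:

```python
# Linux errno values for the codes seen in this thread. They are hard-coded
# so the result does not depend on the platform running the script.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def decode_rc(rc):
    """Hypothetical helper: map a negative return code to (name, text)."""
    return LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in table"))


if __name__ == "__main__":
    for rc in (-5, -16, -107):
        name, text = decode_rc(rc)
        print(f"{rc}: {name} ({text})")
```

Read this way, -16 on ost_connect would mean the OST was still busy with the old connection, and -107 on ost_destroy would mean the client was no longer connected when the request arrived.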
On 06.05.2013 at 22:34, Lee, Brett wrote:
Hi Mi,
Nice to see you using this forum!
It is not clear why, but it appears that:
1. OST0006 refused the connection to a client
2. The client lost its connection to OST0008
3. The client was not able to reconnect in time to the OST to avoid eviction
4. The client was evicted
5. The client eventually reconnected to OST0008
Looks kinda like a network partition occurred, and Lustre's "recovery"
mechanism kicked in to ensure that data was not lost/corrupted. Do you have any
additional info?
--
Brett Lee
Sr. Systems Engineer
Intel High Performance Data Division
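The sequence described above (refused, lost, evicted, restored) can be pulled out of a syslog with a few regular expressions. This is only a sketch: the match strings are copied from the log excerpts in this thread, while the event labels and the function name classify are made up for illustration and are not part of any Lustre tool.

```python
import re

# Match strings taken from the log excerpts in this thread.
# The labels on the left are illustrative names, not Lustre terminology.
PATTERNS = [
    ("refused", re.compile(r"refused reconnection")),
    ("lost", re.compile(r"Connection to (\S+) .*was lost")),
    ("evicted", re.compile(r"This client was evicted by ([^;\s]+)")),
    ("restored", re.compile(r"Connection restored to (\S+)")),
]


def classify(entry):
    """Return (label, target) for one log entry, or None if nothing matches."""
    # Collapse line wraps and repeated spaces so wrapped entries still match.
    text = " ".join(entry.split())
    for label, rx in PATTERNS:
        m = rx.search(text)
        if m:
            return label, (m.group(1) if m.groups() else None)
    return None


if __name__ == "__main__":
    sample = [
        "Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection to "
        "scratch-OST0008 (at\n192.168.100.3@o2ib) was lost; in progress "
        "operations using this service will wait for recovery to complete",
        "LustreError: 167-0: This client was evicted by scratch-OST0008; "
        "in progress operations using this service will fail.",
        "Lustre: scratch-OST0008-osc-ffff880c3fe37400: Connection restored "
        "to\nscratch-OST0008 (at 192.168.100.3@o2ib)",
    ]
    for entry in sample:
        print(classify(entry))
```

Running it over both the client and the OSS syslog and sorting by timestamp should show whether the eviction always follows a refused reconnection.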
-----Original Message-----
From: hpdd-discuss-bounces@lists.01.org
[mailto:hpdd-discuss-bounces@lists.01.org] On Behalf Of Mi Zhou
Sent: Monday, May 06, 2013 8:27 AM
To: hpdd-discuss@lists.01.org
Subject: [HPDD-discuss] OST refused connection from client
Hi,
We sometimes see the following error message on OSSs. And the
May 5 20:47:16 lustre-oss03 kernel: Lustre: scratch-OST0006: Client
511ae429-07b7-f9ca-22b6-f0f8839b8029 (at 192.168.102.37@o2ib) refused
reconnection, still busy with 1 active RPCs
On the client whose reconnection was refused, the errors are as below:
May 5 20:47:03 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for sent delay: [sent 1367804814/real 0] req@ffff881849d84800
x1433750448809719/t0(0)
o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens
296/352 e 0 to 1 dl 1367804823 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 5 20:47:03 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
May 5 20:47:03 nodem37 kernel: Lustre:
scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at
192.168.100.3@o2ib) was lost; in progress operations using this service will
wait for recovery to complete
May 5 20:47:03 nodem37 kernel: Lustre: Skipped 1 previous similar message
May 5 20:47:04 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for sent delay: [sent 1367804815/real 0] req@ffff880a86515400
x1433750448809779/t0(0)
o101->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:28/4 lens
296/352 e 0 to 1 dl 1367804824 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 5 20:47:04 nodem37 kernel: Lustre:
2424:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 6 previous
similar messages
May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.100.3@o2ib. The ost_connect operation failed
with -16
May 5 20:47:05 nodem37 kernel: LustreError: Skipped 1 previous similar message
May 5 20:47:05 nodem37 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.100.3@o2ib. The ost_connect operation failed
with -16
May 5 20:47:28 nodem37 kernel: Lustre:
scratch-OST0007-osc-ffff880c3fe37400: Connection restored to
scratch-OST0007 (at 192.168.100.3@o2ib)
May 5 20:47:28 nodem37 kernel: Lustre: Skipped 1 previous similar message
May 5 20:49:09 nodem37 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.100.3@o2ib. The ost_destroy operation failed
with -107
May 5 20:49:09 nodem37 kernel: LustreError: Skipped 1 previous similar message
May 5 20:49:09 nodem37 kernel: Lustre:
scratch-OST0008-osc-ffff880c3fe37400: Connection to scratch-OST0008 (at
192.168.100.3@o2ib) was lost; in progress operations using this service will
wait for recovery to complete
May 5 20:49:09 nodem37 kernel: Lustre: Skipped 2 previous similar messages
May 5 20:49:09 nodem37 kernel: LustreError: 167-0: This client was evicted by
scratch-OST0008; in progress operations using this service will fail.
May 5 20:49:09 nodem37 kernel: LustreError:
2422:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff88184061d400 x1433750448823924/t0(0)
o4->scratch-OST0008-osc-ffff880c3fe37400@192.168.100.3@o2ib:6/4 lens
456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
May 5 20:49:09 nodem37 kernel: LustreError:
2422:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 5687 previous
similar messages
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast())
lock@ffff88128c19b698[2 2 0 1 1 00000000] W(2):[0,
0]@[0x100080000:0xcdb5aed5:0x0] {
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast())
lovsub@ffff880db54ec860: [0 ffff8810e95d6e30 W(2):[0,
0]@[0x201c50c90:0x16927:0x0]]
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff88169bf71d78:
ffff881344ac6240 40120002 0x7293132dc153773c 2 (null) size: 0 mtime:
1367804804 atime: 1367804804 ctime: 1367804804 blocks: 0
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff88128c19b698
May 5 20:49:09 nodem37 kernel: LustreError:
5585:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
May 5 20:49:09 nodem37 kernel: Lustre:
scratch-OST0008-osc-ffff880c3fe37400: Connection restored to
scratch-OST0008 (at 192.168.100.3@o2ib)
Has anybody seen this? Any advice is appreciated.
Thanks
Mi
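A small aid for reading the timed-out request lines above: assuming the "sent" and "dl" fields are Unix epoch seconds (an assumption, though the values line up with the syslog timestamps), they can be converted to wall-clock time and compared. The helper name to_utc is made up for illustration.

```python
from datetime import datetime, timezone

# Values copied from the first ptlrpc_expire_one_request() entry above.
# Treating them as Unix epoch seconds is an assumption of this sketch.
sent = 1367804814      # "sent" field
deadline = 1367804823  # "dl" field


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print("sent:    ", to_utc(sent).isoformat())
    print("deadline:", to_utc(deadline).isoformat())
    print("allowed seconds:", deadline - sent)
```

Under that assumption the request was sent at 01:46:54 UTC with a 9-second deadline, which would match the 20:47:03 syslog timestamp if the node's clock is 5 hours behind UTC.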
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss