Dear Lustre Experts,
We currently run the Lustre 2.4.2 maintenance release (Whamcloud build) on
RHEL6U4 Linux.
We have an issue on the client side: the client fails to reconnect because
the MDS refuses the connection.
We have to reboot the MDS to recover the client.
Client side:
Lustre: 8863:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent
has timed out for slow reply: [sent 1398453761/real 1398453761]
req@ffff88009e281800 x1465479219941096/t0(0)
o101->data1-MDT0000-mdc-ffff880335a7bc00@10.64.18.12@tcp1:12/10 lens
576/1136 e 5 to 1 dl 1398454684 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
Lustre: data1-MDT0000-mdc-ffff880335a7bc00: Connection to data1-MDT0000 (at
10.64.18.12@tcp1) was lost; in progress operations using this service will
wait for recovery to complete
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: Skipped 1 previous similar message
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: Skipped 2 previous similar messages
LustreError: 26416:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0
(data1-MDT0000-mdc-ffff880335a7bc00), error -16
LustreError: 26416:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs
fails: rc = -16
LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with
10.64.18.12@tcp1, operation mds_connect failed with -16.
LustreError: Skipped 5 previous similar messages
LustreError: 26474:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0
(data1-MDT0000-mdc-ffff880335a7bc00), error -16
LustreError: 26474:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs
fails: rc = -16
MDS side:
Lustre: 18484:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@
Couldn't add any time (5/-207), not sending early reply
req@ffff8800514b2400 x1465479219941096/t0(0)
o101->60155cc5-7c3a-a0af-08a5-19451109c288@10.64.18.11@tcp1:0/0 lens
576/1152 e 5 to 0 dl 1398454573 ref 2 fl Interpret:/0/0 rc 0/0
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at
10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
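The -16 on the client is EBUSY on Linux, which appears to match the "still busy with 1 active RPCs" refusal on the MDS. To tally how often each client is refused, a minimal sketch along these lines could be run over the MDS syslog. The regular expression is an assumption based only on the messages quoted above, and `count_refusals` is a hypothetical helper name, not part of any Lustre tool.

```python
import errno
import re
from collections import Counter

# Pattern derived only from the MDS messages quoted in this post;
# other Lustre versions may word the message differently.
# \s+ is used between words because the pasted log lines are wrapped.
REFUSED = re.compile(
    r"Client\s+(?P<uuid>[0-9a-f-]+)\s+\(at\s+(?P<nid>\S+?)\)\s+"
    r"refused\s+reconnection,\s+still\s+busy\s+with\s+(?P<rpcs>\d+)\s+active\s+RPCs"
)


def count_refusals(log_text):
    """Return a Counter mapping (client UUID, NID) to refused reconnections."""
    counts = Counter()
    for match in REFUSED.finditer(log_text):
        counts[(match.group("uuid"), match.group("nid"))] += 1
    return counts


# Sample text copied from the MDS log above, including the line wrap.
sample = (
    "Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at\n"
    "10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs\n"
)

# Symbolic name of the errno behind the client-side "failed with -16".
print(errno.errorcode[16])
print(count_refusals(sample))
```

Reading the whole log file into one string before matching keeps wrapped messages intact; matching line by line would miss them.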
Can you give me some advice on this issue?
In the HPDD Jira I found LU-793 (
https://jira.hpdd.intel.com/browse/LU-793), but many other tickets also
seem to relate to my case.
Do you think LU-793 is the right one?
In that ticket, Peter lists three patches:
http://review.whamcloud.com/#/c/9209/
http://review.whamcloud.com/#/c/9210/
http://review.whamcloud.com/#/c/9211/
Are they production-ready for the 2.4 release?
Do you think there is another way to solve this?
Cheers, Jaime