Dear Lustre Experts,


We currently run the Lustre 2.4.2 maintenance release (Whamcloud build) on RHEL 6.4.

We have an issue on the client side: after a request times out, the client cannot reconnect because the MDS refuses the reconnection.

So far, we have had to reboot the MDS to recover the client.


Client side:


Lustre: 8863:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1398453761/real 1398453761]  req@ffff88009e281800 x1465479219941096/t0(0) o101->data1-MDT0000-mdc-ffff880335a7bc00@10.64.18.12@tcp1:12/10 lens 576/1136 e 5 to 1 dl 1398454684 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1

Lustre: data1-MDT0000-mdc-ffff880335a7bc00: Connection to data1-MDT0000 (at 10.64.18.12@tcp1) was lost; in progress operations using this service will wait for recovery to complete

LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with 10.64.18.12@tcp1, operation mds_connect failed with -16.

LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with 10.64.18.12@tcp1, operation mds_connect failed with -16.

LustreError: Skipped 1 previous similar message

LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with 10.64.18.12@tcp1, operation mds_connect failed with -16.

LustreError: Skipped 2 previous similar messages

LustreError: 26416:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (data1-MDT0000-mdc-ffff880335a7bc00), error -16

LustreError: 26416:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -16

LustreError: 11-0: data1-MDT0000-mdc-ffff880335a7bc00: Communicating with 10.64.18.12@tcp1, operation mds_connect failed with -16.

LustreError: Skipped 5 previous similar messages

LustreError: 26474:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (data1-MDT0000-mdc-ffff880335a7bc00), error -16

LustreError: 26474:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -16
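
For reference, the -16 in the client messages above can be decoded as a negated Linux errno value. The short sketch below assumes the usual kernel convention (return code = -errno) and uses only the Python standard library; the variable names are just for illustration.

```python
import errno
import os

# Return code reported by mds_connect in the client log above.
rc = -16

# Assumption: the value follows the kernel convention of returning -errno,
# so the symbolic name is looked up from the absolute value.
name = errno.errorcode.get(-rc, "unknown")

# os.strerror() gives the platform's human-readable description.
print(f"rc={rc} -> {name}: {os.strerror(-rc)}")
```

On Linux this maps to EBUSY, which matches the "still busy with 1 active RPCs" message logged by the MDS.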

 

MDS side:

Lustre: 18484:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
  req@ffff8800514b2400 x1465479219941096/t0(0) o101->60155cc5-7c3a-a0af-08a5-19451109c288@10.64.18.11@tcp1:0/0 lens 576/1152 e 5 to 0 dl 1398454573 ref 2 fl Interpret:/0/0 rc 0/0
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) reconnecting
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: data1-MDT0000: Client 60155cc5-7c3a-a0af-08a5-19451109c288 (at 10.64.18.11@tcp1) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 1 previous similar message
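
To check that both sides are talking about the same request, the request ID (the x... field) and the opcode (the o... field) can be pulled out of the two request dumps above and compared. This is only a sketch: the regular expression is my own guess at the field layout based on these two lines, and parse_request is a hypothetical helper, not part of any Lustre tool. As far as I understand, opcode 101 is LDLM_ENQUEUE.

```python
import re

# Excerpts copied from the client and MDS log lines above.
client_line = (
    "req@ffff88009e281800 x1465479219941096/t0(0) "
    "o101->data1-MDT0000-mdc-ffff880335a7bc00@10.64.18.12@tcp1:12/10 "
    "lens 576/1136 e 5 to 1"
)
mds_line = (
    "req@ffff8800514b2400 x1465479219941096/t0(0) "
    "o101->60155cc5-7c3a-a0af-08a5-19451109c288@10.64.18.11@tcp1:0/0 "
    "lens 576/1152 e 5 to 0"
)

# Assumed layout: "x<xid>/t<transno>(<n>) o<opcode>->"
REQ_PATTERN = re.compile(r"\bx(\d+)/t\d+\(\d+\)\s+o(\d+)->")


def parse_request(line):
    """Return (xid, opcode) from a request dump, or None if not found."""
    match = REQ_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


client_req = parse_request(client_line)
mds_req = parse_request(mds_line)

# Same xid and opcode on both sides means the request that timed out on the
# client is the one the MDS could not send an early reply for.
print("client:", client_req)
print("mds:   ", mds_req)
print("same request:", client_req == mds_req)
```

In our case both lines carry the same request ID and opcode, so the RPC that timed out on the client appears to be the one the MDS still counts as active.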

Can you give me some advice on this issue?


In the HPDD Jira, I found this issue: LU-793 (https://jira.hpdd.intel.com/browse/LU-793), but many other tickets also seem to relate to my case.


Do you think LU-793 is the right one?


In that ticket, Peter lists three patches:

http://review.whamcloud.com/#/c/9209/
http://review.whamcloud.com/#/c/9210/
http://review.whamcloud.com/#/c/9211/

Are they production-ready for the 2.4 release?


Do you think there is another way to solve this, for example one that avoids rebooting the MDS?


Cheers, Jaime