I found out how to disable "panic_on_lbug". It seems the culprit is LFSCK:
Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
Sep 13 15:10:28 n00a kernel: [ 8414.619835]
Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
Sep 13 15:10:28 n00a kernel: [ 8414.619850] [<ffffffffa0224822>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
Sep 13 15:10:28 n00a kernel: [ 8414.619857] [<ffffffffa0224db2>] lbug_with_loc+0x42/0xa0 [libcfs]
Sep 13 15:10:28 n00a kernel: [ 8414.619864] [<ffffffffa0b11890>] osd_trans_start+0x250/0x630 [osd_ldiskfs]
Sep 13 15:10:28 n00a kernel: [ 8414.619870] [<ffffffffa0b0e748>] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
Sep 13 15:10:28 n00a kernel: [ 8414.619876] [<ffffffffa0c6ffc7>] lod_trans_start+0x177/0x200 [lod]
Sep 13 15:10:28 n00a kernel: [ 8414.619881] [<ffffffffa0cbd752>] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
Sep 13 15:10:28 n00a kernel: [ 8414.619888] [<ffffffff8136741b>] ? thread_return+0x3e/0x10c
Sep 13 15:10:28 n00a kernel: [ 8414.619894] [<ffffffff81038b87>] ? enqueue_task_fair+0x58/0x5d
Sep 13 15:10:28 n00a kernel: [ 8414.619899] [<ffffffffa0cb68ea>] lfsck_double_scan+0x5a/0x70 [lfsck]
Sep 13 15:10:28 n00a kernel: [ 8414.619904] [<ffffffffa0cb7dfd>] lfsck_master_engine+0x50d/0x650 [lfsck]
Sep 13 15:10:28 n00a kernel: [ 8414.619909] [<ffffffffa0cb78f0>] ? lfsck_master_engine+0x0/0x650 [lfsck]
Sep 13 15:10:28 n00a kernel: [ 8414.619915] [<ffffffff810534c4>] kthread+0x7b/0x83
Sep 13 15:10:28 n00a kernel: [ 8414.619918] [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
Sep 13 15:10:28 n00a kernel: [ 8414.619924] [<ffffffff8101092a>] child_rip+0xa/0x20
Sep 13 15:10:28 n00a kernel: [ 8414.619928] [<ffffffff81053449>] ? kthread+0x0/0x83
Sep 13 15:10:28 n00a kernel: [ 8414.619931] [<ffffffff81010920>] ? child_rip+0x0/0x20
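For reference, a minimal sketch of switching "panic_on_lbug" off at runtime. It assumes the tunable is exposed as /proc/sys/lnet/panic_on_lbug (next to the debug file mentioned further down), which may differ on other Lustre versions. The helper name set_panic_on_lbug is purely illustrative and not part of Lustre:

```shell
# set_panic_on_lbug VALUE [CTL_FILE]
# Writes VALUE (0 = do not panic the node on LBUG, 1 = panic) to the tunable
# and prints the value read back.
# ASSUMPTION: the default path below is where this Lustre version exposes
# the setting; pass another path as the second argument if it is elsewhere.
set_panic_on_lbug() {
    value="$1"
    ctl="${2:-/proc/sys/lnet/panic_on_lbug}"

    if [ ! -w "$ctl" ]; then
        echo "cannot write to $ctl (modules not loaded, or not root?)" >&2
        return 1
    fi

    echo "$value" > "$ctl"
    cat "$ctl"
}
```

Usage, as root on the MDS and before mounting the MDT: set_panic_on_lbug 0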
What I don't understand is that:
1. I had the LFSCK launched in "dry-run" mode:
lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
2. The LFSCK was reported as completed before the LBUG popped up; now, I can't even get
any output:
cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace # just hangs there indefinitely
I remember seeing an lfsck_namespace file in the MDT's underlying LDISKFS; is there anything
sensible I can do with it?
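To keep the shell from blocking on that read, here is a small sketch that wraps it in a time limit. It assumes GNU coreutils "timeout" is available on the MDS. The helper name read_with_timeout is illustrative only. Note that if the reader ends up in uninterruptible kernel sleep, the signal sent by "timeout" may not be able to end it, so this is not guaranteed to return:

```shell
# read_with_timeout FILE [SECONDS]
# Prints FILE, giving up after SECONDS (default 10).
# Returns timeout's exit status: 124 means the read was cut short.
read_with_timeout() {
    file="$1"
    secs="${2:-10}"

    timeout "$secs" cat "$file"
    rc=$?

    if [ "$rc" -eq 124 ]; then
        echo "read of $file timed out after ${secs}s" >&2
    fi
    return "$rc"
}
```

Usage: read_with_timeout /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace 10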
Thanks and best,
Cédric
On 13/09/16 13:04, Cédric Dufour - Idiap Research Institute wrote:
Hello,
Last Friday, during normal operations, our MDS froze with the following
LBUG, which happens again as soon as one mounts the MDT again:
Sep 13 12:45:58 n00a kernel: [ 1002.705346] Lustre: lustre-1-MDT0000: used disk, loading
Sep 13 12:45:58 n00a kernel: [ 1002.741484] LustreError: 6265:0:(sec_config.c:1121:sptlrpc_target_local_read_conf()) missing llog context
Sep 13 12:46:00 n00a kernel: [ 1004.771365] LustreError: 11-0: lustre-1-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Sep 13 12:46:00 n00a kernel: [ 1004.783359] Lustre: lustre-1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Sep 13 12:46:00 n00a kernel: [ 1005.073160] Lustre: lustre-1-MDT0000: Will be in recovery for at least 2:30, or until 179 clients reconnect
Sep 13 12:46:05 n00a kernel: [ 1010.228502] LustreError: 6307:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
Sep 13 12:46:05 n00a kernel: [ 1010.240617] LustreError: 6307:0:(osd_handler.c:936:osd_trans_start()) LBUG
Our setup is Lustre 2.5.2, with the following debug classes enabled:
n00a:~ # cat /proc/sys/lnet/debug
ioctl neterror warning error emerg ha config console
I've had a look at:
- https://jira.hpdd.intel.com/browse/LU-6556
- https://jira.hpdd.intel.com/browse/LU-6634
- https://jira.hpdd.intel.com/browse/LU-7138
but:
- changelog_* files have 0 bytes and proper root permissions
- as far as I am able to tell, we have no Changelog actually registered
The node freezes as soon as the LBUG happens and no debug log gets
written to /tmp.
Based on the console output (see above), there is no preceding error
that might explain how we stumble on that LBUG.
I've run a file-system check on the corresponding ldiskfs device; errors
were fixed but a second dry-run reported nothing dangling.
What can I do to solve that situation?
Best regards,
Cédric