Hi all,

Lately I have been seeing a lot of kernel oopses. They seem to happen when users try to list files on Lustre.

We recently updated our version of Debian. However, the problem seems to occur only on the cluster's submit node, the one users have CLI access to; the other nodes are not affected.


A trace of one such event is included further down. I am not sure how to troubleshoot from here.
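
Since only one event is shown here, a rough sketch like the following could tally every oops in a kernel log by faulting symbol, to check whether they all land in the same function. It assumes each oops has an `IP: [<addr>] symbol+0xoff/0xlen [module]` line shaped like the one in the trace below. The name `tally_oops_symbols` and the "vmlinux" label for symbols without a module tag are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: IP: [<ffffffffa182467e>] ll_get_dir_page+0x7a7/0xf21 [lustre]
# The "RIP" summary lines are deliberately not matched, so each oops counts once.
IP_LINE = re.compile(
    r"kernel: IP: \[<[0-9a-f]+>\] (?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def tally_oops_symbols(lines):
    """Count oopses per (module, symbol) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = IP_LINE.search(line)
        if match:
            # Symbols without a [module] tag belong to the core kernel image;
            # "vmlinux" is just a placeholder label for that case.
            module = match.group("mod") or "vmlinux"
            counts[(module, match.group("sym"))] += 1
    return counts
```

Feeding it the lines of the kernel log (for example an open file object) returns a `Counter` keyed by module and symbol. A single dominant entry would suggest that the oopses share one code path.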

Technical details:
OS: Debian testing/sid mix, 64-bit
Kernel: 3.17.1 (kernel.org, locally compiled)
Lustre version: 2.5.3

What we did so far:
- I switched the submit host role between two nodes (by changing DNS) to see whether it was a hardware problem; the problem followed the submit host to its new location.

I am considering recompiling the client, since it was compiled under our previous Debian freeze (as far as I remember). However, the problem seems to be in the kernel, which is independent of our Debian freezes.

Thanks,
Eli



Feb  2 15:43:04 hm-01 kernel: BUG: unable to handle kernel paging request at 0000003212ced00a
Feb  2 15:43:04 hm-01 kernel: IP: [<ffffffffa182467e>] ll_get_dir_page+0x7a7/0xf21 [lustre]
Feb  2 15:43:04 hm-01 kernel: PGD f1c419067 PUD 0
Feb  2 15:43:04 hm-01 kernel: Oops: 0000 [#1] SMP
Feb  2 15:43:04 hm-01 kernel: Modules linked in: lmv(C) fld(C) mgc(C) lustre(C) mdc(C) fid(C) lov(C) osc(C) ptlrpc(C) obdclass(C) lvfs(C) binfmt_misc ko2iblnd(C) evdev joydev lnet(C) sha512_generic intel_rapl x86_pkg_temp_thermal coretemp sha256_generic kvm sb_edac ipmi_si processor ipmi_msghandler dcdbas edac_core crc32_pclmul pcspkr sg wmi button sha1_ssse3 sha1_generic crc32 mei_me mei libcfs(C) fuse parport_pc lp parport dm_crypt dm_mod autofs4
Feb  2 15:43:04 hm-01 kernel: CPU: 2 PID: 2883 Comm: csh Tainted: G         C     3.17.1-aufs-2 #1
Feb  2 15:43:04 hm-01 kernel: Hardware name: Dell Inc. PowerEdge C6220/0TTH1R, BIOS 1.2.1 05/27/2013
Feb  2 15:43:04 hm-01 kernel: task: ffff880857a00000 ti: ffff880f1feec000 task.ti: ffff880f1feec000
Feb  2 15:43:04 hm-01 kernel: RIP: 0010:[<ffffffffa182467e>]  [<ffffffffa182467e>] ll_get_dir_page+0x7a7/0xf21 [lustre]
Feb  2 15:43:04 hm-01 kernel: RSP: 0018:ffff880f1feefcf8  EFLAGS: 00010002
Feb  2 15:43:04 hm-01 kernel: RAX: 0000000000000001 RBX: ffff88083e313bc8 RCX: 0000000000000000
Feb  2 15:43:04 hm-01 kernel: RDX: 0000003212ced00a RSI: ffff880f1feefcb0 RDI: ffff88087f0dd6c8
Feb  2 15:43:04 hm-01 kernel: RBP: ffff880f1feefdf0 R08: 0000000000000000 R09: 000000000000003c
Feb  2 15:43:04 hm-01 kernel: R10: 0000000000000000 R11: ffff8806a4733b48 R12: fffffffffffffffe
Feb  2 15:43:04 hm-01 kernel: R13: 0000000000000000 R14: 0000003212ced00a R15: ffff88083e313d10
Feb  2 15:43:04 hm-01 kernel: FS:  00007f3093d57700(0000) GS:ffff88087fa40000(0000) knlGS:0000000000000000
Feb  2 15:43:04 hm-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  2 15:43:04 hm-01 kernel: CR2: 0000003212ced00a CR3: 000000104f23c000 CR4: 00000000000407e0
Feb  2 15:43:04 hm-01 kernel: Stack:
Feb  2 15:43:04 hm-01 kernel: 0000000000018800 ffff88083e313d10 fffffffffffffffe ffff88083e313e40
Feb  2 15:43:04 hm-01 kernel: 6dbf322dc805b051 0000000000000000 ffff88083e313bc8 0000000000000000
Feb  2 15:43:04 hm-01 kernel: 0000000000000000 ffff880f1feefe30 ffff880f1feeff20 0000000000000002
Feb  2 15:43:04 hm-01 kernel: Call Trace:
Feb  2 15:43:04 hm-01 kernel: [<ffffffffa1824e80>] ll_dir_read+0x88/0x2a5 [lustre]
Feb  2 15:43:04 hm-01 kernel: [<ffffffffa1825160>] ll_readdir+0xc3/0x1dd [lustre]
Feb  2 15:43:04 hm-01 kernel: [<ffffffff81105cdd>] iterate_dir+0x86/0x10a
Feb  2 15:43:04 hm-01 kernel: [<ffffffff811060ec>] SyS_getdents+0x86/0xdb
Feb  2 15:43:04 hm-01 kernel: [<ffffffff81105e28>] ? fillonedir+0xc7/0xc7
Feb  2 15:43:04 hm-01 kernel: [<ffffffff81816eed>] system_call_fastpath+0x1a/0x1f
Feb  2 15:43:04 hm-01 kernel: [<ffffffff81816eed>] ? system_call_fastpath+0x1a/0x1f
Feb  2 15:43:04 hm-01 kernel: Code: ff ff e8 5a 22 ff df 48 8b 95 18 ff ff ff 49 8d 7f 08 4c 89 f6 b9 01 00 00 00 e8 40 79 af df 85 c0 0f 8e 0d 01 00 00 4c 8b 75 90 <49> 8b 06 f6 c4 80 75 07 f0 41 ff 46 1c eb 0c 4c 89 f7 e8 9a 57
Feb  2 15:43:04 hm-01 kernel: RIP  [<ffffffffa182467e>] ll_get_dir_page+0x7a7/0xf21 [lustre]
Feb  2 15:43:04 hm-01 kernel: RSP <ffff880f1feefcf8>
Feb  2 15:43:04 hm-01 kernel: CR2: 0000003212ced00a
Feb  2 15:43:04 hm-01 kernel: ---[ end trace 20a12192acce9089 ]---
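
One thing that can be read straight from the trace above is which registers hold the faulting address reported in CR2, and whether that address even looks like a kernel pointer. The sketch below is a minimal helper for that, not a diagnosis. It assumes x86_64 with the usual layout in which kernel addresses sit at or above 0xffff800000000000, and the function names are hypothetical.

```python
import re

# Matches "RAX: 0000000000000001", "R08: ...", "CR2: ..." and similar pairs.
# "RSP: 0018:..." and "RIP: 0010:[<...>]" are skipped because their value
# is not a bare 16-digit hex number.
REG = re.compile(r"\b(R[A-Z0-9]{1,2}|CR2): ([0-9a-f]{16})\b")


def registers_holding_fault_address(lines):
    """Return (fault_address, sorted names of registers equal to it)."""
    regs = {}
    for line in lines:
        for name, value in REG.findall(line):
            regs[name] = int(value, 16)
    fault = regs.pop("CR2", None)
    if fault is None:
        return None, []
    return fault, sorted(name for name, value in regs.items() if value == fault)


def in_kernel_half(addr):
    """True if addr lies in the upper (kernel) half of the x86_64 address space."""
    return addr >= 0xFFFF800000000000
```

Run over the register dump above, this reports RDX and R14 as holding the CR2 value 0x3212ced00a, which is not in the kernel half. That would be consistent with the code dereferencing a bad pointer rather than hitting a missing mapping for valid kernel memory, though it does not say where the bad value came from.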