I have the same problem with two of my Lustre file systems running version 2.4
on RHEL 6.4. I haven't had time to deal with it, so for now my plan is
to physically restart the machine after it hangs. Not the kind of thing
you like to see.
Rob Stites
Center for Spoken Language and Understanding
Research Associate
Phone: (503) 346-3764
Email: stites(a)ohsu.edu
Oregon Health & Science University (OHSU)
3181 SW Sam Jackson Park Rd. GH40
Portland, OR 97239-3098
On 12/17/13 6:22 AM, "Andrew Wagner" <andrew.wagner(a)ssec.wisc.edu> wrote:
Hello all,
I've recently started working with Lustre and am setting up a couple of new
filesystems on RHEL 6.4 w/ Lustre 2.4 from the ZFS repository (we're
using Lustre on ZFS) with an InfiniBand networking infrastructure using
OpenIB from the Red Hat repositories.
I've run into a problem and am curious whether anyone else has
seen it. When shutting down machines with Lustre OSTs mounted on
them, the default shutdown scripts cause a hang when the OpenIB modules
begin to unload. This is due to the Lustre/LNET stop scripts not
completely unloading Lustre modules. While investigating, I discovered
that the following sequence would successfully unload the Lustre modules
such that IB modules could also unload:
1. Stop Lustre
2. Stop LNET (Outputs "ERROR: Module osc has non-zero reference count.")
3. Run lustre_rmmod (Outputs "Modules still loaded:
lnet/klnds/o2iblnd/ko2iblnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o")
4. Stop LNET again to unload the three remaining modules.
I've written this into a shutdown script, which works as a solution, but
does not address the underlying problem.
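For anyone who wants to try the same workaround, here is a minimal sketch of
the four steps above in Python. It is not the actual shutdown script. The
"service lustre stop" and "service lnet stop" invocations are assumptions
about how the init scripts are called on a stock RHEL 6 install, and the
names UNLOAD_SEQUENCE, run_command and unload_lustre are made up for
illustration.

```python
import subprocess

# Assumed commands for the four steps. The init script names are a guess
# based on a stock RHEL 6 install; adjust them for your site.
UNLOAD_SEQUENCE = [
    ["service", "lustre", "stop"],  # 1. stop Lustre
    ["service", "lnet", "stop"],    # 2. stop LNET (may report the osc refcount error)
    ["lustre_rmmod"],               # 3. unload whatever Lustre modules it can
    ["service", "lnet", "stop"],    # 4. stop LNET again to unload the remainder
]


def run_command(cmd):
    """Run one command and return its exit status (127 if it cannot be started)."""
    try:
        return subprocess.call(cmd)
    except OSError:
        return 127


def unload_lustre(run=run_command):
    """Run every step in order, continuing even when a step reports an error.

    Step 2 is expected to fail, so a non-zero status does not abort the
    sequence. Returns a list of (command, exit_status) pairs for logging.
    """
    results = []
    for cmd in UNLOAD_SEQUENCE:
        results.append((cmd, run(cmd)))
    return results
```

Calling unload_lustre() before the OpenIB service stops should reproduce the
manual sequence. The run parameter exists only so the ordering can be checked
with a stand-in function instead of touching real modules.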
Has anyone else seen this behavior?
--
Andrew Wagner
Research System Administrator
Technical Computing
UW-Space Science and Engineering
AOSS Room 439
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss