On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu(a)oracle.com> wrote:
I'm able to reproduce the panic below by running two sets of ndctl
commands that actually serve a legitimate purpose in parallel (unlike
the brute-force experiment earlier), each set in an indefinite loop.
This time it takes about an hour to panic. But I gather the cause
is probably the same: I've overlapped ndctl commands on the same
Could we add a check in nd_ioctl(), such that if there is
an ongoing ndctl command on a region, any subsequent ndctl request
will fail immediately with something to the effect of EAGAIN?
The rationale being that the kernel should protect itself against
We do already have locking in the driver to prevent configuration
collisions. The problem looks to be broken assumptions about running
the device unregistration path in a separate thread outside the lock.
I suspect it may be incorrect assumptions about the userspace
visibility of the device relative to teardown actions. To be clear,
this isn't the nd_ioctl() path; this is the sysfs path.
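
For illustration only, the fail-fast behavior asked about above (reject a
second command on a busy region with EAGAIN instead of waiting) could be
sketched in userspace C roughly as follows. This is not driver code: the
names demo_region, DEMO_REGION_INIT, demo_region_begin_cmd() and
demo_region_end_cmd() are hypothetical, and a C11 atomic flag stands in
for whatever locking the kernel actually uses. As noted above, the crash
appears to be in the sysfs path, so a check like this in nd_ioctl() would
not necessarily address it.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a region; NOT the kernel's struct nd_region. */
struct demo_region {
	atomic_flag busy;	/* set while a command is in flight */
};

#define DEMO_REGION_INIT { ATOMIC_FLAG_INIT }

/*
 * Try to start a command on the region.
 * Returns 0 if the caller now owns the region, or -EAGAIN if another
 * command is already in flight. The caller never blocks.
 */
static int demo_region_begin_cmd(struct demo_region *r)
{
	/* test_and_set returns the previous value: true means already busy */
	if (atomic_flag_test_and_set(&r->busy))
		return -EAGAIN;
	return 0;
}

/* Mark the region idle again; call only after a successful begin. */
static void demo_region_end_cmd(struct demo_region *r)
{
	atomic_flag_clear(&r->busy);
}
```

A caller would wrap each command in a begin/end pair and retry, or report
the error, when begin returns -EAGAIN.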
Also, since the subject fix is for a different problem, and has been
verified, I'm happy to see it go upstream, so we have a better
code base to dig deeper into how the destructive ndctl
commands interact with typical mission-critical applications, including
but not limited to RDMA.
Right, the crash signature you are seeing looks unrelated to the issue
being addressed in these patches, which is device teardown racing active
page pins. I'll start the investigation on the crash signature, but
again I don't think it reads on this fix series.