On Mon, 2018-11-26 at 18:11 -0800, Dan Williams wrote:
On Thu, Nov 8, 2018 at 10:07 AM Alexander Duyck
<alexander.h.duyck(a)linux.intel.com> wrote:
>
> Move the async_synchronize_full call out of __device_release_driver and
> into driver_detach.
>
> The idea behind this is that the async_synchronize_full call will only
> guarantee that any existing async operations are flushed. This doesn't do
> anything to guarantee that a hotplug event that may occur while we are
> doing the release of the driver will not be asynchronously scheduled.
>
> By moving this into the driver_detach path we can avoid potential deadlocks
> as we aren't holding the device lock at this point and we should not have
> the driver we want to flush loaded so the flush will take care of any
> asynchronous events the driver we are detaching might have scheduled.
>
What problem is this patch solving in practice, because if there were
drivers issuing async work from probe they would need to be
responsible for flushing it themselves. That said it seems broken that
the async probing infrastructure takes the device_lock inside
async_schedule and then holds the lock when calling
async_syncrhonize_full. Is it just luck that this hasn't caused
deadlocks in practice?
My understanding is that it has caused some deadlocks. There was
another patch set that Bart Van Assche had submitted that was
addressing this. I just tweaked my patch set to address both the issues
he had seen as well as the performance improvements included in my
original patch set.
Given that the device_lock is hidden from lockdep I think it would
be
helpful to have a custom lock_map_acquire() setup, similar to the
workqueue core, to try to keep the locking rules enforced /
documented.
The only documentation I can find for async-probe deadlock avoidance
is the comment block in do_init_module() for async_probe_requested.
Would it make sense to just add any lockdep or deadlock documentation
as a seperate patch? I can work on it but I am not sure it makes sense
to add to this patch since there is a chance this one will need to be
backported to stable at some point.
Stepping back a bit, does this patch have anything to do with the
performance improvement, or is it a separate "by the way I also found
this" kind of patch?
This is more of a seperate "by the way" type of patch based on the
discussion Bart and I had about how to best address the issue. There
may be some improvement since we only call async_synchronize_full once
and only when we are removing the driver, but I don't think it would be
very noticable.