On Tue, Apr 09, 2019 at 12:45:25PM -0400, Mathieu Desnoyers wrote:
----- On Apr 9, 2019, at 12:40 PM, paulmck paulmck(a)linux.ibm.com
wrote:
> On Tue, Apr 09, 2019 at 11:56:03AM -0400, Mathieu Desnoyers wrote:
>> ----- On Apr 9, 2019, at 11:40 AM, Joel Fernandes, Google
joel(a)joelfernandes.org
>> wrote:
>>
>> > On Mon, Apr 08, 2019 at 01:24:47PM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Apr 8, 2019, at 11:46 AM, paulmck paulmck(a)linux.ibm.com
wrote:
>> >>
>> >> > On Mon, Apr 08, 2019 at 10:49:32AM -0400, Mathieu Desnoyers
wrote:
>> >> >> ----- On Apr 8, 2019, at 10:22 AM, paulmck
paulmck(a)linux.ibm.com wrote:
>> >> >>
>> >> >> > On Mon, Apr 08, 2019 at 09:05:34AM -0400, Mathieu
Desnoyers wrote:
>> >> >> >> ----- On Apr 7, 2019, at 10:27 PM, paulmck
paulmck(a)linux.ibm.com wrote:
>> >> >> >>
>> >> >> >> > On Sun, Apr 07, 2019 at 09:07:18PM +0000, Joel
Fernandes wrote:
>> >> >> >> >> On Sun, Apr 07, 2019 at 04:41:36PM -0400,
Mathieu Desnoyers wrote:
>> >> >> >> >> >
>> >> >> >> >> > ----- On Apr 7, 2019, at 3:32 PM, Joel
Fernandes, Google joel(a)joelfernandes.org
>> >> >> >> >> > wrote:
>> >> >> >> >> >
>> >> >> >> >> > > On Sun, Apr 07, 2019 at 03:26:16PM
-0400, Mathieu Desnoyers wrote:
>> >> >> >> >> > >> ----- On Apr 7, 2019, at 9:59
AM, paulmck paulmck(a)linux.ibm.com wrote:
>> >> >> >> >> > >>
>> >> >> >> >> > >> > On Sun, Apr 07, 2019 at
06:39:41AM -0700, Paul E. McKenney wrote:
>> >> >> >> >> > >> >> On Sat, Apr 06, 2019
at 07:06:13PM -0400, Joel Fernandes wrote:
>> >> >> >> >> > >> >
>> >> >> >> >> > >> > [ . . . ]
>> >> >> >> >> > >> >
>> >> >> >> >> > >> >> > > diff --git
a/include/asm-generic/vmlinux.lds.h
>> >> >> >> >> > >> >> > >
b/include/asm-generic/vmlinux.lds.h
>> >> >> >> >> > >> >> > > index
f8f6f04c4453..c2d919a1566e 100644
>> >> >> >> >> > >> >> > > ---
a/include/asm-generic/vmlinux.lds.h
>> >> >> >> >> > >> >> > > +++
b/include/asm-generic/vmlinux.lds.h
>> >> >> >> >> > >> >> > > @@ -338,6
+338,10 @@
>> >> >> >> >> > >> >> > >
KEEP(*(__tracepoints_ptrs)) /* Tracepoints: pointer array */ \
>> >> >> >> >> > >> >> > >
__stop___tracepoints_ptrs = .; \
>> >> >> >> >> > >> >> > >
*(__tracepoints_strings)/* Tracepoints: strings */ \
>> >> >> >> >> > >> >> > > + . =
ALIGN(8); \
>> >> >> >> >> > >> >> > >
+ __start___srcu_struct = .; \
>> >> >> >> >> > >> >> > >
+ *(___srcu_struct_ptrs) \
>> >> >> >> >> > >> >> > >
+ __end___srcu_struct = .; \
>> >> >> >> >> > >> >> > >
} \
>> >> >> >> >> > >> >> >
>> >> >> >> >> > >> >> > This vmlinux
linker modification is not needed. I tested without it and srcu
>> >> >> >> >> > >> >> > torture works
fine with rcutorture built as a module. Putting further prints
>> >> >> >> >> > >> >> > in
kernel/module.c verified that the kernel is able to find the srcu structs
>> >> >> >> >> > >> >> > just fine. You
could squash the below patch into this one or apply it on top
>> >> >> >> >> > >> >> > of the dev
branch.
>> >> >> >> >> > >> >>
>> >> >> >> >> > >> >> Good point, given
that otherwise FORTRAN named common blocks would not
>> >> >> >> >> > >> >> work.
>> >> >> >> >> > >> >>
>> >> >> >> >> > >> >> But isn't one
advantage of leaving that stuff in the RO_DATA_SECTION()
>> >> >> >> >> > >> >> macro that it can be
mapped read-only? Or am I suffering from excessive
>> >> >> >> >> > >> >> optimism?
>> >> >> >> >> > >> >
>> >> >> >> >> > >> > And to answer the other
question, in the case where I am suffering from
>> >> >> >> >> > >> > excessive optimism, it
should be a separate commit. Please see below
>> >> >> >> >> > >> > for the updated original
commit thus far.
>> >> >> >> >> > >> >
>> >> >> >> >> > >> > And may I have your
Tested-by?
>> >> >> >> >> > >>
>> >> >> >> >> > >> Just to confirm: does the
cleanup performed in the modules going
>> >> >> >> >> > >> notifier end up acting as a
barrier first before freeing the memory ?
>> >> >> >> >> > >> If not, is it explicitly
stated that a barrier must be issued before
>> >> >> >> >> > >> module unload ?
>> >> >> >> >> > >>
>> >> >> >> >> > >
>> >> >> >> >> > > You mean rcu_barrier? It is
mentioned in the documentation that this is the
>> >> >> >> >> > > responsibility of the module
writer to prevent delays for all modules.
>> >> >> >> >> >
>> >> >> >> >> > It's a srcu barrier yes.
Considering it would be a barrier specific to the
>> >> >> >> >> > srcu domain within that module, I
don't see how it would cause delays for
>> >> >> >> >> > "all" modules if we
implicitly issue the barrier on module unload. What
>> >> >> >> >> > am I missing ?
>> >> >> >> >>
>> >> >> >> >> Yes you are right. I thought of this after I
just sent my email. I think it
>> >> >> >> >> makes sense for srcu case to do and could
avoid a class of bugs.
>> >> >> >> >
>> >> >> >> > If there are call_srcu() callbacks outstanding,
the module writer still
>> >> >> >> > needs the srcu_barrier() because otherwise
callbacks arrive after
>> >> >> >> > the module text has gone, which will be
disappoint the CPU when it
>> >> >> >> > tries fetching instructions that are no longer
mapped. If there are
>> >> >> >> > no call_srcu() callbacks from that module, then
there is no need for
>> >> >> >> > srcu_barrier() either way.
>> >> >> >> >
>> >> >> >> > So if an srcu_barrier() is needed, the module
developer needs to
>> >> >> >> > supply it.
>> >> >> >>
>> >> >> >> When you say "callbacks arrive after the module
text has gone",
>> >> >> >> I think you assume that free_module() is invoked
before the
>> >> >> >> MODULE_STATE_GOING notifiers are called. But it's
done in the
>> >> >> >> opposite order: going notifiers are called first, and
then
>> >> >> >> free_module() is invoked.
>> >> >> >>
>> >> >> >> So AFAIU it would be safe to issue the srcu_barrier()
from the module
>> >> >> >> going notifier.
>> >> >> >>
>> >> >> >> Or am I missing something ?
>> >> >> >
>> >> >> > We do seem to be talking past each other. ;-)
>> >> >> >
>> >> >> > This has nothing to do with the order of events at
module-unload time.
>> >> >> >
>> >> >> > So please let me try again.
>> >> >> >
>> >> >> > If a given srcu_struct in a module never has call_srcu()
invoked, there
>> >> >> > is no need to invoke rcu_barrier() at any time, whether
at module-unload
>> >> >> > time or not. Adding rcu_barrier() in this case adds
overhead and latency
>> >> >> > for no good reason.
>> >> >>
>> >> >> Not if we invoke srcu_barrier() for that specific domain. If
>> >> >> call_srcu was never invoked for a srcu domain, I don't see
why
>> >> >> srcu_barrier() should be more expensive than a simple check
that
>> >> >> the domain does not have any srcu work queued.
>> >> >
>> >> > But that simple check does involve a cache miss for each possible
CPU (not
>> >> > just each online CPU), so it is non-trivial, especially on large
systems.
>> >> >
>> >> >> > If a given srcu_struct in a module does have at least one
call_srcu()
>> >> >> > invoked, it is already that module's responsibility
to make sure that
>> >> >> > the code sticks around long enough for the callback to be
invoked.
>> >> >>
>> >> >> I understand that when users do explicit dynamic
allocation/cleanup of
>> >> >> srcu domains, they indeed need to take care of doing explicit
srcu_barrier().
>> >> >> However, if they do static definition of srcu domains, it
would be nice
>> >> >> if we can handle the barriers under the hood.
>> >> >
>> >> > All else being equal, of course. But...
>> >> >
>> >> >> > This means that correct SRCU users that invoke
call_srcu() already
>> >> >> > have srcu_barrier() at module-unload time. Incorrect
SRCU users, with
>> >> >> > reasonable probability, now get a WARN_ON() at
module-unload time, with
>> >> >> > the per-CPU state getting leaked. Before this change,
they would (also
>> >> >> > with reasonable probability) instead get an
instruction-fetch fault when
>> >> >> > the SRCU callback was invoked after the completion of the
module unload.
>> >> >> > Furthermore, in all cases where they would previously
have gotten the
>> >> >> > instruction-fetch fault, they now get the WARN_ON(), like
this:
>> >> >> >
>> >> >> > if
(WARN_ON(rcu_segcblist_n_cbs(&sdp->srcu_cblist)))
>> >> >> > return; /* Forgot srcu_barrier(), so just leak it! */
>> >> >> >
>> >> >> > So this change already represents an improvement in
usability.
>> >> >>
>> >> >> Considering that we can do a srcu_barrier() for the specific
domain,
>> >> >> and that it should add no noticeable overhead if there is no
queued
>> >> >> callbacks, I don't see a good reason for leaving the
srcu_barrier
>> >> >> invocation to the user rather than implicitly doing it from
the
>> >> >> module going notifier.
>> >> >
>> >> > Now, I could automatically add an indicator of whether or not a
>> >> > call_srcu() had happened, but then again, that would either add a
>> >> > call_srcu() scalability bottleneck or again require a scan of all
possible
>> >> > CPUs... to figure out if it was necessary to scan all possible
CPUs.
>> >> >
>> >> > Or is scanning all possible CPUs down in the noise in this case?
Or
>> >> > am I missing a trick that would reduce the overhead?
>> >>
>> >> Module unloading implicitly does a synchronize_rcu (for RCU-sched),
and
>> >> a stop_machine. So I would be tempted to say that overhead of
iteration
>> >> over all CPUs might not matter that much considering the rest.
>> >>
>> >> About notifying that a call_srcu has happened for the srcu domain in a
>> >> scalable fashion, let's see... We could have a flag
"call_srcu_used"
>> >> for each call_srcu domain. Whenever call_srcu is invoked, it would
>> >> load that flag. It sets it on first use.
>> >>
>> >> The idea here is to only use that flag when srcu_barrier is performed
>> >> right before the srcu domain cleanup (it could become part of that
>> >> cleanup). Else, using it in all srcu_barrier() might be tricky,
because
>> >> we may then need to add memory barriers or locking to the call_srcu
>> >> fast-path, which is an overhead we try to avoid.
>> >>
>> >> However, if we only use that flag as part of the srcu domain cleanup,
>> >> it's already prohibited to invoke call_srcu concurrently with the
>> >> cleanup of the same domain, so I don't think we would need any
>> >> memory barriers in call_srcu.
>> >
>> > About the last part of your email, it seems to that if after call_srcu has
>> > returned, if the module could be unloaded on some other CPU - then it
would
>> > need to see the flag stored by the preceding call_srcu, so I believe there
>> > would be a memory barrier between the two opreations (call_srcu and module
>> > unload).
>>
>> In order for the module unload not to race against module execution, it needs
>> to happen after the call_srcu in a way that is already ordered by other means,
>> else module unload races against the module code.
>>
>> >
>> > Also about doing the unconditional srcu_barrier, since a module could be
>> > unloaded at any time - don't all SRCU using modules need to invoke
>> > srcu_barrier() during their clean up anyway so we are incurring the
barrier
>> > overhead anyway? Or, am I missing a design pattern here? It seems to me
>> > rcutorture module definitely calls srcu_barrier() before it is unloaded.
>>
>> I think a valid approach which is even simpler might be: if a module statically
>> defines a SRCU domain, it should be expected to use it. So adding a
>> srcu_barrier()
>> to its module going notifier should not hurt. The rare case where a module
>> defines
>> a static SRCU domain *and* does not actually use it with call_srcu() does not
>> seem that usual, and not worth optimizing for.
>>
>> Thoughts ?
>
> Most SRCU users use only synchronize_srcu(), and don't ever use
> call_srcu(). Which is not too surprising given that call_srcu() showed
> up late in the game.
>
> But something still bothers me about this, and I am not yet sure
> what. One thing that seems to reduce anxiety somewhat is doing the
> srcu_barrier() on all calls to cleanup_srcu_struct() rather than just
> those invoked from the modules infrastructure, but I don't see why at
> the moment.
Indeed, providing similar guarantees for the dynamic allocation case
would be nice.
The one thing that is making me anxious here is use-cases where
users would decide to chain their call_srcu(). Then they would
need as many srcu_barrier() as chain hops. This would be a valid
reason for leaving invocation of srcu_barrier() to the user and
not hide it under the hood.
Thoughts ?
The current state is not horrible, so my thought would be to give it
some time to see if better thoughts arise.
Either way, cleanup_srcu_struct() keeps its current checks for callbacks
still being in flight, which is why I believe that the current state is
not horrible. ;-)
Thanx, Paul