On 18-11-09 16:46:02, Alexander Duyck wrote:
> On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> > On 18-11-09 15:14:35, Alexander Duyck wrote:
> > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > > This patchset is essentially a refactor of the page initialization logic
> > > > > that is meant to provide for better code reuse while providing a
> > > > > significant improvement in deferred page initialization performance.
> > > > >
> > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > > > memory per node I have seen the following. In the case of regular memory
> > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > > > average. For the persistent memory the initialization time dropped from
> > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > > deferred memory initialization performance, and a 26% improvement in the
> > > > > persistent memory initialization performance.
> > > >
> > > > Hi Alex,
> > > >
> > > > Please try to run your persistent memory init experiment with Daniel's
> > > > patches:
> > > >
> > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracl...
> > > I've taken a quick look at it. It seems like a bit of a brute force way
> > > to try and speed things up. I would be worried about it potentially
> >
> > There is a limit to the max number of threads that ktasks start. The memory
> > throughput is *much* higher than what one CPU can max out in a node, so
> > there is no reason to leave the other CPUs sitting idle during boot when
> > they can help to initialize.
>
> Right, but right now that limit can still be pretty big when it is
> something like 25% of all the CPUs on a 288 CPU system.
It is still OK. That is about 9 threads per node.
That machine has 1T of memory, which means 8 nodes need to initialize 2G
of memory each. With 46G/s throughput it should take 0.043s, which is about
10 times lower than the 0.325s Daniel sees, so there is still room
to saturate the memory throughput.

Now, if the multi-threading efficiency is good, it should take
1.261s / 9 threads = 0.14s
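For reference, the arithmetic above can be checked with a quick script. All figures (288 CPUs, 8 nodes, the 25% ktask thread cap, 46G/s bandwidth, and the 1.261s single-thread time) are the ones quoted in this thread; the per-node memory bandwidth is an assumed round number, not a measured one:

```python
# Back-of-the-envelope check of the thread-count and timing numbers
# discussed in this thread. Purely arithmetic, no kernel code involved.

cpus = 288
nodes = 8
max_threads = cpus // 4                    # ktask cap: 25% of all CPUs -> 72
threads_per_node = max_threads // nodes    # -> 9 threads per node

mem_per_node_gb = 2                        # 2G of deferred memory per node
bandwidth_gb_s = 46                        # assumed per-node memory bandwidth
ideal_time = mem_per_node_gb / bandwidth_gb_s       # ~0.043s at full bandwidth

single_thread_time = 1.261                 # measured single-thread init time
perfect_scaling = single_thread_time / threads_per_node  # ~0.14s with 9 threads

print(threads_per_node, round(ideal_time, 3), round(perfect_scaling, 2))
```

With perfect scaling the 9 threads would land at ~0.14s, still well above the ~0.043s that the raw bandwidth allows, which is the "room to saturate" point made above.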
> One issue is that the way the code was written ends up essentially blowing
> out the cache over and over again. Doing things in two passes made it really
> expensive as you took one cache miss to initialize it, and another to
> free it. I think getting rid of that is one of the biggest gains with
> my patch set.
I am not disputing that your patches make sense; all I am saying is that
ktasks improves the init time by an order of magnitude on machines with a
large amount of memory.
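To make the two-passes-versus-one point concrete: a minimal sketch of the cache argument, in plain Python and purely illustrative (the kernel walks cold struct pages, where each first touch in a pass is a cache miss, so folding init and free into one pass roughly halves the misses):

```python
# Illustrative only: same end state, but two_pass touches every element
# in two separate sweeps (init, then free) while one_pass touches each
# element once while it is still "hot".
import copy

def two_pass(pages):
    for p in pages:              # pass 1: initialize (first touch per page)
        p["initialized"] = True
    for p in pages:              # pass 2: free (page has gone cold again)
        p["free"] = True
    return pages

def one_pass(pages):
    for p in pages:              # single pass: init and free back to back
        p["initialized"] = True
        p["free"] = True
    return pages

pages_a = [{"initialized": False, "free": False} for _ in range(4)]
pages_b = copy.deepcopy(pages_a)
two_pass(pages_a)
one_pass(pages_b)
assert pages_a == pages_b        # identical result, half the cold touches
```

The behavior is identical either way; the difference is only in how many times each cold element has to be brought into cache, which is the gain Alex's patch set targets.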
Pasha