On Tue, 31 Mar 2020, Elliott, Robert (Servers) wrote:
> -----Original Message-----
> From: Mikulas Patocka <mpatocka@redhat.com>
> Sent: Monday, March 30, 2020 6:32 AM
> To: Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Ira
> Weiny <ira.weiny@intel.com>; Mike Snitzer <msnitzer@redhat.com>
> Cc: linux-nvdimm@lists.01.org; dm-devel@redhat.com
> Subject: [PATCH v2] memcpy_flushcache: use cache flusing for larger
> sizes
> I tested dm-writecache performance on a machine with Optane nvdimm
> and it turned out that for larger writes, cached stores + cache
> flushing perform better than non-temporal stores. This is the
> throughput of dm-writecache measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> block size   512        1024       2048       4096
> movnti       496 MB/s   642 MB/s   725 MB/s   744 MB/s
> clflushopt   373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
> We can see that for smaller blocks, movnti performs better, but for
> larger blocks, clflushopt has better performance.
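
For reference, the comparison is roughly between the existing movnti-based
loop and something like the following (a minimal sketch, not the actual
patch; it assumes x86-64, a cache-line-aligned destination and a length
that is a multiple of 64 bytes):

/* Cached stores followed by a clflushopt per 64-byte line - the
 * "clflushopt" variant from the table above, written out as a sketch. */
static void copy_and_flush_clflushopt(void *dst, const void *src, size_t n)
{
        char *d = dst;
        size_t off;

        memcpy(dst, src, n);                    /* ordinary cached stores */
        for (off = 0; off < n; off += 64)       /* flush each written line */
                asm volatile("clflushopt %0" : "+m" (d[off]));
        asm volatile("sfence" ::: "memory");    /* order the flushes */
}

In kernel code this would presumably use the existing clflushopt()/clwb()
helpers rather than raw asm; the sfence is there because clflushopt is
only weakly ordered.
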
> There are other interactions to consider... see threads from the last
> few years on the linux-nvdimm list.
dm-writecache is the only Linux driver that uses memcpy_flushcache on
persistent memory. There is also the btt driver; it uses the "do_io"
method to write to persistent memory, and I don't know where this
method ends up.
Anyway, if patching memcpy_flushcache conflicts with something else, we
should introduce memcpy_flushcache_to_pmem.
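
Roughly along these lines (just a sketch: the name follows the suggestion
above, the 512-byte cutoff is a made-up placeholder, and
copy_and_flush_clflushopt() is the sketch from earlier in this mail):

static inline void memcpy_flushcache_to_pmem(void *dst, const void *src,
                                             size_t n)
{
        if (n < 512)                            /* placeholder cutoff */
                memcpy_flushcache(dst, src, n);         /* movnti path */
        else
                copy_and_flush_clflushopt(dst, src, n); /* cached + clflushopt */
}

That way the behaviour of memcpy_flushcache() itself stays unchanged for
any other users.
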
> For example, software generally expects that read()s take a long time
> and avoids re-reading from disk; the normal pattern is to hold the data
> in memory and read it from there. By using normal stores, CPU caches
> end up holding a bunch of persistent memory data that is probably not
> going to be read again any time soon, bumping out more useful data. In
> contrast, movnti avoids filling the CPU caches.
But if I write one cache line and flush it immediately, it would consume
just one associative entry in the cache.
> Another option is the AVX vmovntdq instruction (if available), the
> most recent of which does 64-byte (cache line) sized transfers using
> zmm registers. There's a hefty context switching overhead (e.g.,
> 304 clocks), and the CPU often runs AVX instructions at a slower
> clock frequency, so it's hard to judge when it's worthwhile.
The benchmark below shows that 64-byte non-temporal avx512 vmovntdq
stores perform about the same as 8-, 16- or 32-byte non-temporal writes.
sequential write-nt  4 bytes            4.1 GB/s   1.3 GB/s
sequential write-nt  8 bytes            4.1 GB/s   1.3 GB/s
sequential write-nt 16 bytes (sse)      4.1 GB/s   1.3 GB/s
sequential write-nt 32 bytes (avx)      4.2 GB/s   1.3 GB/s
sequential write-nt 64 bytes (avx512)   4.1 GB/s   1.3 GB/s
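
The 64-byte row is essentially a loop of vmovntdq stores from zmm
registers; in user-space intrinsics it looks roughly like this (a sketch
of the kind of loop measured, not the benchmark itself; assumes AVX512F,
a 64-byte-aligned destination and a length that is a multiple of 64,
built with -mavx512f):

#include <immintrin.h>
#include <stddef.h>

static void copy_nt_avx512(void *dst, const void *src, size_t n)
{
        size_t off;

        for (off = 0; off < n; off += 64) {
                __m512i v = _mm512_loadu_si512((const char *)src + off);
                _mm512_stream_si512((__m512i *)((char *)dst + off), v); /* vmovntdq */
        }
        _mm_sfence();   /* make the non-temporal stores globally visible */
}
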
With cached writes (where each cache line is immediately followed by clwb
or clflushopt), 8-, 16- or 32-byte writes perform better than non-temporal
stores, and avx512 performs worse.
sequential write  8 + clwb              5.1 GB/s   1.6 GB/s
sequential write 16 (sse) + clwb        5.1 GB/s   1.6 GB/s
sequential write 32 (avx) + clwb        4.4 GB/s   1.5 GB/s
sequential write 64 (avx512) + clwb     1.7 GB/s   0.6 GB/s
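
The clwb rows are the same loop with ordinary stores plus one clwb per
line; the 8-byte case looks roughly like this (again just a user-space
sketch; assumes CLWB support, 64-byte-aligned buffers and a length that
is a multiple of 64, built with -mclwb):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void copy_8_clwb(void *dst, const void *src, size_t n)
{
        uint64_t *d = dst;
        const uint64_t *s = src;
        size_t off;
        int i;

        for (off = 0; off < n; off += 64) {
                for (i = 0; i < 8; i++)                 /* 8 cached 8-byte stores */
                        d[off / 8 + i] = s[off / 8 + i];
                _mm_clwb((char *)dst + off);            /* write back that line */
        }
        _mm_sfence();
}
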
> In user space, glibc faces similar choices for its memcpy()
> implementation. glibc memcpy() uses non-temporal stores for
> transfers > 75% of the L3 cache size divided by the number of cores.
> For example, with glibc-2.26-16.fc27 (August 2017), on a Broadwell
> E5-2699 system with 36 cores and a 45 MiB L3 cache, non-temporal
> stores are used for memcpy()s over 36 MiB.
BTW, what does glibc do with reads? Does it flush them from the cache
after they are consumed?

AFAIK glibc doesn't support persistent memory - i.e. there is no function
that flushes data, and the user has to use inline assembly for that.
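
For completeness, the inline assembly in question is only a few lines,
e.g. (a user-space sketch; assumes a CPU with clwb - clflushopt or
clflush would have to be substituted on older parts):

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64UL

/* Flush a range of persistent memory from the CPU cache and order the
 * flushes; this is the kind of helper glibc doesn't provide. */
static void flush_to_pmem(const void *addr, size_t len)
{
        uintptr_t p = (uintptr_t)addr & ~(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHE_LINE)
                asm volatile("clwb %0" : "+m" (*(volatile char *)p));
        asm volatile("sfence" ::: "memory");
}
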
> It'd be nice if glibc, PMDK, and the kernel used the same approach.