> The mixed mapping problem is made slightly more difficult by the fact
> that we add persistent memory to the direct-map when allocating struct
> page, but probably not insurmountable. Also, this still has the
> syscall overhead that a MAP_SYNC semantic eliminates, but we need to
> collect numbers to see if that matters.
> However, chatting with Andy R. about the NVML use case, the library
> alternates between streaming non-temporal writes and byte-accesses +
> clwb(). The byte accesses get slower with a write-through mapping.
> So, performance data is needed all around to see where these options land.
When you say "byte-access + clwb()", do you mean literally write a
byte, clwb, write a byte, clwb... or do you mean lots of byte accesses
and then one clwb? If the former, I suspect it could be changed to
non-temporal store + sfence and be faster.
Typically a mixture. That is, there are times where we store a pointer
and follow it immediately with CLWB, and there are times where we do
lots of work and then decide to commit what we've done by running over
a range doing CLWB. In our libraries, NT stores are easy to use because
we control the code. But one of the benefits of pmem is that applications
can access data structures in-place, without calling through APIs for
every pointer de-reference, so it gets sort of impractical to require
NT stores. Imagine, for example, as part of an update to pmem you want
to strcpy() or sprintf() or some other function you didn't write. Following
that with a call to a commit API that flushes things is easier on the
app developer than requiring them to have NT store versions of all those
functions.
My understanding is that non-temporal store + sfence doesn't leave the
data in the cache, though, which is unfortunate for some use cases.
That matches my understanding.
The real solution would be for Intel to add an efficient operation to
force writeback on a large region of physical pages.
This is under investigation, but unfortunately not available just yet...