On Tue, Feb 23, 2016 at 04:10:50PM +0200, Boaz Harrosh wrote:
> On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
> <>
> >
> > And this is BS. Using msync or fsync might not perform as well as not
> > actually using them, but without them you do not get persistence. If
> > you use your pmem as a throw away cache that's fine, but for most people
> > that is not the case.
> >
> Hi Christoph
> That is exactly my suggestion. My approach is *not* that we do not call
> m/fsync to let the FS clean up.
> In my model we still do that, only we eliminate the m/fsync slowness
> and all the page-fault overhead by being instructed by the application
> that we do not need to track the modified data cachelines, since the
> application is telling us that it will do so.
> In my model the job is split:
> App will take care of data persistence by specifying MAP_PMEM_AWARE,
> and doing its own cl_flushing / movnt.
> Which is the heavy cost.
> The FS will keep track of the Meta-Data persistence as it already does, via the
> call to m/fsync. Which is a marginal cost compared to the above heavy
> IO.
> Note that the FS is still free to move blocks around, as Dave said:
> lock out page faults, unmap from user space, let the app fault again on a new
> block. This will still work as before; already in COW we flush the old
> block so there will be no persistence lost.
> So this whole thread started with my patches, and my patches do not say
> "no m/fsync", they say: make this 3-8 times faster than today if the app
> is participating in the heavy lifting.
> Please tell me what you find wrong with my approach?

It seems like we are trying to solve a couple of different problems:
1) Make page faults faster by skipping any radix tree insertions, tag updates,
etc.
2) Make fsync/msync faster by not flushing data that the application says it
is already making durable from userspace.
I agree that your approach seems to help with both of these problems, but I
would argue that it is an incomplete solution for problem #2 because an
fsync/msync from the PMEM aware application would still flush any radix tree
entries from *other* threads that were writing to the same file.
It seems like a more direct solution for #2 above would be to have a
metadata-only equivalent of fsync/fdatasync, say "fmetasync", which says
"I'll make the writes I do to my mmaps durable from userspace, but I need you
to sync all filesystem metadata for me, please".
This would allow a complete separation of data synchronization in userspace
from metadata synchronization in kernel space by the filesystem code.
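To make that split concrete, here is a rough userspace sketch in C. It is only
an illustration under several assumptions: x86-64 with 64-byte cache lines, an
ordinary file standing in for a DAX mapping, and helper names (flush_range,
fmetasync_sketch, pmem_style_write) that are made up for this example. No
fmetasync() syscall exists, so the stand-in simply calls fsync(), which also
writes back data; a real metadata-only call would skip that part. The proposed
MAP_PMEM_AWARE flag is not used because it is not in any released kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <emmintrin.h>          /* _mm_clflush, _mm_sfence */
#endif

#define CACHELINE 64            /* ASSUMPTION: 64-byte cache lines */

/* Application-side data flush: write back every cache line that
 * overlaps [addr, addr + len), then fence. */
static void flush_range(const void *addr, size_t len)
{
    assert(addr != NULL);
#if defined(__x86_64__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
#else
    /* Placeholder only: a barrier is NOT a cache flush. A real port
     * would use the architecture's own write-back instruction. */
    (void)len;
    __sync_synchronize();
#endif
}

/* HYPOTHETICAL: fmetasync() does not exist. fsync() is used as a
 * stand-in and does more work than a metadata-only sync would. */
static int fmetasync_sketch(int fd)
{
    return fsync(fd);
}

/* Write msg (including its NUL) through a shared mapping, flush the
 * data from userspace, then ask the kernel to sync metadata. */
static int pmem_style_write(const char *path, const char *msg)
{
    size_t len = strlen(msg) + 1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);      /* data written by the app ...        */
    flush_range(map, len);      /* ... and made durable by the app    */
    int rc = fmetasync_sketch(fd);  /* metadata left to the filesystem */

    munmap(map, len);
    close(fd);
    return rc;
}
```

A caller would use it as pmem_style_write("/mnt/pmem/file", "hello"), where
the path is just an example. On a non-DAX file the cache-line flush only
pushes data to DRAM, so it is the fsync() stand-in that actually provides
durability there.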
By itself an fmetasync() type solution of course would do nothing for issue #1
- if that were a compelling issue you'd need something like the mmap flag you're
proposing to skip work on page faults.
All that being said, though, I agree with others in the thread that we should
still be focused on correctness, as we have a lot of correctness issues
remaining. When we eventually get to the place where we are trying to do
performance optimizations, those optimizations should be measurement driven.
- Ross