On Wed, Feb 24, 2016 at 10:02:26AM -0500, Jeff Moyer wrote:
> Dan Williams <dan.j.williams(a)intel.com> writes:
>>> I see. So I think your argument is that new file systems (such as Nova)
>>> can have whacky new semantics, but existing file systems should provide
>>> the more conservative semantics that they have provided since the dawn
>>> of time (even if we add a new mmap flag to control the behavior).
>>> I don't agree with that. :)
>> Fair enough. Recall, I was pushing MAP_DAX not too long ago. It just
>> seems like a Sisyphean effort to push an mmap flag up the XFS hill and
>> maybe that effort is better spent somewhere else.
> Given Dave's last response to Boaz, I see what you mean, and I also
> understand Dave's reasoning better, now. FWIW, I never disagreed with
> spending effort elsewhere for now. I did think that the mmap flag was
> on the horizon, though. From Dave's comments, I think the prospects of
> that are slim to none. That's fine, at least we have a definite
> direction. Time to update all of the slide decks. =)

Well, let me clarify what I said a bit here, because I feel like I'm
being unfairly blamed for putting data integrity as the highest
priority for DAX+pmem instead of falling in line and chanting
"Performance! Performance! Performance!" with everyone else.
Let me state this clearly: I'm not opposed to making optimisations
that change the way applications and the kernel interact. I like the
idea of MAP_SYNC, but I see this sort of API/behaviour change as a
last resort when all else fails, not a "first and only" optimisation.
The big issue we have right now is that we haven't made the DAX/pmem
infrastructure work correctly and reliably for general use. Hence
adding new APIs to work around cases where we haven't yet provided
correct behaviour, let alone optimised for performance, is, quite
frankly, a clear case of premature optimisation.
We need a solid foundation on which to build a fast, safe pmem
storage stack. Rushing to add checkbox performance requirements or
features to demonstrate "progress" leads us down the path of btrfs -
a code base that we are forever struggling with because the
foundation didn't solve known hard problems at an early stage of
development (e.g. ENOSPC, single tree lock, using generic RAID and
device layers, etc). This results in a code base full of entrenched
deficiencies that are almost impossible to fix and I, personally, do
not want to end up with DAX being in a similar place.
Getting fsync to work with DAX is one of these "known hard problems"
that we really need to solve before we try to optimise for
performance. Once we have solid, workable infrastructure, we'll be
in a much better place to evaluate the merits of optimisations that
reduce or eliminate dirty tracking overhead that is required for
providing data integrity.
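
As a point of reference for the integrity contract under discussion,
here is a minimal userspace sketch, assuming an ordinary MAP_SHARED
mapping on a POSIX system. Stores through the mapping are only
guaranteed durable once msync() (or fsync()) returns, and that is the
call the kernel's dirty tracking has to back on DAX. The helper name
store_durably() and any path passed to it are hypothetical, chosen
for illustration only.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): copy msg, including its
 * terminating NUL, to the start of the file at path through a shared
 * mapping, then ask the kernel to make it durable with msync(MS_SYNC).
 * Returns 0 on success, -1 on any failure.
 */
int store_durably(const char *path, const char *msg)
{
    size_t len = strlen(msg) + 1;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Size the file so the mapped range is backed by the file. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Plain CPU stores; on their own these carry no durability promise. */
    memcpy(p, msg, len);

    /* The integrity point: block until the dirty range is written back. */
    int rc = msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return rc;
}
```

A caller would treat a zero return as the point at which the data is
safe. Nothing in the sketch is DAX-specific: the same code is expected
to work unchanged on a DAX mount if the existing semantics are kept.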
From this perspective, I'd much prefer that we look to generic
mapping infrastructure optimisations before we look to one-off API
additions for systems running PMEM. Yes, it's harder to do, but the
end result of such an approach is that everyone benefits, not just
some proprietary application that almost nobody uses.
Indeed, it may be that we need to revisit previous work like using an
rcu-aware btree for the mapping tree instead of a radix tree, as was
prototyped way back in ~2007 by Peter Zijlstra. If we can make
infrastructure changes that mostly remove the overhead of tracking
everything in the kernel, then we don't need to add special
userspace API changes to minimise the kernel tracking overhead.
Only if we can't bring the overhead of kernel-side dirty tracking
down to a reasonable level should we be considering a new API
that puts the responsibility on userspace for syncing data, and even
then we'll need to be very, very careful about it.
However, such discussions are a complete distraction from the problems
we need to solve right now, i.e. we need to focus on making DAX+pmem
work safely and reliably. Once we've done that, then we can focus on
performance optimisations and, perhaps, new interfaces to userspace.