On Tue, Feb 23, 2016 at 09:09:47PM -0700, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 03:56:17PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2016 at 3:43 PM, Jeff Moyer <jmoyer(a)redhat.com> wrote:
> > > Dan Williams <dan.j.williams(a)intel.com> writes:
> > >
> > >> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer(a)redhat.com>
> > >>>> The crux of the problem, in my opinion, is that we're asking for an "I
> > >>>> know what I'm doing" flag, and I expect that's an
> > >>>> for a filesystem to trust generically.
> > >>>
> > >>> The file system already trusts that. If an application
> > >>> fsync properly, guess what, it will break. This line of
> > >>> doesn't make any sense to me.
> > >>
> > >> No, I'm worried about the case where an app specifies
> > >> uses fsync correctly, and fails to flush cpu cache.
> > >
> > > I don't think the kernel needs to put training wheels on
> > >
> > >>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of the
> > >>>> opinion that the "I know what I'm doing" interface should be something
> > >>>> separate from today's trusted filesystems.
> > >>>
> > >>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
> > >>> doing" interface, right?
> > >>
> > >> It is the "I know what I'm doing" interface, MAP_PMEM_AWARE asserts "I
> > >> know when to flush the cpu relative to an fsync()".
> > >
> > > I see. So I think your argument is that new file systems (such as Nova)
> > > can have whacky new semantics, but existing file systems should provide
> > > the more conservative semantics that they have provided since the dawn
> > > of time (even if we add a new mmap flag to control the behavior).
> > >
> > > I don't agree with that. :)
> > >
> > Fair enough. Recall, I was pushing MAP_DAX not too long ago. It just
> > seems like a Sisyphean effort to push an mmap flag up the XFS hill and
> > maybe that effort is better spent somewhere else.
>
> Well, for what it's worth MAP_SYNC feels like the "right" solution to
> understand that we are a ways from having it implemented, but it seems like
> the correct way to have applications work with persistent memory in a perfect
> world, and worth the effort.
>
> MAP_PMEM_AWARE is interesting, but even in a perfect world it seems like a
> partial solution - applications still need to call *sync to get the FS
> metadata to be durable, and they have no reliable way of knowing which of
> their actions will cause the metadata to be out of sync.
>
> Dave, is your objection to the MAP_SYNC idea a practical one about complexity
> and time to get it implemented, or do you think it's the wrong solution?

Jan, I just noticed that this chain didn't CC you or linux-fsdevel, so you
may have missed it. All the gory details are here:

Let me provide a little background for my question. (Everyone else on the
thread feel free to jump in if you feel like my summary is incorrect or

There is a new persistent memory programming model outlined on pmem.io and
implemented by the NVM Library (nvml).

This new programming model is based on the idea that an application should be
able to create a DAX mmap, and from then on satisfy the data durability
requirements of the application purely in userspace. This is done by using
non-temporal stores or cached writes followed by flushes, the same way that we
do things in the kernel.
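As a rough illustration of that model, here is a minimal sketch in C of the
"cached writes followed by flushes" variant. It assumes an x86 CPU with SSE2
and a file that lives on a DAX-mounted filesystem. The helper names
(map_shared, flush_range, persist_copy) and the 64-byte cache line size are
assumptions made for this example; they are not taken from NVML or from the
kernel. On other CPUs the flush falls back to a plain memory barrier, which is
only a placeholder and gives no persistence guarantee.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#if defined(__SSE2__)
#include <emmintrin.h>          /* _mm_clflush(), _mm_sfence() */
#endif

#define CACHELINE 64            /* assumed cache line size in bytes */

/* Hypothetical helper: create/size a file and map it shared, read-write. */
static void *map_shared(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: push every cache line covering [addr, addr+len)
 * out of the CPU cache, then order the flushes with a store fence. */
static void flush_range(const void *addr, size_t len)
{
#if defined(__SSE2__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
#else
    /* Placeholder only: a barrier does NOT write data back to media. */
    (void)addr;
    (void)len;
    __sync_synchronize();
#endif
}

/* Cached write followed by a flush of the written range. */
static void persist_copy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    flush_range(dst, len);
}
```

An application would call map_shared() once and then use persist_copy() for
every update it wants durable. Note that this only covers the data written
through the mapping; it says nothing about filesystem metadata.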

Dave was concerned that this breaks down for XFS because even if the
application were to sync all its writes to media, the filesystem could be
making associated metadata changes that the application wouldn't and couldn't

To sync these metadata changes to media, the application would still need to

One proposal from Christoph was that we could add a MAP_SYNC flag that
essentially says "make all metadata operations synchronous":

The worry is that this would be complex to implement, and that we maybe don't
want yet another DAX special case in the FS code.

Another way that we could implement this would be to key off of the DAX mount
option / inode setting for all mmaps that use DAX. This would preclude the
need for changes to the mmap() API.

My question: How far away are we from having such a metadata durability
guarantee in ext4? Do we have cases where the metadata changes associated
with a page fault, etc. could be out of sync with the data writes that are
being made durable by the application in userspace?

I see ext4 creating journal entries around page faults in places like
ext4_dax_fault() - this should durably record any metadata changes associated
with that page fault before the fault completes, correct?

That is not true. Journalling makes sure metadata changes are recorded in
the journal but you have to commit the transaction to make the change
really durable. That happens either in response to sync / fsync or
asynchronously every couple of seconds. So with ext4 you have exactly the
same issues with durability as with XFS.
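To make that concrete, here is a minimal sketch in C of what an application
has to do today to be sure both its data and the associated metadata are
durable. The function name write_with_metadata_sync is hypothetical, and
msync() stands in for the user-space cache flushes that a DAX-aware
application would use for its data. The fsync() call is the part that asks
the filesystem to make the metadata changes (file size, block allocation)
durable, which on ext4 means committing the journal transaction.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to a fresh file through a shared
 * mapping and make data AND metadata durable. Returns 0 on success. */
static int write_with_metadata_sync(const char *path, const void *data,
                                    size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;

    /* Metadata change: the file size goes from 0 to len. */
    if (ftruncate(fd, (off_t)len) == 0) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (p != MAP_FAILED) {
            /* The first touch of each page faults; the filesystem may
             * allocate blocks here, which is another metadata change. */
            memcpy(p, data, len);

            /* Data durability. A DAX-aware application would flush CPU
             * caches in user space instead; msync() is the portable
             * stand-in used in this sketch. */
            if (msync(p, len, MS_SYNC) == 0 &&
                /* Metadata durability: without this the size and block
                 * allocation changes may still be uncommitted. */
                fsync(fd) == 0)
                rc = 0;

            munmap(p, len);
        }
    }

    close(fd);
    return rc;
}
```

The point of the sketch is the final fsync(): flushing the data alone, however
carefully, does not cover the metadata changes made on the application's
behalf.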

Are there other cases you can think of with ext4 where we would need
*sync for DAX just to be sure we are safely synchronizing metadata?

So I think implementing something like MAP_SYNC semantics for ext4 is
reasonably doable. Basically we would have to make sure that we commit a
transaction already during a page fault, which is not that hard to do.

Jan Kara <jack(a)suse.com>
SUSE Labs, CR