On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <hch(a)lst.de>
> >
> > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > blocks for writing and the inode is pinned, and has dirty fields other
> > than the timestamps.
>
> That's "fdatasync dirty", not "fsync dirty".
Correct.
> IOMAP_F_DIRTY needs a far better description of it's semantics than
> "/* block mapping is not yet on persistent storage */" so we know
> exactly what filesystems are supposed to be implementing here. I
> suspect that what it really is meant to say is:
>
> /*
> * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
> * written data and requires fdatasync to commit to persistent storage.
> */
I'll update the comment. Thanks!
> [....]
>
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index f179bdf1644d..b43be199fbdf 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -33,6 +33,7 @@
> > #include "xfs_error.h"
> > #include "xfs_trans.h"
> > #include "xfs_trans_space.h"
> > +#include "xfs_inode_item.h"
> > #include "xfs_iomap.h"
> > #include "xfs_trace.h"
> > #include "xfs_icache.h"
> > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > }
> >
> > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > + iomap->flags |= IOMAP_F_DIRTY;
>
> This is the very definition of an inode that is "fdatasync dirty".
>
> Hmmmm, shouldn't this also be set for read faults, too?
No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
data to the page which he'd then like to be persistent. The only reason why
I thought it could be useful for a while was that it would be nice to make
MAP_SYNC mapping provide the guarantee that data you see now is the data
you'll see after a crash
Isn't that the entire point of MAP_SYNC? i.e. That when we return
from a page fault, the app knows that the data and it's underlying
extent is on persistent storage?
but we cannot provide that guarantee for RO
mapping anyway if someone else has the page mapped as well. So I just
decided not to return IOMAP_F_DIRTY for read faults.
If there are multiple MAP_SYNC mappings to the inode, I would have
expected that they all sync all of the data/metadata on every page
fault, regardless of who dirtied the inode. An RO mapping doesn't
mean the data/metadata on the inode can't change, it just means it
can't change through that mapping. Running fsync() to guarantee the
persistence of that data/metadata doesn't actually changing any
data....
IOWs, if read faults don't guarantee the mapped range has stable
extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
because it's not giving consistent guarantees to userspace. Yes, it
works fine when only one MAP_SYNC mapping is modifying the inode,
but the moment we have concurrent operations on the inode that
aren't MAP_SYNC or O_SYNC this goes out the window....
Cheers,
Dave.
--
Dave Chinner
david(a)fromorbit.com