On Thu, Aug 30, 2018 at 02:49:07PM -0400, Mike Snitzer wrote:
> On Thu, Aug 30 2018 at 5:30am -0400, Jan Kara <jack(a)suse.cz> wrote:
> > Well, changing device from DAX-capable to DAX-incapable is problematic for
> > filesystem on top of it as well. Filesystems simply don't expect this
> > feature of a device can change so they would fail in unexpected ways. Also
> > PFNs from the pmem (DAX-capable) device that are already mapped to user page
> > tables won't magically become unmapped so those processes will still have
> > DAX access to those areas of the device.
>
> As you point out, how are the upper layers (e.g. filesystems)
> to reliably cope with this runtime switch from DAX to non-DAX access?

They can't right now. There are unsolved races between page faults,
invalidations and changing the file operations to/from DAX
dynamically. This is the entire problem facing the dynamic per-inode
DAX on/off flag - if it happens globally to the filesystem without
warning, then the filesystem is screwed.
To support the block device changing between DAX and non-DAX
dynamically, the filesystem needs to first invalidate the entire
filesystem cache, eject all cached inodes from memory, any cached
metadata that is using DAX, etc. to clear out all the DAX mappings
it has. And it has to do it without racing with new page faults or
IO that might map new DAX pages. And I'm ignoring the fact that we
can't eject referenced inodes (i.e. open files) from the inode cache
and so we currently cannot safely change the DAX on such files.
That's a blocker right now.
Once we can safely change the DAX state of open files, we've got to
co-ordinate the block device state change with the filesystem - the
filesystem wide invalidation has to be done before the block device can
start the change of state, and the filesystem must remain completely
stopped until the block device has completed its change of state.
So AFAICT this ends up being "stop the world instantly, eject the
world from memory, rebuild the world from scratch, start the world
again". Freezing the filesystem doesn't stop the world - we can
still do read IO and page faults, so that doesn't prevent pagefault
races with the invalidation leaving DAX references in the page
cache. Hence we currently have no valid "stop the world" mechanism
in the kernel other than unmount, which we can't do while there are
What about MAP_SYNC applications? If we turn off DAX with those
applications still running, we silently break them and users won't
know until the system loses power and they see data corruption after
the system comes back. However, applications SEGVing unpredictably
because of "transparent" storage state changes is almost as
Dynamically changing block device DAX support seems like a
non-starter to me. At least, it's a non-starter until we add a lot
more infrastructure, solve a bunch of really hard problems and
define how active userspace controlled DAX-only features behave when
DAX is no longer available...
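
To make the MAP_SYNC concern concrete, here is a minimal userspace sketch of
how an application typically asks for a MAP_SYNC mapping. The helper name
map_prefer_sync() and its fallback policy are hypothetical, for illustration
only, and the fallback flag values are the asm-generic ones, assumed for libc
headers that predate the flags. The point it shows: the DAX check happens
once, at mmap() time, so nothing in the application would notice a later
switch of the device to non-DAX.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: asm-generic values, used only if the libc headers lack them. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map the first len bytes of path, preferring MAP_SYNC.
 * On return, *got_sync is 1 if the kernel granted a MAP_SYNC mapping and 0
 * if we fell back to plain MAP_SHARED (the caller must then msync()/fsync()
 * to make stores durable). Returns MAP_FAILED on error.
 */
void *map_prefer_sync(const char *path, size_t len, int *got_sync)
{
	assert(got_sync != NULL);
	*got_sync = 0;

	int fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	/* MAP_SHARED_VALIDATE makes the kernel reject unsupported flags. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*got_sync = 1;
	} else if (errno == EOPNOTSUPP || errno == EINVAL) {
		/* Not a DAX file (or an old kernel): fall back to MAP_SHARED. */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	}

	close(fd);	/* the mapping stays valid after close() */
	return p;
}
```

An application that sees *got_sync == 1 will skip msync() from then on, which
is exactly the assumption that silently breaks if DAX goes away underneath it.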