Since the last RFC patch set much of the discussion of supporting RDMA with
FS DAX has been around the semantics of the lease mechanism. Within that
thread it was suggested I try and write some documentation and/or tests for the
new mechanism being proposed. I have created a foundation to test lease
functionality within xfstests. This should be close to being accepted.
Before writing additional lease tests, or changing lots of kernel code, this
email presents documentation for the new proposed "layout lease" semantic.
At Linux Plumbers just over a week ago, I presented the current state of the
patch set and the outstanding issues. Based on the discussion there, as well as
follow up emails, I propose the following addition to the fcntl() man page.
<fcntl man page addition>
Layout (F_LAYOUT) leases are special leases which can be used to control and/or
be informed about the manipulation of the underlying layout of a file.
A layout is defined as the logical file block -> physical file block mapping
including the file size and sharing of physical blocks among files. Note that
the unwritten state of a block is not considered part of file layout.
**Read layout lease (F_RDLCK | F_LAYOUT)**
Read layout leases can be used to be informed of layout changes by the
system or other users. This lease is similar to the standard read (F_RDLCK)
lease in that any attempt to change the _layout_ of the file will be reported to
the process through the lease break process. But this lease is different
because the file can be opened for write and data can be read and/or written to
the file as long as the underlying layout of the file does not change.
Therefore, the lease is not broken if the file is simply open for write, but
_may_ be broken if an operation such as truncate(), fallocate() or write()
results in changing the underlying layout.
**Write layout lease (F_WRLCK | F_LAYOUT)**
Write layout leases can be used to break read layout leases to indicate that
the process intends to change the underlying layout of the file.
A process which has taken a write layout lease has exclusive ownership of the
file layout and can modify that layout as long as the lease is held.
Operations which change the layout are allowed by that process. But operations
from other file descriptors which attempt to change the layout will break the
lease through the standard lease break process. The F_LAYOUT flag is used to
indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT. In
the F_LAYOUT case opens for write do not break the lease. But some operations,
if they change the underlying layout, may.
The distinction between read layout leases and write layout leases is that
write layout leases can change the layout without breaking the lease within the
owning process. This is useful to guarantee a layout prior to specifying the
unbreakable flag described below.
**Unbreakable Layout Leases (F_UNBREAK)**
In order to support pinning of file pages by direct user space users an
unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
lease. When specified, F_UNBREAK indicates that any user attempting to break
the lease will fail with ETXTBSY rather than follow the normal lease break
process.
Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
specified. The difference between an unbreakable read layout lease and an
unbreakable write layout lease is that an unbreakable read layout lease is
_not_ exclusive. This means that once a layout is established on a file,
multiple unbreakable read layout leases can be taken by multiple processes and
used to pin the underlying pages of that file.
Care must therefore be taken to ensure that the layout of the file is as the
user wants prior to using the unbreakable read layout lease. A safe mechanism
to do this would be to take a write layout lease and use fallocate() to set the
layout of the file. The layout lease can then be "downgraded" to unbreakable
read layout as long as no other user broke the write layout lease.
</fcntl man page addition>
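As an illustration of the sequence the proposed text describes (take a write layout lease, set the layout with fallocate(), then downgrade to an unbreakable read layout lease), a user space caller might look roughly like the sketch below. It rests on assumptions: F_LAYOUT and F_UNBREAK are only proposed, so the numeric values used here are placeholders and not a settled ABI, and layout_lease_arg() and pin_file_layout() are hypothetical helper names. A kernel without the proposal would most likely reject these lease arguments with an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * ASSUMPTION: F_LAYOUT and F_UNBREAK are proposed flags. The values
 * below are placeholders chosen for this sketch, not a real ABI.
 */
#ifndef F_LAYOUT
#define F_LAYOUT  0x10
#endif
#ifndef F_UNBREAK
#define F_UNBREAK 0x20
#endif

/* Hypothetical helper: build the F_SETLEASE argument for a layout lease. */
int layout_lease_arg(int type, int unbreakable)
{
	assert(type == F_RDLCK || type == F_WRLCK);
	return type | F_LAYOUT | (unbreakable ? F_UNBREAK : 0);
}

/*
 * Hypothetical helper: establish the layout of the first len bytes of fd
 * and then pin it. Returns 0 on success or a negative errno value.
 */
int pin_file_layout(int fd, off_t len)
{
	int err;

	/* 1. Take exclusive ownership of the file layout. */
	if (fcntl(fd, F_SETLEASE, layout_lease_arg(F_WRLCK, 0)) < 0)
		return -errno;

	/* 2. Set the layout; the lease owner may change it without a break. */
	if (fallocate(fd, 0, 0, len) < 0) {
		err = -errno;
		fcntl(fd, F_SETLEASE, F_UNLCK);
		return err;
	}

	/* 3. "Downgrade" to an unbreakable read layout lease. */
	if (fcntl(fd, F_SETLEASE, layout_lease_arg(F_RDLCK, 1)) < 0) {
		err = -errno;
		fcntl(fd, F_SETLEASE, F_UNLCK);
		return err;
	}
	return 0;
}
```

The sketch does not show how the caller learns that another user broke the write layout lease before the downgrade; the text above only says that the downgrade is valid as long as that did not happen.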
This patch series enables DAX support for the virtio-fs filesystem. The patches
are based on the 5.3-rc5 kernel and need the first patch series posted for
virtio-fs support with subject "virtio-fs: shared file system for virtual
Enabling DAX seems to improve performance a great deal for most
operations. I have reported performance numbers in the first patch
series, so I am not repeating them here.
Any comments or feedback are welcome.
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (4):
dax: remove block device dependencies
fuse, dax: add fuse_conn->dax_dev field
virtio_fs, dax: Set up virtio_fs dax_device
fuse, dax: add DAX mmap support
Vivek Goyal (12):
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Keep a list of free dax memory ranges
fuse: implement FUSE_INIT map_alignment field
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Maintain a list of busy elements
dax: Create a range version of dax_layout_busy_page()
fuse: Add logic to free up a memory range
fuse: Release file in process context
fuse: Take inode lock for dax inode truncation
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 108 +++
fs/dax.c | 89 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/fuse/cuse.c | 3 +-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 1206 +++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 99 ++-
fs/fuse/inode.c | 138 +++-
fs/fuse/virtio_fs.c | 134 +++-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 12 +-
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 47 +-
include/uapi/linux/virtio_fs.h | 3 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 11 +-
19 files changed, 1868 insertions(+), 53 deletions(-)
Before people get too excited this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. There are 2 primary concerns to
resolve: enumerate the remaining features/fixes, and identify a path
to implement it all without regressing any existing application use
cases.
An enumeration of remaining projects follows, please expand this list
if I missed something:
* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance where the latter aspect of DAX has no current api to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application visible
semantics, the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards incompatible ways.
* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
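For reference, the existing MAP_SYNC facility mentioned in the first item is requested at mmap() time together with MAP_SHARED_VALIDATE. The sketch below shows one way an application might ask for it and fall back to a plain shared mapping when the filesystem does not support it. map_prefer_sync() is a hypothetical helper name, and the fallback flag values are the generic Linux ones, used only if the libc headers lack them. The proposed MAP_DIRECT, MADV_SYNC and MADV_DIRECT do not exist yet and are deliberately not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers (generic Linux values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: map len bytes of fd, preferring MAP_SYNC.
 * *got_sync is set to 1 only when the MAP_SYNC mapping succeeded.
 * Returns the mapping or MAP_FAILED.
 */
void *map_prefer_sync(int fd, size_t len, int *got_sync)
{
	void *p;

	assert(got_sync != NULL);
	*got_sync = 0;

	/* MAP_SHARED_VALIDATE makes the kernel reject unknown flags. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*got_sync = 1;
		return p;
	}

	/* Only fall back when the failure means "MAP_SYNC not supported". */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;

	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

A caller that gets got_sync == 0 still has a usable mapping, but has to flush with msync() or fsync() rather than rely on CPU cache flushes alone.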
Secondary projects, i.e. important but I would submit are not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granularity smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful.
* Userfaultfd for file-backed mappings and DAX
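To make the complaint about the sigaction() based mechanism concrete, the sketch below shows roughly what an application can learn today from a SIGBUS with a BUS_MCEERR_* code: the faulting address and, via si_addr_lsb, the size of the affected region as a power of two. The struct and function names are hypothetical, and the recovery policy (exit on an action-required error) is only a placeholder for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record of the most recent memory error notification. */
struct mce_report {
	volatile sig_atomic_t seen;
	void *addr;             /* address reported in si_addr */
	size_t span;            /* size of the affected region in bytes */
	int action_required;    /* 1 for BUS_MCEERR_AR, 0 for BUS_MCEERR_AO */
};

static struct mce_report *mce_out;

/* si_addr_lsb is the least significant valid bit of si_addr. */
size_t mce_span_bytes(int addr_lsb)
{
	return (size_t)1 << addr_lsb;
}

static void on_sigbus(int sig, siginfo_t *si, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		mce_out->addr = si->si_addr;
		mce_out->span = mce_span_bytes(si->si_addr_lsb);
		mce_out->action_required = (si->si_code == BUS_MCEERR_AR);
		mce_out->seen = 1;
		if (!mce_out->action_required)
			return; /* advisory: execution can continue */
	}
	/* Placeholder policy: returning would just retry the bad access. */
	_exit(1);
}

/* Install the handler; reports are written to *out. Returns 0 on success. */
int install_mce_handler(struct mce_report *out)
{
	struct sigaction sa;

	assert(out != NULL);
	mce_out = out;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

The handler can only see what si_addr and si_addr_lsb carry, and it gets no file or file offset, which is the gap a filesystem-integrated interface would fill.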
Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations
This patchset aims to take care of this issue to make reflink and dedupe
work correctly (actually in the read/write path; there are still some problems,
such as the page->mapping and page->index issue, in the mmap path) in XFS under
It is based on Goldwyn's patchsets: "v4 Btrfs dax support" and the latest
iomap. I borrowed some related patches and made a few fixes so that it
basically works.
For dax framework:
1. adapt to the latest change in iomap (two iomaps).
1. distinguish dax write/zero from normal write/zero.
2. remap extents after COW.
3. add file contents comparison function based on dax framework.
4. use xfs_break_layouts() instead of break_layout to support dax.
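For anyone who wants to exercise the dedupe path from user space, the existing FIDEDUPERANGE ioctl is the entry point: the kernel compares the two ranges and only shares the extents if the contents are identical, which is where a file contents comparison function comes into play. The sketch below is a minimal wrapper around that ioctl; dedupe_range() is a hypothetical helper name and nothing in it is specific to this patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Hypothetical helper: ask the kernel to dedupe len bytes at dst_off in
 * dst_fd against src_off in src_fd.
 * Returns 0 if the ranges matched and were deduped, 1 if the contents
 * differ, or a negative errno value on failure.
 */
int dedupe_range(int src_fd, uint64_t src_off, int dst_fd, uint64_t dst_off,
		 uint64_t len, uint64_t *bytes_deduped)
{
	struct file_dedupe_range *arg;
	int ret;

	assert(bytes_deduped != NULL);
	*bytes_deduped = 0;

	/* One destination entry follows the fixed part of the request. */
	arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
	if (!arg)
		return -ENOMEM;

	arg->src_offset = src_off;
	arg->src_length = len;
	arg->dest_count = 1;
	arg->info[0].dest_fd = dst_fd;
	arg->info[0].dest_offset = dst_off;

	if (ioctl(src_fd, FIDEDUPERANGE, arg) < 0) {
		ret = -errno;
	} else if (arg->info[0].status < 0) {
		ret = arg->info[0].status;      /* per-destination errno */
	} else if (arg->info[0].status == FILE_DEDUPE_RANGE_DIFFERS) {
		ret = 1;                        /* contents did not match */
	} else {
		*bytes_deduped = arg->info[0].bytes_deduped;
		ret = 0;
	}

	free(arg);
	return ret;
}
```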
Goldwyn Rodrigues (3):
dax: replace mmap entry in case of CoW
fs: dedup file range to use a compare function
dax: memcpy before zeroing range
Shiyang Ruan (4):
dax: Introduce dax_copy_edges() for COW.
dax: copy data before write.
xfs: handle copy-on-write in fsdax write() path.
xfs: support dedupe for fsdax.
fs/btrfs/ioctl.c | 3 +-
fs/dax.c | 211 +++++++++++++++++++++++++++++++++++++----
fs/iomap/buffered-io.c | 8 +-
fs/ocfs2/file.c | 2 +-
fs/read_write.c | 11 ++-
fs/xfs/xfs_bmap_util.c | 6 +-
fs/xfs/xfs_file.c | 10 +-
fs/xfs/xfs_iomap.c | 3 +-
fs/xfs/xfs_iops.c | 11 ++-
fs/xfs/xfs_reflink.c | 79 ++++++++-------
include/linux/dax.h | 16 ++--
include/linux/fs.h | 9 +-
12 files changed, 291 insertions(+), 78 deletions(-)
Changes since v1 :
- Simplify the profile to a hopefully non-controversial set of
attributes that address the most common sources of contributor
confusion, or maintainer frustration.
- Rename "Subsystem Profile" to "Maintainer Entry Profile". Not every
entry in MAINTAINERS represents a full subsystem. There may be driver
local considerations to communicate to a submitter in addition to wider
- Delete the old P: tag in MAINTAINERS rather than convert to a new E:
tag (Joe Perches).
At last year's Plumbers Conference I proposed the Maintainer Entry
Profile as a document that a maintainer can provide to set contributor
expectations and provide fodder for a discussion between maintainers
about the merits of different maintainer policies.
For those that did not attend, the goal of the Maintainer Entry Profile,
and the Maintainer Handbook more generally, is to provide a desk
reference for maintainers both new and experienced. The session
The first rule of kernel maintenance is that there are no hard and
fast rules. That state of affairs is both a blessing and a curse. It
has served the community well to be adaptable to the different
people and different problem spaces that inhabit the kernel
community. However, that variability also leads to inconsistent
experiences for contributors, little to no guidance for new
contributors, and unnecessary stress on current maintainers. There
are quite a few people who have been around long enough to make
enough mistakes that they have gained some hard earned proficiency.
However if the kernel community expects to keep growing it needs to
be able to both scale the maintainers it has and ramp new ones without
necessarily letting them make a decade's worth of mistakes to learn the
To be clear, the proposed document does not impose or suggest new
rules. Instead it provides an outlet to document the unwritten rules
and policies in effect for each subsystem, and that each subsystem
might decide differently for whatever reason.
Dan Williams (3):
MAINTAINERS: Reclaim the P: tag for Maintainer Entry Profile
Maintainer Handbook: Maintainer Entry Profile
libnvdimm, MAINTAINERS: Maintainer Entry Profile
Documentation/maintainer/index.rst | 1
.../maintainer/maintainer-entry-profile.rst | 99 ++++++++++++++++++++
Documentation/nvdimm/maintainer-entry-profile.rst | 64 +++++++++++++
MAINTAINERS | 20 ++--
4 files changed, 175 insertions(+), 9 deletions(-)
create mode 100644 Documentation/maintainer/maintainer-entry-profile.rst
create mode 100644 Documentation/nvdimm/maintainer-entry-profile.rst
Masahiro Yamada <yamada.masahiro(a)socionext.com> writes:
> Now that there is no overwrap between symbols from ELF files and
> ones from Module.symvers.
> So, the 'exported twice' warning should be reported irrespective
> of where the symbol in question came from. Only the exceptional case
> is when __crc_<sym> symbol appears before __ksymtab_<sym>. This
> typically occurs for EXPORT_SYMBOL in .S files.
After applying this patch, I get the following modpost warnings when doing:
$ make M=tools/testing/nvdimm
Building modules, stage 2.
MODPOST 12 modules
WARNING: tools/testing/nvdimm/libnvdimm: 'nvdimm_bus_lock' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'nvdimm_bus_unlock' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'is_nvdimm_bus_locked' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'devm_nvdimm_memremap' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'nd_fletcher64' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'to_nd_desc' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
WARNING: tools/testing/nvdimm/libnvdimm: 'to_nvdimm_bus_dev' exported twice. Previous export was in drivers/nvdimm/libnvdimm.ko
There are a lot of these warnings. :) If I revert this patch, no
> Signed-off-by: Masahiro Yamada <yamada.masahiro(a)socionext.com>
> scripts/mod/modpost.c | 1 -
> 1 file changed, 1 deletion(-)
> diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
> index 5234555cf550..6ca38d10efc5 100644
> --- a/scripts/mod/modpost.c
> +++ b/scripts/mod/modpost.c
> @@ -2457,7 +2457,6 @@ static void read_dump(const char *fname, unsigned int kernel)
> s = sym_add_exported(symname, namespace, mod,
> s->kernel = kernel;
> - s->preloaded = 1;
> s->is_static = 0;
> sym_update_crc(symname, mod, crc, export_no(export));