DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.
Dave, can you please take this through the XFS tree as we discussed during
the v4 review?
Changes since v4:
- Reworked the DAX flags handling to simplify things and get rid of
RADIX_DAX_PTE. (Jan & Christoph)
- Moved RADIX_DAX_* macros to be inline functions in include/linux/dax.h.
- Got rid of unneeded macros RADIX_DAX_HZP_ENTRY() and
RADIX_DAX_EMPTY_ENTRY(), and instead just pass arbitrary flags to
- Re-ordered the arguments to dax_wake_mapping_entry_waiter() to be more
consistent with the rest of the code. (Jan)
- Moved radix_dax_order() inside of the #ifdef CONFIG_FS_DAX_PMD block.
This was causing a build error on various systems that don't define
- Patch 5 fixes what I believe is a missing error return in
- Fixed the page_start calculation for PMDs that was previously found in
dax_entry_start(). (Jan) This code is now included directly in
- dax_entry_waitqueue() now sets up the struct exceptional_entry_key of
the caller as a service to reduce code duplication. (Christoph)
- In grab_mapping_entry() we now hold the radix tree entry lock for PMD
downgrades while we release the tree_lock and do an
- Removed our last BUG_ON() in dax.c, replacing it with a WARN_ON_ONCE()
and an error return.
- The dax_iomap_fault() and dax_iomap_pmd_fault() handlers both now call
ops->iomap_end() to ensure that we properly balance the
ops->iomap_begin() calls with respect to locking, allocations, etc.
- Removed __GFP_FS from the vmf.gfp_mask used in dax_iomap_pmd_fault().
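The page_start fix and the PMD locking items in the list above both rest on the same arithmetic: every page offset covered by one PMD has to round down to the same starting index. The following is only an illustration in Python, not the fs/dax.c code. It assumes x86-64 values (4 KiB pages, 2 MiB PMDs), and the names pmd_start_index() and same_pmd_range() are made up for this example.

```python
# Illustrative sketch only; this is not the fs/dax.c implementation.
PAGE_SHIFT = 12                      # assumption: 4 KiB base pages (x86-64)
PMD_SHIFT = 21                       # assumption: 2 MiB PMD mappings (x86-64)
PMD_ORDER = PMD_SHIFT - PAGE_SHIFT   # 9, i.e. 512 base pages per PMD


def pmd_start_index(index: int) -> int:
    """Round a page offset down to the first offset covered by its PMD."""
    return index & ~((1 << PMD_ORDER) - 1)


def same_pmd_range(a: int, b: int) -> bool:
    """True if two page offsets fall inside the same PMD-sized range."""
    return pmd_start_index(a) == pmd_start_index(b)


if __name__ == "__main__":
    for idx in (0, 511, 512, 513, 1023, 1024):
        print(idx, "->", pmd_start_index(idx))
```

With these assumptions, offsets 512 through 1023 all round down to 512, so anything keyed on the rounded index (a lock or a wait queue, for instance) is shared by the whole 2 MiB range.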
Thank you again to Jan, Christoph and Dave for their review feedback.
Here are some related things that are not included in this patch set, but
which I plan on doing in the near future:
- Add tracepoint support for the PTE and PMD based DAX fault handlers.
- Move the DAX 4k zero page handling to use a single 4k zero page instead
of allocating pages on demand. This will mirror the way that things are
done for the 2 MiB case, and will reduce the amount of memory we use
when reading 4k holes in DAX.
- Change the API to the PMD fault handler so it takes a vmf, and at a
layer above DAX make sure that the vmf.gfp_mask given to DAX for both
PMD and PTE faults doesn't include __GFP_FS. (Jan)
These work items will happen after review & integration with Jan's patch
set for DAX radix tree cleaning.
This series was built upon xfs/xfs-4.9-reflink with PMD performance fixes
from Toshi Kani and Dan Williams. Dan's patch has already been merged for
v4.8, and Toshi's patches are currently queued in Andrew Morton's mm tree
for v4.9 inclusion. These patches are not needed for correct operation,
only for good performance.
Here is a tree containing my changes:
This tree has passed xfstests for ext2, ext4 and XFS both with and without
DAX, and has passed targeted testing where I inserted, removed and flushed
DAX PTEs and PMDs in every combination I could think of.
Previously reported performance numbers:
In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults. Here is the FIO
script I used for my testing:
Here are the performance results with XFS using only PTE faults:
READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec
Here are performance numbers for that same test using PMD faults:
READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec
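As a sanity check on the "more than doubles" claim, the aggregate bandwidths and runtimes quoted above can be compared directly. This is a minimal sketch; the only assumption is that fio's KB/s and MB/s in this output are 1024-based.

```python
# Figures copied from the fio output above.
KB_PER_MB = 1024  # assumption: fio reports 1024-based KB and MB here

pte = {"read_kbs": 557610, "write_kbs": 559084, "runtime_ms": 1878}
pmd = {"read_mbs": 1406.7, "write_mbs": 1410.4, "runtime_ms": 727}

# Convert the PMD numbers to KB/s so both runs use the same unit.
read_speedup = pmd["read_mbs"] * KB_PER_MB / pte["read_kbs"]
write_speedup = pmd["write_mbs"] * KB_PER_MB / pte["write_kbs"]

# The runtime ratio is independent of the unit assumption.
runtime_speedup = pte["runtime_ms"] / pmd["runtime_ms"]

print(f"read:    {read_speedup:.2f}x")
print(f"write:   {write_speedup:.2f}x")
print(f"runtime: {runtime_speedup:.2f}x")
```

All three ratios come out at roughly 2.6x, and the runtime ratio agrees with the bandwidth ratios, which suggests the unit assumption is reasonable for these numbers.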
This was done on a random lab machine with a PMEM device made from memmap'd
RAM. To get XFS to use PMD faults, I did the following:
mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem0
xfs_io -c "extsize 2m" /mnt/pmem0
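The 2 MiB stripe unit and extent size hint above matter because, as far as I understand it, a PMD mapping is only usable when the whole 2 MiB window around the faulting offset is backed by one extent whose device address is also 2 MiB aligned. The sketch below illustrates that condition in Python. It is not how XFS or fs/dax.c implement the check; pmd_mappable() and its parameters are hypothetical names, and the 2 MiB PMD size assumes x86-64.

```python
# Illustrative sketch only; names and the exact condition are assumptions.
PMD_SIZE = 2 * 1024 * 1024  # assumption: a PMD maps 2 MiB (x86-64)


def pmd_mappable(fault_off: int, ext_file_start: int,
                 ext_dev_start: int, ext_len: int) -> bool:
    """Check whether the PMD-sized window around fault_off could be
    covered by a single, suitably aligned extent.

    All arguments are byte offsets/lengths: fault_off and ext_file_start
    are file offsets, ext_dev_start is the extent's device offset.
    """
    # Start of the 2 MiB window in the file that contains the fault.
    window = fault_off & ~(PMD_SIZE - 1)

    # The whole window must lie inside the extent.
    if window < ext_file_start:
        return False
    if window + PMD_SIZE > ext_file_start + ext_len:
        return False

    # The device address backing the window must be 2 MiB aligned too.
    dev = ext_dev_start + (window - ext_file_start)
    return dev % PMD_SIZE == 0


if __name__ == "__main__":
    MiB = 1024 * 1024
    print(pmd_mappable(3 * MiB, 2 * MiB, 8 * MiB, 2 * MiB))         # aligned
    print(pmd_mappable(3 * MiB, 2 * MiB, 8 * MiB + 4096, 2 * MiB))  # misaligned
```

Under this model, small or unaligned allocations would force a fall back to PTE faults, which is why the filesystem is asked to allocate in aligned 2 MiB units.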
Ross Zwisler (17):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  ext2: return -EIO on ext2_iomap_end() failure
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 718 ++++++++++++++++++++++++++++------------------------
 fs/ext2/file.c      |  35 +--
 fs/ext2/inode.c     |   4 +-
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  60 ++++-
 mm/filemap.c        |   6 +-
 10 files changed, 466 insertions(+), 404 deletions(-)
I boot my DAX test machine with "memmap=8G!16G,8G!24G" on the kernel
command line to give me two 8GB pmem devices. This has worked fine
on all kernels including 4.8. I just updated that test machine to a
TOT Linus kernel (4.9), and now I get a single 16GB pmem device.
i.e. the memory map the kernel generates is different. This is
what I get on boot from a 4.9 kernel:
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdefff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bffdf000-0x00000000bfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000083fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] e820: user-defined physical RAM map:
[ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] user: [mem 0x0000000000100000-0x00000000bffdefff] usable
[ 0.000000] user: [mem 0x00000000bffdf000-0x00000000bfffffff] reserved
[ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] user: [mem 0x0000000100000000-0x00000003ffffffff] usable
[ 0.000000] user: [mem 0x0000000400000000-0x00000007ffffffff] persistent (type 12)
[ 0.000000] user: [mem 0x0000000800000000-0x000000083fffffff] usable
On 4.8, I get two persistent (type 12) sections, each of 8GB. 4.9 is
giving me a single 16GB region. This needs to behave like a 4.8
kernel and return two persistent regions - persistent memory device
setup cannot be allowed to change from kernel to kernel. Changes in
mapping and device setup like this will cause corruption and/or loss
of data in the persistent memory devices that have changed shape,
size or disappeared....
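For reference, the address ranges requested by "memmap=8G!16G,8G!24G" can be worked out by hand or with a few lines of Python. The sketch below relies on the documented meaning of memmap=nn!ss (mark nn bytes starting at ss as protected/persistent memory); parse_size() and parse_memmap() are helper names made up for this example, and it only handles the "!" form used here.

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Parse a size such as '8G' or '4096' into bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def parse_memmap(arg: str) -> list[tuple[int, int]]:
    """Turn 'nn!ss[,nn!ss...]' into a list of inclusive (start, end) ranges."""
    regions = []
    for item in arg.split(","):
        size, start = item.split("!")
        first = parse_size(start)
        regions.append((first, first + parse_size(size) - 1))
    return regions


regions = parse_memmap("8G!16G,8G!24G")
for first, last in regions:
    print(f"[mem {first:#018x}-{last:#018x}] persistent (type 12)")

# The second range starts right where the first one ends.
adjacent = regions[0][1] + 1 == regions[1][0]
print("adjacent:", adjacent)
```

The two requested ranges are 0x400000000-0x5ffffffff and 0x600000000-0x7ffffffff. They are contiguous, so the single 0x400000000-0x7ffffffff entry in the 4.9 log above is consistent with the two ranges having been coalesced into one; the sketch does not show where in the kernel that happens.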