[PATCH] ext2, ext4: Fix issue with missing journal entry
by Ross Zwisler
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.
Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().
Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
fs/ext2/file.c | 19 +------------------
fs/ext4/file.c | 19 ++-----------------
2 files changed, 3 insertions(+), 35 deletions(-)
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2c88d68..c1400b1 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -80,23 +80,6 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
return ret;
}
-static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct inode *inode = file_inode(vma->vm_file);
- struct ext2_inode_info *ei = EXT2_I(inode);
- int ret;
-
- sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
- down_read(&ei->dax_sem);
-
- ret = __dax_mkwrite(vma, vmf, ext2_get_block, NULL);
-
- up_read(&ei->dax_sem);
- sb_end_pagefault(inode->i_sb);
- return ret;
-}
-
static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
struct vm_fault *vmf)
{
@@ -124,7 +107,7 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
static const struct vm_operations_struct ext2_dax_vm_ops = {
.fault = ext2_dax_fault,
.pmd_fault = ext2_dax_pmd_fault,
- .page_mkwrite = ext2_dax_mkwrite,
+ .page_mkwrite = ext2_dax_fault,
.pfn_mkwrite = ext2_dax_pfn_mkwrite,
};
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1126436..d2e8500 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -262,23 +262,8 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
return result;
}
-static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- int err;
- struct inode *inode = file_inode(vma->vm_file);
-
- sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
- down_read(&EXT4_I(inode)->i_mmap_sem);
- err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
- up_read(&EXT4_I(inode)->i_mmap_sem);
- sb_end_pagefault(inode->i_sb);
-
- return err;
-}
-
/*
- * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
+ * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_fault()
* handler we check for races agaist truncate. Note that since we cycle through
* i_mmap_sem, we are sure that also any hole punching that began before we
* were called is finished by now and so if it included part of the file we
@@ -311,7 +296,7 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
.pmd_fault = ext4_dax_pmd_fault,
- .page_mkwrite = ext4_dax_mkwrite,
+ .page_mkwrite = ext4_dax_fault,
.pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
#else
--
2.5.0
[GIT PULL] dax-fixes for 4.5-rc6
by Ross Zwisler
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git dax-fixes
This fixes several issues with the current DAX code, including possible data
corruption and kernel OOPSes. This also includes bugs with raw block devices
that never opt in to DAX, so these can affect existing applications and setups.
1) DAX is used by default on raw block devices that are capable of
supporting it. This creates an issue because there are still uses of the
block device that use the page cache, and having one block device user
doing DAX I/O and another doing page cache I/O can lead to data corruption.
2) When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. This wasn't true in all cases.
3) ext4 online defrag combined with DAX I/O could lead to data corruption.
4) The DAX block/sector zeroing code needs a valid struct block_device,
which it wasn't always getting. This could lead to a kernel OOPS.
5) The DAX writeback code needs a valid struct block_device, which it
wasn't always getting. This could lead to a kernel OOPS.
6) The DAX writeback code needs to be called for sync(2) and syncfs(2).
This could lead to data loss.
I know DAX fixes have historically gone up through Andrew Morton's -mm tree,
but for some reason he's been silent on this series for the last few weeks. I
think that the problems being fixed are important enough that we really
shouldn't wait until v4.6.
Please let me know if you'd like additional justification on why I think these
should be merged, or if you have any questions.
Thanks,
- Ross
----------------------------------------------------------------
Dan Williams (1):
block: disable block device DAX by default
Ross Zwisler (4):
ext2, ext4: only set S_DAX for regular inodes
ext4: Online defrag not supported with DAX
dax: give DAX clearing code correct bdev
dax: move writeback calls into the filesystems
block/Kconfig | 13 +++++++++++++
fs/block_dev.c | 19 +++++++++++++++++--
fs/dax.c | 21 +++++++++++----------
fs/ext2/inode.c | 16 +++++++++++++---
fs/ext4/inode.c | 6 +++++-
fs/ext4/ioctl.c | 5 +++++
fs/xfs/xfs_aops.c | 6 +++++-
fs/xfs/xfs_aops.h | 1 +
fs/xfs/xfs_bmap_util.c | 3 ++-
include/linux/dax.h | 8 +++++---
mm/filemap.c | 12 ++++--------
11 files changed, 81 insertions(+), 29 deletions(-)
commit 1f20410488863337259f528b3210c464c72ee27c
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Sun Feb 7 00:19:13 2016 -0700
dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.
Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
commit dda4dcbdc9242eb600aa2d271d80bf7e1762aa63
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Fri Feb 5 22:07:04 2016 -0700
dax: give DAX clearing code correct bdev
dax_clear_blocks() needs a valid struct block_device and previously it was
using inode->i_sb->s_bdev in all cases. This is correct for normal inodes
on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time devices.
Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its
arguments to take a bdev and a sector instead of an inode and a block.
This better reflects what the function does, and it allows the filesystem
and raw block device code to pass in an appropriate struct block_device.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
commit 0e2dcfb5b46129c01738d610b7a4aa4165800d5e
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Sat Feb 13 21:44:27 2016 -0700
ext4: Online defrag not supported with DAX
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.
Fix this by only allowing online defrag for non-DAX files.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
commit 10d08a7339df8a252c7365d8877c72acd2aed109
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Fri Feb 12 18:15:25 2016 -0700
ext2, ext4: only set S_DAX for regular inodes
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated with
the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that are written back via
dax_writeback_mapping_range().
With the current code, though, this isn't always true. For example, ext2
and ext4 directory inodes can have S_DAX set, but have their dirty data
stored as dirty page cache entries. For these types of inodes, having
S_DAX set doesn't really make sense since their I/O doesn't actually happen
through the DAX code path.
Instead, only allow S_DAX to be set for regular inodes for ext2 and ext4.
This allows us to have strict DAX vs non-DAX paths in the writeback code.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
commit 67e8c633958de5c168ed857c94a4573cc0442c97
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri Feb 12 13:08:47 2016 -0800
block: disable block device DAX by default
The recent *sync enabling discovered that we are inserting into the
block_device pagecache, counter to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.
We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.
[<ffffffff81576d93>] dump_stack+0x85/0xc2
[<ffffffff812b9ee0>] dax_writeback_mapping_range+0x60/0xe0
[<ffffffff812a1d4f>] blkdev_writepages+0x3f/0x50
[<ffffffff811db011>] do_writepages+0x21/0x30
[<ffffffff811cb6a6>] __filemap_fdatawrite_range+0xc6/0x100
[<ffffffff811cb75a>] filemap_write_and_wait+0x4a/0xa0
[<ffffffff812a15e0>] set_blocksize+0x70/0xd0
[<ffffffff812a273d>] sb_set_blocksize+0x1d/0x50
[<ffffffff8132ac9b>] ext4_fill_super+0x75b/0x3360
[<ffffffff81583381>] ? vsnprintf+0x201/0x4c0
[<ffffffff815836d9>] ? snprintf+0x49/0x60
[<ffffffff81263010>] mount_bdev+0x180/0x1b0
[<ffffffff8132a540>] ? ext4_calculate_overhead+0x370/0x370
[<ffffffff8131ad95>] ext4_mount+0x15/0x20
[<ffffffff81263908>] mount_fs+0x38/0x170
Mark the support broken so it's disabled by default, but otherwise still
available for testing.
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox(a)intel.com>
Cc: Al Viro <viro(a)ftp.linux.org.uk>
Reported-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Suggested-by: Dave Chinner <david(a)fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
[GIT PULL] libnvdimm, nfit: fixes for 4.5-rc6
by Williams, Dan J
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive:
1/ Two fixes for compatibility with the ACPI 6.1 specification.
Without these fixes multi-interface DIMMs will fail to be probed, and
address range scrub commands to find memory errors will give results
that the kernel will mis-interpret. For multi-interface DIMMs Linux
will accept either the original 6.0 implementation or 6.1. For address
range scrub we'll only support 6.1 since ACPI formalized this DSM
differently than the original example [1] implemented in v4.2. The
expectation is that production systems will only ever ship the ACPI 6.1
address range scrub command definition.
2/ The wider async address range scrub work targeting 4.6 discovered
that the original synchronous implementation in 4.5 is not sizing its
return buffer correctly.
3/ Arnd caught that my recent fix to the size of the pfn_t flags missed
updating the flags variable used in the pmem driver.
4/ Toshi found that we mishandle the memremap() return value in
devm_memremap().
This branch has received a clean build success notification from the
kbuild robot across 105 configs.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
The following changes since commit 18558cae0272f8fd9647e69d3fec1565a7949865:
Linux 4.5-rc4 (2016-02-14 13:05:20 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to c45442055dfdeb265cc20c9eeaa9fd11a75fbf51:
nvdimm: use 'u64' for pfn flags (2016-02-23 17:17:20 -0800)
----------------------------------------------------------------
Arnd Bergmann (1):
nvdimm: use 'u64' for pfn flags
Dan Williams (3):
nfit: fix multi-interface dimm handling, acpi6.1 compatibility
libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
nfit: update address range scrub commands to the acpi 6.1 format
Toshi Kani (1):
devm_memremap: Fix error value when memremap failed
drivers/acpi/nfit.c | 90 ++++++++++++++++++++--------------------
drivers/nvdimm/bus.c | 20 ++++-----
drivers/nvdimm/pmem.c | 2 +-
include/linux/libnvdimm.h | 3 +-
include/uapi/linux/ndctl.h | 11 ++++-
kernel/memremap.c | 4 +-
tools/testing/nvdimm/test/nfit.c | 8 +++-
7 files changed, 75 insertions(+), 63 deletions(-)
commit c45442055dfdeb265cc20c9eeaa9fd11a75fbf51
Author: Arnd Bergmann <arnd(a)arndb.de>
Date: Mon Feb 22 22:58:34 2016 +0100
nvdimm: use 'u64' for pfn flags
A recent bugfix changed pfn_t to always be 64-bit wide, but did not
change the code in pmem.c, which is now broken on 32-bit architectures
as reported by gcc:
In file included from ../drivers/nvdimm/pmem.c:28:0:
drivers/nvdimm/pmem.c: In function 'pmem_alloc':
include/linux/pfn_t.h:15:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
This changes the intermediate pfn_flags in struct pmem_device to
be 64 bit wide as well, so they can store the flags correctly.
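The truncation can be reproduced in userspace with a minimal sketch. It
reuses the PFN_DEV definition quoted in the gcc output above and assumes
BITS_PER_LONG_LONG is 64. The two structs and both helper functions are
hypothetical stand-ins, with uint32_t modelling a 32-bit 'unsigned long';
they are not the actual struct pmem_device.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: long long is 64 bits wide. */
#define BITS_PER_LONG_LONG 64
#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))

/* Hypothetical stand-ins for the flags field before and after the fix. */
struct flags_narrow { uint32_t pfn_flags; };	/* 32-bit unsigned long */
struct flags_wide   { uint64_t pfn_flags; };	/* u64 */

/* Returns nonzero if PFN_DEV is still set after storing it in 32 bits. */
int pfn_dev_survives_narrow(void)
{
	struct flags_narrow p;

	/* The explicit cast drops bit 61, which is what gcc warned about. */
	p.pfn_flags = (uint32_t)PFN_DEV;
	return (p.pfn_flags & PFN_DEV) != 0;
}

/* Returns nonzero if PFN_DEV is still set after storing it in 64 bits. */
int pfn_dev_survives_wide(void)
{
	struct flags_wide p;

	assert(sizeof(p.pfn_flags) == 8);
	p.pfn_flags = PFN_DEV;
	return (p.pfn_flags & PFN_DEV) != 0;
}
```

The narrow variant loses the flag entirely because PFN_DEV sits in bit 61,
well above what a 32-bit field can hold.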
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Fixes: db78c22230d0 ("mm: fix pfn_t vs highmem")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 93f834df9c2d4e362dfdc4b05daa0a4e18814836
Author: Toshi Kani <toshi.kani(a)hpe.com>
Date: Sat Feb 20 14:32:24 2016 -0800
devm_memremap: Fix error value when memremap failed
devm_memremap() returns an ERR_PTR() value in case of error.
However, it returns NULL when memremap() failed. This causes
the caller, such as the pmem driver, to proceed and oops later.
Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
failed.
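A userspace sketch of why the NULL return is a problem for callers that
only test IS_ERR(). The ERR_PTR()/PTR_ERR()/IS_ERR() helpers below are
simplified re-implementations for illustration, and fake_memremap(),
remap_old(), remap_fixed() and caller_status() are hypothetical names,
not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical mapping routine: returns NULL on failure. */
static void *fake_memremap(int fail)
{
	static char region[16];

	return fail ? NULL : region;
}

/* Old behaviour: the NULL from the mapping call is passed through. */
void *remap_old(int fail)
{
	return fake_memremap(fail);
}

/* Fixed behaviour: NULL is translated to ERR_PTR(-ENXIO). */
void *remap_fixed(int fail)
{
	void *addr = fake_memremap(fail);

	return addr ? addr : ERR_PTR(-ENXIO);
}

/* A caller that only checks IS_ERR(): 0 on success, negative errno else. */
long caller_status(void *addr)
{
	if (IS_ERR(addr))
		return PTR_ERR(addr);
	return 0;
}
```

With remap_old(1) the caller sees status 0 and would go on to dereference
NULL; with remap_fixed(1) it sees -ENXIO and can bail out.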
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 4577b0665515e0abc7bc72562d6328d179598815
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed Feb 17 13:08:58 2016 -0800
nfit: update address range scrub commands to the acpi 6.1 format
The original format of these commands from the "NVDIMM DSM Interface
Example" [1] is superseded by the ACPI 6.1 definition of the "NVDIMM Root
Device _DSMs" [2].
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
"9.20.7 NVDIMM Root Device _DSMs"
Changes include:
1/ New 'restart' fields in ars_status, unfortunately these are
implemented in the middle of the existing definition so this change
is not backwards compatible. The expectation is that shipping
platforms will only ever support the ACPI 6.1 definition.
2/ New status values for ars_start ('busy') and ars_status ('overflow').
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Linda Knippers <linda.knippers(a)hpe.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 747ffe11b440ef9ea752888806d3aac677ca52a4
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri Feb 19 15:21:14 2016 -0800
libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
Use the output length specified in the command to size the receive
buffer rather than the arbitrary 4K limit.
This bug was hiding the fact that the ndctl implementation of
ndctl_bus_cmd_new_ars_status() was not specifying an output buffer size.
Cc: <stable(a)vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 6697b2cf69d4363266ca47eaebc49ef13dabc1c9
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Thu Feb 4 16:51:00 2016 -0800
nfit: fix multi-interface dimm handling, acpi6.1 compatibility
ACPI 6.1 clarified that multi-interface dimms require multiple control
region entries (DCRs) per dimm. Previously we were assuming that a
control region is only present when block-data-windows are present.
This implementation was done with an eye to compatibility with the
looser ACPI 6.0 interpretation of this table.
1/ When coalescing the memory device (MEMDEV) tables for a single dimm,
coalesce on device_handle rather than control region index.
2/ Whenever we discover a control region with non-zero block windows
re-scan for block-data-window (BDW) entries.
We may need to revisit this if a DIMM ever implements a format interface
outside of blk or pmem, but that is not on the foreseeable horizon.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>