[PATCH v1] libnvdimm, namespace: Replace kmemdup() with kstrndup()
by Andy Shevchenko
kstrndup() takes care of the '\0' terminator for strings.
Use it here instead of kmemdup() followed by explicitly terminating the input string.
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
---
drivers/nvdimm/namespace_devs.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 28afdd668905..19525f025539 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -270,11 +270,10 @@ static ssize_t __alt_name_store(struct device *dev, const char *buf,
if (dev->driver || to_ndns(dev)->claim)
return -EBUSY;
- input = kmemdup(buf, len + 1, GFP_KERNEL);
+ input = kstrndup(buf, len, GFP_KERNEL);
if (!input)
return -ENOMEM;
- input[len] = '\0';
pos = strim(input);
if (strlen(pos) + 1 > NSLABEL_NAME_LEN) {
rc = -EINVAL;
--
2.17.1
3 years, 10 months
[PATCH v2 0/3] Add support for memcpy_mcsafe
by Balbir Singh
memcpy_mcsafe() is an API currently used by the pmem subsystem to convert
errors encountered during a memcpy (machine check exceptions) into a return
value. This patchset consists of three patches:
1. The first patch is a bug fix to handle machine check errors correctly
while walking the page tables in kernel mode, due to huge pmd/pud sizes
2. The second patch adds memcpy_mcsafe() support; this is largely derived
from existing code
3. The third patch registers for callbacks on machine check exceptions and
in them uses specialized knowledge of the type of page to decide whether
to handle the MCE as is or to return to a fixup address present in
memcpy_mcsafe(). If a fixup address is used, then we return an error
value of -EFAULT to the caller.
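The return-value contract described in item 3 can be sketched in userspace. This is a simulation only: sim_memcpy_mcsafe() and its bad_off/bad_len parameters are hypothetical, and a "poisoned" byte range stands in for the machine check exception that real hardware would raise. The series itself implements the copy in PowerPC assembly and relies on a fixup address rather than an explicit check.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Simulated memcpy_mcsafe(): copy n bytes from src to dst, but stop and
 * return -EFAULT when the next source byte falls inside the simulated
 * poisoned range [bad_off, bad_off + bad_len). A bad_len of 0 means no
 * poison. Returns 0 when the whole copy completed.
 */
static int sim_memcpy_mcsafe(void *dst, const void *src, size_t n,
			     size_t bad_off, size_t bad_len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i < n; i++) {
		if (bad_len && i >= bad_off && i - bad_off < bad_len)
			return -EFAULT;
		d[i] = s[i];
	}
	return 0;
}
```

A caller checks the return value instead of assuming the copy succeeded, and treats the destination as only partially written when -EFAULT comes back.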
Testing
A large part of the testing was done under a simulator by selectively
inserting machine check exceptions in a test driver doing memcpy_mcsafe
via ioctls.
Changelog v2
- Fix the logic of shifting in addr_to_pfn
- Use shift consistently instead of PAGE_SHIFT
- Fix a typo in patch1
Balbir Singh (3):
powerpc/mce: Bug fixes for MCE handling in kernel space
powerpc/memcpy: Add memcpy_mcsafe for pmem
powerpc/mce: Handle memcpy_mcsafe
arch/powerpc/include/asm/mce.h | 3 +-
arch/powerpc/include/asm/string.h | 2 +
arch/powerpc/kernel/mce.c | 77 ++++++++++++-
arch/powerpc/kernel/mce_power.c | 26 +++--
arch/powerpc/lib/Makefile | 2 +-
arch/powerpc/lib/memcpy_mcsafe_64.S | 212 ++++++++++++++++++++++++++++++++++++
6 files changed, 308 insertions(+), 14 deletions(-)
create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S
--
2.13.6
3 years, 10 months
[PATCH 0/5] fix radix tree multi-order iteration race
by Ross Zwisler
The following series gets the radix tree test suite compiling again in
the current linux/master, adds a unit test which exposes a race in the
radix tree multi-order iteration code, and then fixes that race.
This race was initially hit on a v4.15 based kernel and results in a GP
fault. I've described the race in detail in patches 4 and 5.
The fix is simple and necessary, and I think it should be merged for
v4.17.
This tree has gotten positive build confirmation from the 0-day bot,
passes the updated radix tree test suite, xfstests, and the original
test that was hitting the race with the v4.15 based kernel.
Ross Zwisler (5):
radix tree test suite: fix mapshift build target
radix tree test suite: fix compilation issue
radix tree test suite: add item_delete_rcu()
radix tree test suite: multi-order iteration race
radix tree: fix multi-order iteration race
lib/radix-tree.c | 6 ++--
tools/include/linux/spinlock.h | 3 +-
tools/testing/radix-tree/Makefile | 6 ++--
tools/testing/radix-tree/multiorder.c | 63 +++++++++++++++++++++++++++++++++++
tools/testing/radix-tree/test.c | 19 +++++++++++
tools/testing/radix-tree/test.h | 3 ++
6 files changed, 91 insertions(+), 9 deletions(-)
--
2.14.3
3 years, 11 months
[PATCH v2] mm: disallow mapping that conflict for devm_memremap_pages()
by Dave Jiang
When pmem namespaces are created smaller than the section size, this can
cause issues during removal, and a general protection fault was observed:
[ 249.613597] general protection fault: 0000 1 SMP PTI
[ 249.725203] CPU: 36 PID: 3941 Comm: ndctl Tainted: G W
4.14.28-1.el7uek.x86_64 #2
[ 249.745495] task: ffff88acda150000 task.stack: ffffc900233a4000
[ 249.752107] RIP: 0010:__put_page+0x56/0x79
[ 249.844675] Call Trace:
[ 249.847410] devm_memremap_pages_release+0x155/0x23a
[ 249.852953] release_nodes+0x21e/0x260
[ 249.857138] devres_release_all+0x3c/0x48
[ 249.861606] device_release_driver_internal+0x15c/0x207
[ 249.867439] device_release_driver+0x12/0x14
[ 249.872204] unbind_store+0xba/0xd8
[ 249.876098] drv_attr_store+0x27/0x31
[ 249.880186] sysfs_kf_write+0x3f/0x46
[ 249.884266] kernfs_fop_write+0x10f/0x18b
[ 249.888734] __vfs_write+0x3a/0x16d
[ 249.892628] ? selinux_file_permission+0xe5/0x116
[ 249.897881] ? security_file_permission+0x41/0xbb
[ 249.903133] vfs_write+0xb2/0x1a1
[ 249.906835] ? syscall_trace_enter+0x1ce/0x2b8
[ 249.911795] SyS_write+0x55/0xb9
[ 249.915397] do_syscall_64+0x79/0x1ae
[ 249.919485] entry_SYSCALL_64_after_hwframe+0x3d/0x0
Add code to check whether a mapping already exists in the same section and
prevent an additional mapping from being created if that is the case.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
v2: Change dev_warn() to dev_WARN() to provide helpful backtrace. (Robert E)
kernel/memremap.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5857267a4af5..a734b1747466 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -176,10 +176,27 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
unsigned long pfn, pgoff, order;
pgprot_t pgprot = PAGE_KERNEL;
int error, nid, is_ram;
+ struct dev_pagemap *conflict_pgmap;
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
- align_start;
+ align_end = align_start + align_size - 1;
+
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_start), NULL);
+ if (conflict_pgmap) {
+ dev_WARN(dev, "Conflicting mapping in same section\n");
+ put_dev_pagemap(conflict_pgmap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_end), NULL);
+ if (conflict_pgmap) {
+ dev_WARN(dev, "Conflicting mapping in same section\n");
+ put_dev_pagemap(conflict_pgmap);
+ return ERR_PTR(-ENOMEM);
+ }
+
is_ram = region_intersects(align_start, align_size,
IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
@@ -199,7 +216,6 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
mutex_lock(&pgmap_lock);
error = 0;
- align_end = align_start + align_size - 1;
foreach_order_pgoff(res, order, pgoff) {
error = __radix_tree_insert(&pgmap_radix,
3 years, 11 months
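The section rounding in the devm_memremap_pages() patch above explains why two small namespaces can collide. The sketch below repeats the align_start / align_size / align_end arithmetic in userspace. It assumes 128 MiB sections (SECTION_SIZE_BITS of 27, as on x86_64); struct span, section_span() and spans_conflict() are hypothetical names. The overlap test is a simplification: the patch itself probes only the first and last pfn of the aligned range with get_dev_pagemap().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 128 MiB sections, as on x86_64. */
#define SECTION_SIZE	(UINT64_C(1) << 27)
/* Same rounding as the kernel's ALIGN() for power-of-two alignments. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* Section-aligned physical range, both ends inclusive. */
struct span {
	uint64_t start;
	uint64_t end;
};

/* Round a resource out to section boundaries, as the patch context does. */
static struct span section_span(uint64_t res_start, uint64_t res_size)
{
	struct span s;
	uint64_t align_start = res_start & ~(SECTION_SIZE - 1);
	uint64_t align_size = ALIGN_UP(res_start + res_size, SECTION_SIZE)
			      - align_start;

	s.start = align_start;
	s.end = align_start + align_size - 1;
	return s;
}

/* True when two section-aligned ranges share at least one section. */
static bool spans_conflict(struct span a, struct span b)
{
	return a.start <= b.end && b.start <= a.end;
}
```

Two 64 MiB namespaces placed back to back inside one 128 MiB section round out to the same span, so the second one would be rejected under this model.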
[ndctl PATCHv3] ndctl: Use max_available_extent for namespace
by Keith Busch
The available_size attribute returns the total of all unused regions, but a
namespace has to use contiguous free regions. This patch uses the
attribute that returns the largest capacity that can be created to
determine whether the namespace can be created.
Signed-off-by: Keith Busch <keith.busch(a)intel.com>
---
v2 -> v3:
Added dbg() message to indicate kernel support
ndctl/lib/libndctl.c | 33 +++++++++++++++++++++++++++++++++
ndctl/lib/libndctl.sym | 1 +
ndctl/libndctl.h | 2 ++
ndctl/namespace.c | 2 +-
4 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index 47e005e..78c70d6 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -2025,6 +2025,39 @@ NDCTL_EXPORT unsigned long long ndctl_region_get_available_size(
return strtoull(buf, NULL, 0);
}
+NDCTL_EXPORT unsigned long long ndctl_region_get_max_available_extent(
+ struct ndctl_region *region)
+{
+ unsigned int nstype = ndctl_region_get_nstype(region);
+ struct ndctl_ctx *ctx = ndctl_region_get_ctx(region);
+ char *path = region->region_buf;
+ int len = region->buf_len;
+ char buf[SYSFS_ATTR_SIZE];
+
+ switch (nstype) {
+ case ND_DEVICE_NAMESPACE_PMEM:
+ case ND_DEVICE_NAMESPACE_BLK:
+ break;
+ default:
+ return 0;
+ }
+
+ if (snprintf(path, len,
+ "%s/max_available_extent", region->region_path) >= len) {
+ err(ctx, "%s: buffer too small!\n",
+ ndctl_region_get_devname(region));
+ return ULLONG_MAX;
+ }
+
+ /* fall back to legacy behavior if max extents is not exported */
+ if (sysfs_read_attr(ctx, path, buf) < 0) {
+ dbg(ctx, "max extents attribute not exported on older kernels\n");
+ return ndctl_region_get_available_size(region);
+ }
+
+ return strtoull(buf, NULL, 0);
+}
+
NDCTL_EXPORT unsigned int ndctl_region_get_range_index(struct ndctl_region *region)
{
return region->range_index;
diff --git a/ndctl/lib/libndctl.sym b/ndctl/lib/libndctl.sym
index c1228e5..22fd026 100644
--- a/ndctl/lib/libndctl.sym
+++ b/ndctl/lib/libndctl.sym
@@ -123,6 +123,7 @@ global:
ndctl_region_get_mappings;
ndctl_region_get_size;
ndctl_region_get_available_size;
+ ndctl_region_get_max_available_extent;
ndctl_region_get_type;
ndctl_region_get_namespace_seed;
ndctl_region_get_btt_seed;
diff --git a/ndctl/libndctl.h b/ndctl/libndctl.h
index be997ac..624115d 100644
--- a/ndctl/libndctl.h
+++ b/ndctl/libndctl.h
@@ -338,6 +338,8 @@ unsigned int ndctl_region_get_interleave_ways(struct ndctl_region *region);
unsigned int ndctl_region_get_mappings(struct ndctl_region *region);
unsigned long long ndctl_region_get_size(struct ndctl_region *region);
unsigned long long ndctl_region_get_available_size(struct ndctl_region *region);
+unsigned long long ndctl_region_get_max_available_extent(
+ struct ndctl_region *region);
unsigned int ndctl_region_get_range_index(struct ndctl_region *region);
unsigned int ndctl_region_get_type(struct ndctl_region *region);
struct ndctl_namespace *ndctl_region_get_namespace_seed(
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index fe86d82..4a562a2 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -764,7 +764,7 @@ static int namespace_create(struct ndctl_region *region)
return -EAGAIN;
}
- available = ndctl_region_get_available_size(region);
+ available = ndctl_region_get_max_available_extent(region);
if (!available || p.size > available) {
debug("%s: insufficient capacity size: %llx avail: %llx\n",
devname, p.size, available);
--
2.14.3
3 years, 11 months
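The difference between the two attributes used in the ndctl patch above comes down to a sum versus a maximum. The following is a minimal sketch, assuming the free extents of a region are already known as a plain array; total_available(), largest_extent() and namespace_fits() are hypothetical helpers, not libndctl functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* What available_size reports in this model: the sum of all free extents. */
static unsigned long long total_available(const unsigned long long *ext,
					  size_t n)
{
	unsigned long long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += ext[i];
	return sum;
}

/* What max_available_extent reports: the largest single free extent. */
static unsigned long long largest_extent(const unsigned long long *ext,
					 size_t n)
{
	unsigned long long max = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (ext[i] > max)
			max = ext[i];
	return max;
}

/* Mirrors the capacity check in namespace_create(). */
static bool namespace_fits(unsigned long long size,
			   unsigned long long available)
{
	return available && size <= available;
}
```

With free extents of 4, 2 and 6 GiB, an 8 GiB request passes the check against the 12 GiB total but fails against the 6 GiB largest extent, which is the case the patch is meant to catch before the kernel is asked to allocate.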
[fstests PATCH 1/2] src/: fix up mmap() error checking
by Ross Zwisler
I noticed that in some of my C tests in src/ I was incorrectly checking for
mmap() failure by looking for NULL instead of MAP_FAILED. Fix those and
clean up some places where we were testing against -1 (the actual value of
MAP_FAILED) which was manually being cast to a pointer.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer.c | 2 +-
src/fstest.c | 2 +-
src/t_ext4_dax_inline_corruption.c | 4 ++--
src/t_ext4_dax_journal_corruption.c | 4 ++--
src/t_mmap_stale_pmd.c | 2 ++
src/t_mmap_writev.c | 2 +-
6 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer.c b/src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer.c
index 092cbb42..af381177 100644
--- a/src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer.c
+++ b/src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer.c
@@ -40,7 +40,7 @@ main(int __attribute__((unused)) argc, char **argv)
void *addr;
addr = mmap(NULL, 4096, PROT_READ, MAP_SHARED|MAP_ANONYMOUS, 0, 0);
- if (!addr) {
+ if (addr == MAP_FAILED) {
perror("mmap");
exit(1);
}
diff --git a/src/fstest.c b/src/fstest.c
index f7e2d3eb..e4b9e081 100644
--- a/src/fstest.c
+++ b/src/fstest.c
@@ -138,7 +138,7 @@ bozo!
exit(1);
}
p = mmap(NULL, file_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- if (p == (char *)-1) {
+ if (p == MAP_FAILED) {
perror("mmap");
exit(1);
}
diff --git a/src/t_ext4_dax_inline_corruption.c b/src/t_ext4_dax_inline_corruption.c
index 4b7d8938..b52bcc0d 100644
--- a/src/t_ext4_dax_inline_corruption.c
+++ b/src/t_ext4_dax_inline_corruption.c
@@ -37,14 +37,14 @@ int main(int argc, char *argv[])
err_exit("fd");
data = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- if (!data)
+ if (data == MAP_FAILED)
err_exit("mmap data");
/* this fallocate turns off inline data and turns on DAX */
fallocate(fd, 0, 0, PAGE(2));
dax_data = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
- if (!dax_data)
+ if (dax_data == MAP_FAILED)
err_exit("mmap dax_data");
/*
diff --git a/src/t_ext4_dax_journal_corruption.c b/src/t_ext4_dax_journal_corruption.c
index 18a2acdc..fccef8f5 100644
--- a/src/t_ext4_dax_journal_corruption.c
+++ b/src/t_ext4_dax_journal_corruption.c
@@ -60,7 +60,7 @@ int main(int argc, char *argv[])
fallocate(fd, 0, 0, len);
dax_data = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
- if (!dax_data)
+ if (dax_data == MAP_FAILED)
err_exit("mmap dax_data");
/*
@@ -76,7 +76,7 @@ int main(int argc, char *argv[])
chattr_cmd(chattr, "+j", file);
data = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- if (!data)
+ if (data == MAP_FAILED)
err_exit("mmap data");
/*
diff --git a/src/t_mmap_stale_pmd.c b/src/t_mmap_stale_pmd.c
index b4472227..6a52201c 100644
--- a/src/t_mmap_stale_pmd.c
+++ b/src/t_mmap_stale_pmd.c
@@ -41,6 +41,8 @@ int main(int argc, char *argv[])
ftruncate(fd, MiB(4));
data = mmap(NULL, MiB(2), PROT_READ, MAP_SHARED, fd, MiB(2));
+ if (data == MAP_FAILED)
+ err_exit("mmap");
/*
* This faults in a 2MiB zero page to satisfy the read.
diff --git a/src/t_mmap_writev.c b/src/t_mmap_writev.c
index e5ca08ab..43acc15f 100644
--- a/src/t_mmap_writev.c
+++ b/src/t_mmap_writev.c
@@ -51,7 +51,7 @@ int main(int argc, char **argv)
if (fd==-1) {perror("open");exit(1);}
base = mmap(NULL,16384,PROT_READ,MAP_SHARED,fd,0);
- if (base == (void *)-1) { perror("mmap");exit(1); }
+ if (base == MAP_FAILED) { perror("mmap");exit(1); }
unlink(new_file);
--
2.14.4
3 years, 11 months
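The check corrected throughout the fstests patch above can be isolated in a small helper. This is a sketch only; map_shared_ro() is a hypothetical name and is not part of fstests. It shows that a failed mmap() has to be detected by comparing against MAP_FAILED, since the failure value is not NULL.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of fd read-only and shared. On success store the
 * address in *out and return 0. On failure leave *out untouched and
 * return the negated errno value.
 */
static int map_shared_ro(int fd, size_t len, void **out)
{
	void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	if (addr == MAP_FAILED)	/* not "if (!addr)" */
		return -errno;

	*out = addr;
	return 0;
}
```

Passing an invalid descriptor such as -1 makes mmap() fail, so the helper returns a negative value. A test written as "if (!addr)" would have missed that failure and gone on to dereference an invalid pointer.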
[PATCH v2 0/2] ext4: fix DAX dma vs truncate/hole-punch
by Ross Zwisler
This series from Dan:
https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.
I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.
The timings in XFS are such that it's difficult to hit this race. Dan
was able to show the race by manually introducing delays in the direct
I/O path.
For ext4, though, it's trivial to hit this race, and a hit will trigger
this WARN_ON_ONCE() in dax_disassociate_entry():
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(). Each of the four paths easily hits this race many
times in my test setup with the xfstest. You can find that test here:
https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
continue looking at these more rare hits.
---
Changes in v2:
* A little cleanup to each patch as suggested by Jan.
* Removed the ext4_break_layouts() call in ext4_truncate_failed_write()
and added a comment instead. (Jan)
* Added reviewed-by tags from Jan.
Ross Zwisler (2):
dax: dax_layout_busy_page() warn on !exceptional
ext4: handle layout changes to pinned DAX mappings
fs/dax.c | 10 +++++++++-
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 12 ++++++++++++
fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/truncate.h | 4 ++++
5 files changed, 72 insertions(+), 1 deletion(-)
--
2.14.4
3 years, 11 months
[PATCH v9 0/3] ndctl, monitor: add ndctl monitor daemon
by QI Fuli
This is the v9 patch for ndctl monitor, a tiny daemon that monitors
the smart events of NVDIMMs. Since NVDIMM does not have a
feature like mirroring, data cannot be restored if the device
breaks down. The ndctl monitor daemon catches the smart event
notifications from firmware and writes them to a logfile, so that
users can replace an NVDIMM before it is completely broken.
Signed-off-by: QI Fuli <qi.fuli(a)jp.fujitsu.com>
---
Change log since v8:
- Adding ndctl_cmd_smart_get_event_flags() api
- Adding monitor_filter_arg to the union in util_filter_ctx
- Removing is_dir()
- Replacing malloc + vsprintf with vasprintf() in log_file() and log_syslog()
- Adding parse_monitor_event()
- Refactoring util_dimm_event_filter()
- Adding event_flags to monitor
- Refactoring dimm_event_to_json()
- Adding check_dimm_supported_threshold_alarms()
- Fixing fail token
Change log since v7:
- Replacing logreport() with log_file() and log_syslog()
- Refactoring read_config_file()
- Replacing set_confile() with parse_config()
- Fixing the ndctl/ndct.conf file
Change log since v6:
- Changing License to GPL-2.0
- Adding event object to output notification
- Adding [--dimm-event] option to filter notification by event type
- Rewriting read_config_file()
- Replacing monitor_dimm_event() with monitor_event()
- Renaming some variables
Change log since v5:
- Fixing systemd unit file cannot be installed bug
- Adding license to ./util/abspath.c
Change log since v4:
- Adding OPTION_FILENAME to make sure filename is correct
- Adding configuration file
- Adding [--config-file] option to override the default configuration
- Making some options support multiple space-separated arguments
- Making systemctl enable ndctl-monitor.service command work
- Making systemctl restart ndctl-monitor.service command work
- Making the directory of systemd unit file to be configurable
- Changing log_file() and log_syslog() to logreport()
- Changing date format in notification to nanoseconds since epoch
- Changing select() to epoll()
- Adding filter_bus() and filter_region()
Change log since v3:
- Removing create-monitor, show-monitor, list-monitor, destroy-monitor
- Adding [--daemon] option to run ndctl monitor as a daemon
- Using systemd to manage ndctl monitor daemon
- Replacing filter_monitor_dimm() with filter_dimm()
Change log since v2:
- Changing the interface of daemon to the ndctl command line
- Changing the name of daemon from "nvdimmd" to "monitor"
- Removing the config file, unit_file, nvdimmd dir
- Removing nvdimmd_test program
- Adding ndctl/monitor.c
Change log since v1:
- Adding a config file(/etc/nvdimmd/nvdimmd.conf)
- Using struct log_ctx instead of syslog()
- Using log_syslog() to save the notify messages to syslog
- Using log_file() to save the notify messages to special file
- Adding LOG_NOTICE level to log_priority
- Using automake instead of Makefile
- Adding a new util file(nvdimmd/util.c) including helper functions
needed for nvdimm daemon
- Adding nvdimmd_test program
QI Fuli (3):
ndctl, monitor: add ndctl monitor
ndctl, monitor: add main ndctl monitor configuration file
ndctl, monitor: add the unit file of systemd for ndctl-monitor service
autogen.sh | 3 +-
builtin.h | 1 +
configure.ac | 22 ++
ndctl/Makefile.am | 12 +-
ndctl/lib/libndctl.sym | 1 +
ndctl/lib/smart.c | 17 +
ndctl/libndctl.h | 6 +
ndctl/monitor.c | 646 ++++++++++++++++++++++++++++++++++++
ndctl/monitor.conf | 41 +++
ndctl/ndctl-monitor.service | 7 +
ndctl/ndctl.c | 1 +
util/filter.h | 9 +
12 files changed, 764 insertions(+), 2 deletions(-)
create mode 100644 ndctl/monitor.c
create mode 100644 ndctl/monitor.conf
create mode 100644 ndctl/ndctl-monitor.service
--
2.18.0
3 years, 11 months
[PATCH 0/2] Namespace creation fixups
by Keith Busch
This is a three-part fixup to the warning that occurs when the available
capacity is fragmented. When this occurs, the user may believe they can
create a larger namespace than is actually possible. This was resulting
in the following kernel warning:
nd_region region0: allocation underrun: 0x0 of 0x1400000000 bytes
WARNING: CPU: 32 PID: 1975 at drivers/nvdimm/namespace_devs.c:913 size_store+0x879/0x8d0 [libnvdimm]
The kernel side of this determines the maximum size by calculating the
largest contiguous extent that can be allocated. If the requested size
exceeds that, an error is returned early instead of reaching the
alarming kernel warning.
To make it possible for the user to know the maximum size it may
request, a new attribute is exported that shows the largest available
extent.
Finally, separate from this series, ndctl is updated to make use of this
new attribute when creating a namespace.
Keith Busch (2):
libnvdimm: Use largest contiguous area for namespace size
libnvdimm: Export max available extent
drivers/nvdimm/dimm_devs.c | 29 +++++++++++++++++++++++++++++
drivers/nvdimm/namespace_devs.c | 2 +-
drivers/nvdimm/nd-core.h | 3 +++
drivers/nvdimm/region_devs.c | 39 +++++++++++++++++++++++++++++++++++++++
4 files changed, 72 insertions(+), 1 deletion(-)
--
2.14.3
3 years, 11 months
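The "largest contiguous extent" that the kernel series above computes can be illustrated with a simple gap scan. This is a simplified sketch under stated assumptions, not the libnvdimm implementation: it models one flat region with allocations that are sorted by start address and do not overlap, and struct alloc and largest_free_extent() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* One existing allocation inside the region. */
struct alloc {
	uint64_t start;
	uint64_t len;
};

/*
 * Return the largest free gap in [region_start, region_start + region_len).
 * Assumes allocs[] is sorted by start and entries do not overlap.
 */
static uint64_t largest_free_extent(uint64_t region_start, uint64_t region_len,
				    const struct alloc *allocs, size_t n)
{
	uint64_t cursor = region_start;
	uint64_t end = region_start + region_len;
	uint64_t best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Gap between the previous allocation and this one. */
		if (allocs[i].start > cursor && allocs[i].start - cursor > best)
			best = allocs[i].start - cursor;
		if (allocs[i].start + allocs[i].len > cursor)
			cursor = allocs[i].start + allocs[i].len;
	}
	/* Trailing gap after the last allocation. */
	if (end > cursor && end - cursor > best)
		best = end - cursor;
	return best;
}
```

A size request larger than the returned value cannot be satisfied contiguously even when the total free capacity is larger, which is the condition the series turns into an early error instead of the "allocation underrun" warning.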
[PATCH v4 00/12] mm: Teach memory_failure() about ZONE_DEVICE pages
by Dan Williams
Changes since v3 [1]:
* Introduce dax_lock_page(), using the radix exceptional entry lock, for
pinning down page->mapping while memory_failure() interrogates the
page. (Jan)
* Collect acks and reviews from Tony and Jan.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
---
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings, however the primary motivation is to allow the system to
survive userspace consumption of hardware-poison via dax. Specifically
the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
---
Dan Williams (12):
device-dax: Convert to vmf_insert_mixed and vm_fault_t
device-dax: Cleanup vm_fault de-reference chains
device-dax: Enable page_mapping()
device-dax: Set page->index
filesystem-dax: Set page->index
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
x86/memory_failure: Introduce {set,clear}_mce_nospec()
mm, memory_failure: Pass page size to kill_proc()
filesystem-dax: Introduce dax_lock_page()
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
libnvdimm, pmem: Restore page attributes when clearing errors
arch/x86/include/asm/set_memory.h | 42 +++++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 ---
arch/x86/kernel/cpu/mcheck/mce.c | 38 +-------
arch/x86/mm/pat.c | 16 +++
drivers/dax/device.c | 97 ++++++++++++--------
drivers/nvdimm/pmem.c | 26 +++++
drivers/nvdimm/pmem.h | 13 +++
fs/dax.c | 92 ++++++++++++++++++-
include/linux/dax.h | 15 +++
include/linux/huge_mm.h | 5 +
include/linux/mm.h | 1
include/linux/set_memory.h | 14 +++
mm/huge_memory.c | 4 -
mm/madvise.c | 18 +++-
mm/memory-failure.c | 143 +++++++++++++++++++++++++++--
15 files changed, 434 insertions(+), 105 deletions(-)
3 years, 11 months