[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- The Valid Fields, Manufacturing Location, and Manufacturing Date
fields are added, carved out of the previously reserved range. There
is no change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch set applies on top of the linux-pm.git 'acpica' branch.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward.

Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also, in cases where both devices are behind a switch, it avoids the CPU entirely.

Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and get to a physical address, taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: doesn't waste system memory for ZONE metadata.
Cons: CPU access to ZONE metadata is slow; metadata may be lost or corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
[resend PATCH v2 00/33] dax: introduce dax_operations
by Dan Williams
[ resend to add dm-devel, linux-block, and fs-devel, apologies for the
duplicates ]
Changes since v1 [1] and the dax-fs RFC [2]:
* rename struct dax_inode to struct dax_device (Christoph)
* rewrite arch_memcpy_to_pmem() in C with inline asm
* use QUEUE_FLAG_WC to gate dax cache management (Jeff)
* add device-mapper plumbing for the ->copy_from_iter() and ->flush()
dax_operations
* kill struct blk_dax_ctl and bdev_direct_access (Christoph)
* cleanup the ->direct_access() calling convention to be page based
(Christoph)
* introduce dax_get_by_host() and don't pollute struct super_block with
dax_device details (Christoph)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
[2]: https://lwn.net/Articles/713064/
---
A few months back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus proposed that the pmem specific
memcpy_to_pmem() routine be moved to be implemented at the driver level
[3]:
"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.
As you point out, it's also fundamentally buggy crap.
Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.
If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of <linux/uaccess.h>."
This feedback also dovetails with another fs/dax.c design wart of being
hard coded to assume the backing device is pmem. We call the pmem
specific copy, clear, and flush routines even if the backing device
driver is one of the other 3 dax drivers (axonram, dccssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it is using volatile memory for storage.
Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power fail. This latter capability is a feature that has
appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
platforms).
So, this series:
1/ moves what was previously named "the pmem api" out of the global
namespace and into drivers that need to be concerned with
architecture specific persistent memory considerations.
2/ arranges for dax to stop abusing __copy_user_nocache() and implements
a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
expanded in the future to use 'movntdqa' if the copy size is above
some threshold, or expanded with support for other architectures [4].
3/ makes cache maintenance optional by arranging for dax to call driver
specific copy and flush operations only if the driver publishes them.
4/ allows filesystem-dax cache management to be controlled by the block
device write-cache queue flag. The pmem driver is updated to clear
that flag by default when pmem is driving volatile memory.
[3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
These patches have been through a round of build regression fixes
notified by the 0day robot. All review welcome, but the patches that
need extra attention are the device-mapper and uio changes
(copy_from_iter_ops).
This series is based on a merge of char-misc-next (for cdev api reworks)
and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).
---
Dan Williams (33):
device-dax: rename 'dax_dev' to 'dev_dax'
dax: refactor dax-fs into a generic provider of 'struct dax_device' instances
dax: add a facility to lookup a dax device by 'host' device name
dax: introduce dax_operations
pmem: add dax_operations support
axon_ram: add dax_operations support
brd: add dax_operations support
dcssblk: add dax_operations support
block: kill bdev_dax_capable()
dax: introduce dax_direct_access()
dm: add dax_device and dax_operations support
dm: teach dm-targets to use a dax_device + dax_operations
ext2, ext4, xfs: retrieve dax_device for iomap operations
Revert "block: use DAX for partition table reads"
filesystem-dax: convert to dax_direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
block: remove block_device_operations ->direct_access()
x86, dax, pmem: remove indirection around memcpy_from_pmem()
dax, pmem: introduce 'copy_from_iter' dax operation
dm: add ->copy_from_iter() dax operation support
filesystem-dax: convert to dax_copy_from_iter()
dax, pmem: introduce an optional 'flush' dax_operation
dm: add ->flush() dax operation support
filesystem-dax: convert to dax_flush()
x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
x86, libnvdimm, dax: stop abusing __copy_user_nocache
uio, libnvdimm, pmem: implement cache bypass for all copy_from_iter() operations
libnvdimm, pmem: fix persistence warning
libnvdimm, nfit: enable support for volatile ranges
filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
MAINTAINERS | 2
arch/powerpc/platforms/Kconfig | 1
arch/powerpc/sysdev/axonram.c | 45 +++-
arch/x86/Kconfig | 1
arch/x86/include/asm/pmem.h | 141 ------------
arch/x86/include/asm/string_64.h | 1
block/Kconfig | 1
block/partition-generic.c | 17 -
drivers/Makefile | 2
drivers/acpi/nfit/core.c | 15 +
drivers/block/Kconfig | 1
drivers/block/brd.c | 52 +++-
drivers/dax/Kconfig | 10 +
drivers/dax/Makefile | 5
drivers/dax/dax.h | 15 -
drivers/dax/device-dax.h | 25 ++
drivers/dax/device.c | 415 +++++++++++------------------------
drivers/dax/pmem.c | 10 -
drivers/dax/super.c | 445 ++++++++++++++++++++++++++++++++++++++
drivers/md/Kconfig | 1
drivers/md/dm-core.h | 1
drivers/md/dm-linear.c | 53 ++++-
drivers/md/dm-snap.c | 6 -
drivers/md/dm-stripe.c | 65 ++++--
drivers/md/dm-target.c | 6 -
drivers/md/dm.c | 112 ++++++++--
drivers/nvdimm/Kconfig | 6 +
drivers/nvdimm/Makefile | 1
drivers/nvdimm/bus.c | 10 -
drivers/nvdimm/claim.c | 9 -
drivers/nvdimm/core.c | 2
drivers/nvdimm/dax_devs.c | 2
drivers/nvdimm/dimm_devs.c | 2
drivers/nvdimm/namespace_devs.c | 9 -
drivers/nvdimm/nd-core.h | 9 +
drivers/nvdimm/pfn_devs.c | 4
drivers/nvdimm/pmem.c | 82 +++++--
drivers/nvdimm/pmem.h | 26 ++
drivers/nvdimm/region_devs.c | 39 ++-
drivers/nvdimm/x86.c | 155 +++++++++++++
drivers/s390/block/Kconfig | 1
drivers/s390/block/dcssblk.c | 44 +++-
fs/block_dev.c | 117 +++-------
fs/dax.c | 302 ++++++++++++++------------
fs/ext2/inode.c | 9 +
fs/ext4/inode.c | 9 +
fs/iomap.c | 3
fs/xfs/xfs_iomap.c | 10 +
include/linux/blkdev.h | 19 --
include/linux/dax.h | 43 +++-
include/linux/device-mapper.h | 14 +
include/linux/iomap.h | 1
include/linux/libnvdimm.h | 10 +
include/linux/pmem.h | 165 --------------
include/linux/string.h | 8 +
include/linux/uio.h | 4
lib/Kconfig | 6 -
lib/iov_iter.c | 25 ++
tools/testing/nvdimm/Kbuild | 11 +
tools/testing/nvdimm/pmem-dax.c | 21 +-
60 files changed, 1584 insertions(+), 1042 deletions(-)
delete mode 100644 arch/x86/include/asm/pmem.h
create mode 100644 drivers/dax/device-dax.h
rename drivers/dax/{dax.c => device.c} (60%)
create mode 100644 drivers/dax/super.c
create mode 100644 drivers/nvdimm/x86.c
delete mode 100644 include/linux/pmem.h
[PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
by Vishal Verma
The check for an MCE being a memory error in the NFIT mce handler was
bogus. Fix it to check for the correct MCA status compound error code.
Reported-by: Tony Luck <tony.luck(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
drivers/acpi/nfit/mce.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index 3ba1c34..23e12a0 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
struct nfit_spa *nfit_spa;
/* We only care about memory errors */
- if (!(mce->status & MCACOD))
+ if (!(mce->status & 0xef80) == BIT(7))
return NOTIFY_DONE;
/*
--
2.9.3
[PATCH] nvdimm: Export supported alignments via sysfs
by Oliver O'Halloran
Adds two new sysfs attributes for pfn (and dax) devices:
supported_alignments and default_alignment. These advertise to
userspace which alignments this kernel supports, and provide a nominal
default alignment to use.
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
---
I'm not sure it makes sense to provide these for pfn devices. In the dax
case we have hard restrictions because of how fault handling works, but
I'm not convinced this makes sense for the pfn case since it's going to
be used with fs-dax.
---
drivers/nvdimm/pfn_devs.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6c033c9a2f06..5157e7d89f0b 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -260,6 +260,30 @@ static ssize_t size_show(struct device *dev,
}
static DEVICE_ATTR_RO(size);
+static ssize_t supported_alignments_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ /* Fun fact: These aren't always constants! */
+ unsigned long supported_alignments[] = {
+ PAGE_SIZE,
+ HPAGE_PMD_SIZE,
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ HPAGE_PUD_SIZE,
+#endif
+ 0,
+ };
+
+ return nd_sector_size_show(0, supported_alignments, buf);
+}
+DEVICE_ATTR_RO(supported_alignments);
+
+static ssize_t default_alignment_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%ld\n", HPAGE_PMD_SIZE);
+}
+DEVICE_ATTR_RO(default_alignment);
+
static struct attribute *nd_pfn_attributes[] = {
&dev_attr_mode.attr,
&dev_attr_namespace.attr,
@@ -267,6 +291,8 @@ static struct attribute *nd_pfn_attributes[] = {
&dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_size.attr,
+ &dev_attr_supported_alignments.attr,
+ &dev_attr_default_alignment.attr,
NULL,
};
--
2.9.3
[PATCH] nvdimm, btt: make sure initializing new metadata clears poison
by Vishal Verma
If we had badblocks/poison in the metadata area of a BTT, recreating the
BTT would not clear the poison in all cases, notably the flog area. This
is because rw_bytes will only clear errors if the request being sent
down is 512B aligned and sized.
Make sure that when writing the map and info blocks, the rw_bytes being
sent are of the correct size/alignment. For the flog, instead of doing
the smaller log_entry writes only, first do a 'wipe' of the entire area
by writing zeroes in large enough chunks so that errors get cleared.
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
drivers/nvdimm/btt.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 47 insertions(+), 7 deletions(-)
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 368795a..6054e83 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -57,6 +57,14 @@ static int btt_info_write(struct arena_info *arena, struct btt_sb *super)
{
int ret;
+ /*
+ * infooff and info2off should always be at least 512B aligned.
+ * We rely on that to make sure rw_bytes does error clearing
+ * correctly, so make sure that is the case.
+ */
+ WARN_ON_ONCE(!IS_ALIGNED(arena->infooff, 512));
+ WARN_ON_ONCE(!IS_ALIGNED(arena->info2off, 512));
+
ret = arena_write_bytes(arena, arena->info2off, super,
sizeof(struct btt_sb));
if (ret)
@@ -393,9 +401,17 @@ static int btt_map_init(struct arena_info *arena)
if (!zerobuf)
return -ENOMEM;
+ /*
+ * mapoff should always be at least 512B aligned. We rely on that to
+ * make sure rw_bytes does error clearing correctly, so make sure that
+ * is the case.
+ */
+ WARN_ON_ONCE(!IS_ALIGNED(arena->mapoff, 512));
+
while (mapsize) {
size_t size = min(mapsize, chunk_size);
+ WARN_ON_ONCE(size < 512);
ret = arena_write_bytes(arena, arena->mapoff + offset, zerobuf,
size);
if (ret)
@@ -417,11 +433,36 @@ static int btt_map_init(struct arena_info *arena)
*/
static int btt_log_init(struct arena_info *arena)
{
+ size_t logsize = arena->info2off - arena->logoff;
+ size_t chunk_size = SZ_4K, offset = 0;
+ struct log_entry log;
+ void *zerobuf;
int ret;
u32 i;
- struct log_entry log, zerolog;
- memset(&zerolog, 0, sizeof(zerolog));
+ zerobuf = kzalloc(chunk_size, GFP_KERNEL);
+ if (!zerobuf)
+ return -ENOMEM;
+ /*
+ * logoff should always be at least 512B aligned. We rely on that to
+ * make sure rw_bytes does error clearing correctly, so make sure that
+ * is the case.
+ */
+ WARN_ON_ONCE(!IS_ALIGNED(arena->logoff, 512));
+
+ while (logsize) {
+ size_t size = min(logsize, chunk_size);
+
+ WARN_ON_ONCE(size < 512);
+ ret = arena_write_bytes(arena, arena->logoff + offset, zerobuf,
+ size);
+ if (ret)
+ goto free;
+
+ offset += size;
+ logsize -= size;
+ cond_resched();
+ }
for (i = 0; i < arena->nfree; i++) {
log.lba = cpu_to_le32(i);
@@ -430,13 +471,12 @@ static int btt_log_init(struct arena_info *arena)
log.seq = cpu_to_le32(LOG_SEQ_INIT);
ret = __btt_log_write(arena, i, 0, &log);
if (ret)
- return ret;
- ret = __btt_log_write(arena, i, 1, &zerolog);
- if (ret)
- return ret;
+ goto free;
}
- return 0;
+ free:
+ kfree(zerobuf);
+ return ret;
}
static int btt_freelist_init(struct arena_info *arena)
--
2.9.3
[NAK] copy_from_iter_ops()
by Al Viro
I should have looked and commented earlier, but I hadn't spotted
that thing until -next conflicts had shown up. As the matter of fact,
I don't have this series in my mailbox - it had been Cc'd my way, apparently,
but it looks like it never made it there, so I'm posting from scratch instead
of replying. Sorry.
The following "primitive" is complete crap
+#ifdef CONFIG_COPY_FROM_ITER_OPS
+size_t copy_from_iter_ops(void *addr, size_t bytes, struct iov_iter *i,
+ int (*user)(void *, const void __user *, unsigned),
+ void (*page)(char *, struct page *, size_t, size_t),
+ void (*copy)(void *, void *, unsigned))
+{
+ char *to = addr;
+
+ if (unlikely(i->type & ITER_PIPE)) {
+ WARN_ON(1);
+ return 0;
+ }
+ iterate_and_advance(i, bytes, v,
+ user((to += v.iov_len) - v.iov_len, v.iov_base,
+ v.iov_len),
+ page((to += v.bv_len) - v.bv_len, v.bv_page, v.bv_offset,
+ v.bv_len),
+ copy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+ )
+
+ return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_ops);
+#endif
1) Every time we get a new copy-from flavour of iov_iter, you will
need an extra argument and every caller will need to be updated.
2) If it's a general-purpose primitive, it should *not* be
behind a CONFIG_<whatever> to be selected by callers. If it isn't,
it shouldn't be there at all, period. And no, EXPORT_SYMBOL_GPL doesn't
make it any better.
3) The caller makes very little sense. Is that thing meant to
be x86-only? What are the requirements regarding writeback? Is that thing
just go-fast stripes, or...? Basically, all questions asked in the December
thread (memcpy_nocache()) still apply.
I strongly object to that interface. Let's figure out what's
really needed for your copy_from_iter_pmem() and bloody put the
iterator-related part (without the callbacks, etc.) into lib/iov_iter.c
With memcpy_to_pmem() and pmem_from_user() used by it.
Incidentally, your fallback for memcpy_to_pmem() is... odd.
It used to be "just use memcpy()" and now it's "just do nothing". What
the hell? If it's really "you should not use that if you don't have
arch-specific variant", let it at least BUG(), if not fail to link.
On the uaccess side, should pmem_from_user() zero what it had failed
to copy? And for !@#!@# sake, comments like this
+ * On x86_64 __copy_from_user_nocache() uses non-temporal stores
+ * for the bulk of the transfer, but we need to manually flush
+ * if the transfer is unaligned. A cached memory copy is used
+ * when destination or size is not naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
mean only one thing: this should live in arch/x86/lib/usercopy_64.c,
right next to the actual function that does copying. NOT in
drivers/nvdimm/x86.c. At the very least it needs a comment in usercopy_64.c
with dire warnings along the lines of "don't touch that code without
looking into <filename>:pmem_from_user()"...
[PATCH 1/2] dax: prevent invalidation of mapped DAX entries
by Ross Zwisler
dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked. This is done via:
invalidate_mapping_pages()
invalidate_exceptional_entry()
dax_invalidate_mapping_entry()
However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criteria which is that the page must not be mapped. This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().
For DAX entries this means that we can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.
We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [4.10+]
---
This series applies cleanly to the current v4.11-rc7 based linux/master,
and has passed an xfstests run with DAX on ext4 and XFS.
These patches also apply to v4.10.9 with a little work from the 3-way
merge feature.
fs/dax.c | 29 -----------------------------
include/linux/dax.h | 1 -
mm/truncate.c | 9 +++------
3 files changed, 3 insertions(+), 36 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 85abd74..166504c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
}
/*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
- int ret = 0;
- void *entry, **slot;
- struct radix_tree_root *page_tree = &mapping->page_tree;
-
- spin_lock_irq(&mapping->tree_lock);
- entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
- if (!entry || !radix_tree_exceptional_entry(entry) ||
- slot_locked(mapping, slot))
- goto out;
- if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
- radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
- goto out;
- radix_tree_delete(page_tree, index);
- mapping->nrexceptional--;
- ret = 1;
-out:
- spin_unlock_irq(&mapping->tree_lock);
- if (ret)
- dax_wake_mapping_entry_waiter(mapping, index, entry, true);
- return ret;
-}
-
-/*
* Invalidate exceptional DAX entry if it is clean.
*/
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d8a3dc0..f8e1833 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
const struct iomap_ops *ops);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 6263aff..c537184 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
/*
* Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
*/
static int invalidate_exceptional_entry(struct address_space *mapping,
pgoff_t index, void *entry)
{
- /* Handled by shmem itself */
- if (shmem_mapping(mapping))
+ /* Handled by shmem itself, or for DAX we do nothing. */
+ if (shmem_mapping(mapping) || dax_mapping(mapping))
return 1;
- if (dax_mapping(mapping))
- return dax_invalidate_mapping_entry(mapping, index);
clear_shadow_entry(mapping, index, entry);
return 1;
}
--
2.9.3
task ndctl:5155 blocked for more than 120 seconds observed during pmem/btt/dax switch test
by Yi Zhang
Hello
I reproduced the ndctl blocked-task issue on 4.11.0-rc8. Here are the reproduce steps and the kernel log; could you help check it? Thanks.
Reproduce steps:
function pmem_btt_dax_switch() {
        sector_size_list="512 520 528 4096 4104 4160 4224"
        for sector_size in $sector_size_list; do
                ndctl create-namespace -f -e namespace${1}.0 --mode=sector -l $sector_size
                ndctl create-namespace -f -e namespace${1}.0 --mode=raw
                ndctl create-namespace -f -e namespace${1}.0 --mode=dax
        done
}

for i in 0 1 2 3; do
        pmem_btt_dax_switch $i &
done
kernel log:
[ 6026.482747] INFO: task ndctl:5155 blocked for more than 120 seconds.
[ 6026.514573] Not tainted 4.11.0-rc8 #1
[ 6026.535467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6026.573932] ndctl D 0 5155 5154 0x00000080
[ 6026.600220] Call Trace:
[ 6026.611766] __schedule+0x289/0x8f0
[ 6026.628026] schedule+0x36/0x80
[ 6026.642725] schedule_preempt_disabled+0xe/0x10
[ 6026.663804] __mutex_lock.isra.8+0x266/0x500
[ 6026.683820] ? mntput+0x24/0x40
[ 6026.698596] __mutex_lock_slowpath+0x13/0x20
[ 6026.718558] mutex_lock+0x2f/0x40
[ 6026.734046] region_size_show+0x20/0x70 [dax]
[ 6026.754563] dev_attr_show+0x20/0x50
[ 6026.771246] ? mutex_lock+0x12/0x40
[ 6026.787201] sysfs_kf_seq_show+0xbf/0x1a0
[ 6026.805510] kernfs_seq_show+0x21/0x30
[ 6026.823174] seq_read+0x115/0x390
[ 6026.838263] ? do_filp_open+0xa5/0x100
[ 6026.855906] kernfs_fop_read+0xff/0x180
[ 6026.873983] __vfs_read+0x37/0x150
[ 6026.889786] ? security_file_permission+0x9d/0xc0
[ 6026.911642] vfs_read+0x8c/0x130
[ 6026.926874] SyS_read+0x55/0xc0
[ 6026.941636] do_syscall_64+0x67/0x180
[ 6026.959003] entry_SYSCALL64_slow_path+0x25/0x25
[ 6026.980692] RIP: 0033:0x7f24eba9c7e0
[ 6026.999534] RSP: 002b:00007fff94cbb658 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 6027.035833] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f24eba9c7e0
[ 6027.071099] RDX: 0000000000000400 RSI: 00007fff94cbb680 RDI: 0000000000000004
[ 6027.106350] RBP: 0000000001d784e0 R08: 00007f24eb9fb988 R09: 0000000000000027
[ 6027.141119] R10: 000000000000000a R11: 0000000000000246 R12: 00007fff94cbb680
[ 6027.175009] R13: 0000000001d73270 R14: 00007fff94cbb680 R15: 0000000001d7b333
[ 6027.208899] INFO: task ndctl:5164 blocked for more than 120 seconds.
[ 6027.238487] Not tainted 4.11.0-rc8 #1
[ 6027.258025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6027.296084] ndctl D 0 5164 5163 0x00000080
[ 6027.321726] Call Trace:
[ 6027.333199] __schedule+0x289/0x8f0
[ 6027.349688] schedule+0x36/0x80
[ 6027.364463] schedule_preempt_disabled+0xe/0x10
[ 6027.385667] __mutex_lock.isra.8+0x266/0x500
[ 6027.405824] ? refcount_dec_and_test+0x11/0x20
[ 6027.426656] ? wait_probe_show+0x70/0x70 [libnvdimm]
[ 6027.449966] __mutex_lock_slowpath+0x13/0x20
[ 6027.470000] mutex_lock+0x2f/0x40
[ 6027.485369] flush_regions_dimms+0x1b/0x40 [libnvdimm]
[ 6027.509549] device_for_each_child+0x50/0x90
[ 6027.529466] wait_probe_show+0x46/0x70 [libnvdimm]
[ 6027.551543] dev_attr_show+0x20/0x50
[ 6027.569666] ? mutex_lock+0x12/0x40
[ 6027.586494] sysfs_kf_seq_show+0xbf/0x1a0
[ 6027.607243] kernfs_seq_show+0x21/0x30
[ 6027.625886] seq_read+0x115/0x390
[ 6027.641497] ? do_filp_open+0xa5/0x100
[ 6027.659110] kernfs_fop_read+0xff/0x180
[ 6027.677120] __vfs_read+0x37/0x150
[ 6027.692972] ? security_file_permission+0x9d/0xc0
[ 6027.714948] vfs_read+0x8c/0x130
[ 6027.730083] SyS_read+0x55/0xc0
[ 6027.745087] do_syscall_64+0x67/0x180
[ 6027.762273] entry_SYSCALL64_slow_path+0x25/0x25
[ 6027.784092] RIP: 0033:0x7f08e08527e0
[ 6027.800715] RSP: 002b:00007fff5ffcd358 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 6027.836082] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f08e08527e0
[ 6027.869667] RDX: 0000000000000400 RSI: 00007fff5ffcd380 RDI: 0000000000000003
[ 6027.904697] RBP: 0000000000000000 R08: 00007f08e07b1988 R09: 0000000000000046
[ 6027.938016] R10: 0000000000000046 R11: 0000000000000246 R12: 00007fff5ffcd380
[ 6027.970932] R13: 0000000000000000 R14: 0000000000001388 R15: 00007fff5ffcd380
[ 6028.004331] INFO: task ndctl:5172 blocked for more than 120 seconds.
[ 6028.034311] Not tainted 4.11.0-rc8 #1
[ 6028.053796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6028.092317] ndctl D 0 5172 5171 0x00000080
[ 6028.120694] Call Trace:
[ 6028.132134] __schedule+0x289/0x8f0
[ 6028.148496] schedule+0x36/0x80
[ 6028.163221] schedule_preempt_disabled+0xe/0x10
[ 6028.184498] __mutex_lock.isra.8+0x266/0x500
[ 6028.204502] ? refcount_dec_and_test+0x11/0x20
[ 6028.225383] ? wait_probe_show+0x70/0x70 [libnvdimm]
[ 6028.248818] __mutex_lock_slowpath+0x13/0x20
[ 6028.268915] mutex_lock+0x2f/0x40
[ 6028.284572] flush_regions_dimms+0x1b/0x40 [libnvdimm]
[ 6028.308483] device_for_each_child+0x50/0x90
[ 6028.328625] wait_probe_show+0x46/0x70 [libnvdimm]
[ 6028.351106] dev_attr_show+0x20/0x50
[ 6028.367457] ? mutex_lock+0x12/0x40
[ 6028.383180] sysfs_kf_seq_show+0xbf/0x1a0
[ 6028.401459] kernfs_seq_show+0x21/0x30
[ 6028.418997] seq_read+0x115/0x390
[ 6028.434451] ? do_filp_open+0xa5/0x100
[ 6028.451975] kernfs_fop_read+0xff/0x180
[ 6028.469849] __vfs_read+0x37/0x150
[ 6028.485746] ? security_file_permission+0x9d/0xc0
[ 6028.507435] vfs_read+0x8c/0x130
[ 6028.522452] SyS_read+0x55/0xc0
[ 6028.537079] do_syscall_64+0x67/0x180
[ 6028.554153] entry_SYSCALL64_slow_path+0x25/0x25
[ 6028.575778] RIP: 0033:0x7eff768387e0
[ 6028.592970] RSP: 002b:00007ffcf5367668 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 6028.631343] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007eff768387e0
[ 6028.664793] RDX: 0000000000000400 RSI: 00007ffcf5367690 RDI: 0000000000000003
[ 6028.698191] RBP: 0000000000000000 R08: 00007eff76797988 R09: 0000000000000046
[ 6028.731690] R10: 0000000000000046 R11: 0000000000000246 R12: 00007ffcf5367690
[ 6028.765029] R13: 0000000000000000 R14: 0000000000001388 R15: 00007ffcf5367690
[ 6028.798470] INFO: task ndctl:5180 blocked for more than 120 seconds.
[ 6028.828412] Not tainted 4.11.0-rc8 #1
[ 6028.848058] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6028.884846] ndctl D 0 5180 5179 0x00000080
[ 6028.910311] Call Trace:
[ 6028.921891] __schedule+0x289/0x8f0
[ 6028.938354] schedule+0x36/0x80
[ 6028.952914] __kernfs_remove+0x169/0x220
[ 6028.971210] ? remove_wait_queue+0x60/0x60
[ 6028.990431] kernfs_remove_by_name_ns+0x43/0xa0
[ 6029.011866] remove_files.isra.1+0x36/0x70
[ 6029.032520] sysfs_remove_group+0x44/0x90
[ 6029.051185] sysfs_remove_groups+0x2e/0x50
[ 6029.070831] dax_region_unregister+0x21/0x40 [dax]
[ 6029.093260] devm_action_release+0xf/0x20
[ 6029.113529] release_nodes+0x218/0x260
[ 6029.132924] devres_release_all+0x3c/0x60
[ 6029.152249] device_release_driver_internal+0x151/0x1f0
[ 6029.176701] device_release_driver+0x12/0x20
[ 6029.196651] unbind_store+0xba/0xe0
[ 6029.213026] drv_attr_store+0x24/0x30
[ 6029.229987] sysfs_kf_write+0x3a/0x50
[ 6029.247412] kernfs_fop_write+0xff/0x180
[ 6029.265909] __vfs_write+0x37/0x160
[ 6029.282231] ? selinux_file_permission+0xe5/0x120
[ 6029.304504] ? security_file_permission+0x3b/0xc0
[ 6029.326647] vfs_write+0xb2/0x1b0
[ 6029.341929] ? syscall_trace_enter+0x1d0/0x2b0
[ 6029.362863] SyS_write+0x55/0xc0
[ 6029.377955] do_syscall_64+0x67/0x180
[ 6029.395080] entry_SYSCALL64_slow_path+0x25/0x25
[ 6029.416677] RIP: 0033:0x7f83a79b7840
[ 6029.433311] RSP: 002b:00007ffca25e4198 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6029.468729] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f83a79b7840
[ 6029.502221] RDX: 0000000000000007 RSI: 00000000016deb90 RDI: 0000000000000003
[ 6029.535277] RBP: 00000000016deb90 R08: 00007f83a7916988 R09: 0000000000000046
[ 6029.568341] R10: 00007ffca25e3eb0 R11: 0000000000000246 R12: 0000000000000007
[ 6029.601701] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000002
# ps aux | grep ndctl
root 5155 0.0 0.0 41576 3044 pts/0 D+ 10:53 0:00 ndctl create-namespace -f -e namespace2.0 --mode=dax
root 5164 0.0 0.0 41576 3040 pts/0 D+ 10:53 0:00 ndctl create-namespace -f -e namespace0.0 --mode=dax
root 5172 0.1 0.0 41576 3024 pts/0 D+ 10:53 0:00 ndctl create-namespace -f -e namespace3.0 --mode=dax
root 5180 0.0 0.0 41576 3036 pts/0 D+ 10:53 0:00 ndctl create-namespace -f -e namespace1.0 --mode=sector -l 528
# cat /proc/5155/stack
[<ffffffffc096f320>] region_size_show+0x20/0x70 [dax]
[<ffffffffbeae2fb0>] dev_attr_show+0x20/0x50
[<ffffffffbe8ca08f>] sysfs_kf_seq_show+0xbf/0x1a0
[<ffffffffbe8c8741>] kernfs_seq_show+0x21/0x30
[<ffffffffbe866f65>] seq_read+0x115/0x390
[<ffffffffbe8c8ebf>] kernfs_fop_read+0xff/0x180
[<ffffffffbe83ebe7>] __vfs_read+0x37/0x150
[<ffffffffbe83fb2c>] vfs_read+0x8c/0x130
[<ffffffffbe841105>] SyS_read+0x55/0xc0
[<ffffffffbe603a47>] do_syscall_64+0x67/0x180
[<ffffffffbed5602b>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/5164/stack
[<ffffffffc0bf720b>] flush_regions_dimms+0x1b/0x40 [libnvdimm]
[<ffffffffbeae2b30>] device_for_each_child+0x50/0x90
[<ffffffffc0bf71c6>] wait_probe_show+0x46/0x70 [libnvdimm]
[<ffffffffbeae2fb0>] dev_attr_show+0x20/0x50
[<ffffffffbe8ca08f>] sysfs_kf_seq_show+0xbf/0x1a0
[<ffffffffbe8c8741>] kernfs_seq_show+0x21/0x30
[<ffffffffbe866f65>] seq_read+0x115/0x390
[<ffffffffbe8c8ebf>] kernfs_fop_read+0xff/0x180
[<ffffffffbe83ebe7>] __vfs_read+0x37/0x150
[<ffffffffbe83fb2c>] vfs_read+0x8c/0x130
[<ffffffffbe841105>] SyS_read+0x55/0xc0
[<ffffffffbe603a47>] do_syscall_64+0x67/0x180
[<ffffffffbed5602b>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/5172/stack
[<ffffffffc0bf720b>] flush_regions_dimms+0x1b/0x40 [libnvdimm]
[<ffffffffbeae2b30>] device_for_each_child+0x50/0x90
[<ffffffffc0bf71c6>] wait_probe_show+0x46/0x70 [libnvdimm]
[<ffffffffbeae2fb0>] dev_attr_show+0x20/0x50
[<ffffffffbe8ca08f>] sysfs_kf_seq_show+0xbf/0x1a0
[<ffffffffbe8c8741>] kernfs_seq_show+0x21/0x30
[<ffffffffbe866f65>] seq_read+0x115/0x390
[<ffffffffbe8c8ebf>] kernfs_fop_read+0xff/0x180
[<ffffffffbe83ebe7>] __vfs_read+0x37/0x150
[<ffffffffbe83fb2c>] vfs_read+0x8c/0x130
[<ffffffffbe841105>] SyS_read+0x55/0xc0
[<ffffffffbe603a47>] do_syscall_64+0x67/0x180
[<ffffffffbed5602b>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/5180/stack
[<ffffffffbe8c7669>] __kernfs_remove+0x169/0x220
[<ffffffffbe8c8523>] kernfs_remove_by_name_ns+0x43/0xa0
[<ffffffffbe8cad26>] remove_files.isra.1+0x36/0x70
[<ffffffffbe8cb0e4>] sysfs_remove_group+0x44/0x90
[<ffffffffbe8cb1de>] sysfs_remove_groups+0x2e/0x50
[<ffffffffc09700a1>] dax_region_unregister+0x21/0x40 [dax]
[<ffffffffbeaec2ef>] devm_action_release+0xf/0x20
[<ffffffffbeaed038>] release_nodes+0x218/0x260
[<ffffffffbeaed28c>] devres_release_all+0x3c/0x60
[<ffffffffbeae8d71>] device_release_driver_internal+0x151/0x1f0
[<ffffffffbeae8e22>] device_release_driver+0x12/0x20
[<ffffffffbeae6a3a>] unbind_store+0xba/0xe0
[<ffffffffbeae6034>] drv_attr_store+0x24/0x30
[<ffffffffbe8c9c3a>] sysfs_kf_write+0x3a/0x50
[<ffffffffbe8c971f>] kernfs_fop_write+0xff/0x180
[<ffffffffbe83ed37>] __vfs_write+0x37/0x160
[<ffffffffbe83fc82>] vfs_write+0xb2/0x1b0
[<ffffffffbe8411c5>] SyS_write+0x55/0xc0
[<ffffffffbe603a47>] do_syscall_64+0x67/0x180
[<ffffffffbed5602b>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
Best Regards,
Yi Zhang
[PATCH] libnvdimm: rework region badblocks clearing
by Dan Williams
Toshi noticed that the new support for region-level badblocks missed
the case where errors are cleared due to BTT I/O.
An initial attempt to fix this ran into a "sleeping while atomic"
warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
However, that lock is not needed since we are not acting on any data
that is subject to change due to a change of state of the bus / region.
The badblocks instance has its own internal lock to handle mutations of
the error list.
So, to make it clear that we are just acting on region devices and don't
need the lock, rename __nvdimm_bus_badblocks_clear() to
nvdimm_clear_badblocks_regions(). Eliminate the lock and consolidate all
routines in drivers/nvdimm/bus.c. Also, make some cleanups to remove
unnecessary casts, make the calling convention of
nvdimm_clear_badblocks_regions() clearer by replacing struct resource
with the minimal struct clear_badblocks_context, and use the DEVICE_ATTR
macro.
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Reported-by: Toshi Kani <toshi.kani(a)hpe.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/nvdimm/bus.c | 76 ++++++++++++++++++++++++++++++------------
drivers/nvdimm/region.c | 25 --------------
drivers/nvdimm/region_devs.c | 15 +++-----
include/linux/libnvdimm.h | 3 --
4 files changed, 59 insertions(+), 60 deletions(-)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 43ddfd487c85..e9361bffe5ee 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -172,6 +172,57 @@ void nvdimm_region_notify(struct nd_region *nd_region, enum nvdimm_event event)
}
EXPORT_SYMBOL_GPL(nvdimm_region_notify);
+struct clear_badblocks_context {
+ resource_size_t phys, cleared;
+};
+
+static int nvdimm_clear_badblocks_region(struct device *dev, void *data)
+{
+ struct clear_badblocks_context *ctx = data;
+ struct nd_region *nd_region;
+ resource_size_t ndr_end;
+ sector_t sector;
+
+ /* make sure device is a region */
+ if (!is_nd_pmem(dev))
+ return 0;
+
+ nd_region = to_nd_region(dev);
+ ndr_end = nd_region->ndr_start + nd_region->ndr_size - 1;
+
+ /* make sure we are in the region */
+ if (ctx->phys < nd_region->ndr_start
+ || (ctx->phys + ctx->cleared) > ndr_end)
+ return 0;
+
+ sector = (ctx->phys - nd_region->ndr_start) / 512;
+ badblocks_clear(&nd_region->bb, sector, ctx->cleared / 512);
+
+ return 0;
+}
+
+static void nvdimm_clear_badblocks_regions(struct nvdimm_bus *nvdimm_bus,
+ phys_addr_t phys, u64 cleared)
+{
+ struct clear_badblocks_context ctx = {
+ .phys = phys,
+ .cleared = cleared,
+ };
+
+ device_for_each_child(&nvdimm_bus->dev, &ctx,
+ nvdimm_clear_badblocks_region);
+}
+
+static void nvdimm_account_cleared_poison(struct nvdimm_bus *nvdimm_bus,
+ phys_addr_t phys, u64 cleared)
+{
+ if (cleared > 0)
+ nvdimm_forget_poison(nvdimm_bus, phys, cleared);
+
+ if (cleared > 0 && cleared / 512)
+ nvdimm_clear_badblocks_regions(nvdimm_bus, phys, cleared);
+}
+
long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
unsigned int len)
{
@@ -219,22 +270,12 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
if (cmd_rc < 0)
return cmd_rc;
- if (clear_err.cleared > 0)
- nvdimm_forget_poison(nvdimm_bus, phys, clear_err.cleared);
+ nvdimm_account_cleared_poison(nvdimm_bus, phys, clear_err.cleared);
return clear_err.cleared;
}
EXPORT_SYMBOL_GPL(nvdimm_clear_poison);
-void __nvdimm_bus_badblocks_clear(struct nvdimm_bus *nvdimm_bus,
- struct resource *res)
-{
- lockdep_assert_held(&nvdimm_bus->reconfig_mutex);
- device_for_each_child(&nvdimm_bus->dev, (void *)res,
- nvdimm_region_badblocks_clear);
-}
-EXPORT_SYMBOL_GPL(__nvdimm_bus_badblocks_clear);
-
static int nvdimm_bus_match(struct device *dev, struct device_driver *drv);
static struct bus_type nvdimm_bus_type = {
@@ -989,18 +1030,9 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR && cmd_rc >= 0) {
struct nd_cmd_clear_error *clear_err = buf;
- struct resource res;
-
- if (clear_err->cleared) {
- /* clearing the poison list we keep track of */
- nvdimm_forget_poison(nvdimm_bus, clear_err->address,
- clear_err->cleared);
- /* now sync the badblocks lists */
- res.start = clear_err->address;
- res.end = clear_err->address + clear_err->cleared - 1;
- __nvdimm_bus_badblocks_clear(nvdimm_bus, &res);
- }
+ nvdimm_account_cleared_poison(nvdimm_bus, clear_err->address,
+ clear_err->cleared);
}
nvdimm_bus_unlock(&nvdimm_bus->dev);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 23c4307d254c..869a886c292e 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -131,31 +131,6 @@ static void nd_region_notify(struct device *dev, enum nvdimm_event event)
device_for_each_child(dev, &event, child_notify);
}
-int nvdimm_region_badblocks_clear(struct device *dev, void *data)
-{
- struct resource *res = (struct resource *)data;
- struct nd_region *nd_region;
- resource_size_t ndr_end;
- sector_t sector;
-
- /* make sure device is a region */
- if (!is_nd_pmem(dev))
- return 0;
-
- nd_region = to_nd_region(dev);
- ndr_end = nd_region->ndr_start + nd_region->ndr_size - 1;
-
- /* make sure we are in the region */
- if (res->start < nd_region->ndr_start || res->end > ndr_end)
- return 0;
-
- sector = (res->start - nd_region->ndr_start) >> 9;
- badblocks_clear(&nd_region->bb, sector, resource_size(res) >> 9);
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(nvdimm_region_badblocks_clear);
-
static struct nd_device_driver nd_region_driver = {
.probe = nd_region_probe,
.remove = nd_region_remove,
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 53d1ba4e6d99..07756b2e1cd5 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -477,20 +477,15 @@ static ssize_t read_only_store(struct device *dev,
}
static DEVICE_ATTR_RW(read_only);
-static ssize_t nd_badblocks_show(struct device *dev,
+static ssize_t region_badblocks_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct nd_region *nd_region = to_nd_region(dev);
return badblocks_show(&nd_region->bb, buf, 0);
}
-static struct device_attribute dev_attr_nd_badblocks = {
- .attr = {
- .name = "badblocks",
- .mode = S_IRUGO
- },
- .show = nd_badblocks_show,
-};
+
+static DEVICE_ATTR(badblocks, 0444, region_badblocks_show, NULL);
static ssize_t resource_show(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -514,7 +509,7 @@ static struct attribute *nd_region_attributes[] = {
&dev_attr_available_size.attr,
&dev_attr_namespace_seed.attr,
&dev_attr_init_namespaces.attr,
- &dev_attr_nd_badblocks.attr,
+ &dev_attr_badblocks.attr,
&dev_attr_resource.attr,
NULL,
};
@@ -532,7 +527,7 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
if (!is_nd_pmem(dev) && a == &dev_attr_dax_seed.attr)
return 0;
- if (!is_nd_pmem(dev) && a == &dev_attr_nd_badblocks.attr)
+ if (!is_nd_pmem(dev) && a == &dev_attr_badblocks.attr)
return 0;
if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 98b207611b06..f07b1b14159a 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -162,7 +162,4 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
void nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
-int nvdimm_region_badblocks_clear(struct device *dev, void *data);
-void __nvdimm_bus_badblocks_clear(struct nvdimm_bus *nvdimm_bus,
- struct resource *res);
#endif /* __LIBNVDIMM_H__ */