[PATCH] ndctl: fix a pmd test case
by Dan Williams
With the pending kernel fixes the O_DIRECT read test is no longer
crashing the kernel. Fix the buffer size and mishandling of the file
position.
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
lib/test-dax-pmd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/test-dax-pmd.c b/lib/test-dax-pmd.c
index 7ea4e6c7bdb6..0fee7bee8817 100644
--- a/lib/test-dax-pmd.c
+++ b/lib/test-dax-pmd.c
@@ -106,12 +106,12 @@ static int test_pmd(int fd)
break;
case 1: /* test O_DIRECT of pre-faulted address */
sprintf(addr, "odirect data");
- if (write(fd2, addr, 4096) != 4096) {
+ if (pwrite(fd2, addr, 4096, 0) != 4096) {
faili(i);
rc = -ENXIO;
}
((char *) buf)[0] = 0;
- read(fd2, buf, sizeof(buf));
+ pread(fd2, buf, 4096, 0);
if (strcmp(buf, "odirect data") != 0) {
faili(i);
rc = -ENXIO;
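For context, the two bugs the hunks above address can be pictured with a
minimal userspace sketch (not part of the patch; the real test uses
O_DIRECT against a pre-faulted DAX mapping): with a plain write()/read()
pair the file position has advanced past the data before the read-back,
and when the buffer is referenced through a pointer, sizeof() yields the
pointer size rather than the buffer size.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* fixed-size arrays so sizeof() really is the buffer size */
	char data[4096] = "odirect data", back[4096] = { 0 };
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDWR | O_CREAT, 0644)) < 0)
		return 1;
	/* explicit offsets sidestep the shared file position entirely */
	if (pwrite(fd, data, sizeof(data), 0) != sizeof(data))
		return 1;
	if (pread(fd, back, sizeof(back), 0) != sizeof(back))
		return 1;
	return strcmp(back, "odirect data") != 0;
}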
[GIT PULL] libnvdimm fixes for 4.4-rc2
by Williams, Dan J
Hi Linus, please pull from...
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive:
1/ A collection of crash and deadlock fixes for DAX that are also
tagged for -stable. We will look to re-enable DAX pmd mappings in 4.5,
but for now 4.4 and -stable should disable it by default.
2/ A fixup to ext2 and ext4 to mirror the same warning emitted by XFS
when mounting with "-o dax"
This set has received a build success notification from the kbuild
robot.
The following changes since commit 8005c49d9aea74d382f474ce11afbbc7d7130bec:
Linux 4.4-rc1 (2015-11-15 17:00:27 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to 2e6edc95382cc36423aff18a237173ad62d5ab52:
block: protect rw_page against device teardown (2015-11-19 13:47:10 -0800)
----------------------------------------------------------------
Dan Williams (3):
ext2, ext4: warn when mounting with dax enabled
dax: disable pmd mappings
block: protect rw_page against device teardown
Yigal Korman (1):
mm, dax: fix DAX deadlocks (COW fault)
block/blk.h | 2 --
fs/Kconfig | 6 ++++++
fs/block_dev.c | 18 ++++++++++++++++--
fs/dax.c | 4 ++++
fs/ext2/super.c | 2 ++
fs/ext4/super.c | 6 +++++-
include/linux/blkdev.h | 2 ++
mm/memory.c | 8 ++++----
8 files changed, 39 insertions(+), 9 deletions(-)
commit 2e6edc95382cc36423aff18a237173ad62d5ab52
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Thu Nov 19 13:29:28 2015 -0800
block: protect rw_page against device teardown
Fix use after free crashes like the following:
general protection fault: 0000 [#1] SMP
Call Trace:
[<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
[<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
[<ffffffff8128fd90>] bdev_read_page+0x50/0x60
[<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
[<ffffffff81297657>] mpage_readpages+0x107/0x170
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
[<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
[<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
[<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
[<ffffffff811c76f6>] filemap_fault+0x396/0x530
[<ffffffff811f816e>] __do_fault+0x4e/0xf0
[<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50
Cc: <stable(a)vger.kernel.org>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Reported-by: kbuild test robot <lkp(a)intel.com>
Acked-by: Matthew Wilcox <willy(a)linux.intel.com>
[willy: symmetry fixups]
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 0df9d41ab5d43dc5b20abc8b22a6b6d098b03994
Author: Yigal Korman <yigal(a)plexistor.com>
Date: Mon Nov 16 14:09:15 2015 +0200
mm, dax: fix DAX deadlocks (COW fault)
DAX handling of COW faults has wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write
Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].
Original COW locking logic was introduced by Matthew here[4].
This should be applied to v4.3 as well.
[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
Cc: <stable(a)vger.kernel.org>
Cc: Boaz Harrosh <boaz(a)plexistor.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner(a)redhat.com>
Cc: Jan Kara <jack(a)suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox(a)intel.com>
Acked-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Yigal Korman <yigal(a)plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit ee82c9ed41e896bd47e121d87e4628de0f2656a3
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Sun Nov 15 16:06:32 2015 -0800
dax: disable pmd mappings
While dax pmd mappings are functional in the nominal path they trigger
kernel crashes in the following paths:
BUG: unable to handle kernel paging request at ffffea0004098000
IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
[..]
Call Trace:
[<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
[<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
[<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
[<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0
kernel BUG at arch/x86/mm/gup.c:131!
[..]
Call Trace:
[<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
[<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0
BUG: unable to handle kernel paging request at ffffea0004088000
IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
[..]
Call Trace:
[<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
[<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
[<ffffffff810a11c1>] _do_fork+0x91/0x590
All of these paths are interpreting a dax pmd mapping as a transparent
huge page and making the assumption that the pfn is covered by the
memmap, i.e. that the pfn has an associated struct page. PTE mappings
do not suffer the same fate since they have the _PAGE_SPECIAL flag to
cause the gup path to fault. We can do something similar for the PMD
path, or otherwise defer pmd support for cases where a struct page is
available. For now, 4.4-rc and -stable need to disable dax pmd support
by default.
For development the "depends on BROKEN" line can be removed from
CONFIG_FS_DAX_PMD.
Cc: <stable(a)vger.kernel.org>
Cc: Jan Kara <jack(a)suse.com>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit ef83b6e8f40bb24b92ad73b5889732346e54a793
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Tue Sep 29 15:48:11 2015 -0400
ext2, ext4: warn when mounting with dax enabled
Similar to XFS, warn when mounting DAX while it is still considered under
development. Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem. The
maturity of a given DAX implementation is filesystem specific.
Cc: <stable(a)vger.kernel.org>
Cc: "Theodore Ts'o" <tytso(a)mit.edu>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: linux-ext4(a)vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Dave Chinner <david(a)fromorbit.com>
Acked-by: Jan Kara <jack(a)suse.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/block/blk.h b/block/blk.h
index da722eb786df..c43926d3d74d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
void __blk_queue_free_tags(struct request_queue *q);
bool __blk_end_bidi_request(struct request *rq, int error,
unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
void blk_freeze_queue(struct request_queue *q);
static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/Kconfig b/fs/Kconfig
index da3f32f1a4e4..6ce72d8d1ee1 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -46,6 +46,12 @@ config FS_DAX
or if unsure, say N. Saying Y will increase the size of the kernel
by about 5kB.
+config FS_DAX_PMD
+ bool
+ default FS_DAX
+ depends on FS_DAX
+ depends on BROKEN
+
endif # BLOCK
# Posix ACL utility routines
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..c25639e907bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -390,9 +390,17 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
struct page *page)
{
const struct block_device_operations *ops = bdev->bd_disk->fops;
+ int result = -EOPNOTSUPP;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
- return -EOPNOTSUPP;
- return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ return result;
+
+ result = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (result)
+ return result;
+ result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ blk_queue_exit(bdev->bd_queue);
+ return result;
}
EXPORT_SYMBOL_GPL(bdev_read_page);
@@ -421,14 +429,20 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
int result;
int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
const struct block_device_operations *ops = bdev->bd_disk->fops;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
+ result = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (result)
+ return result;
+
set_page_writeback(page);
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
if (result)
end_page_writeback(page);
else
unlock_page(page);
+ blk_queue_exit(bdev->bd_queue);
return result;
}
EXPORT_SYMBOL_GPL(bdev_write_page);
diff --git a/fs/dax.c b/fs/dax.c
index d1e5cb7311a1..43671b68220e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -541,6 +541,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
unsigned long pfn;
int result = 0;
+ /* dax pmd mappings are broken wrt gup and fork */
+ if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
+ return VM_FAULT_FALLBACK;
+
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
return VM_FAULT_FALLBACK;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3a71cea68420..748d35afc902 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -569,6 +569,8 @@ static int parse_options(char *options, struct super_block *sb)
/* Fall through */
case Opt_dax:
#ifdef CONFIG_FS_DAX
+ ext2_msg(sb, KERN_WARNING,
+ "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
set_opt(sbi->s_mount_opt, DAX);
#else
ext2_msg(sb, KERN_INFO, "dax option not supported");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 753f4e68b820..c9ab67da6e5a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1664,8 +1664,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
-#ifndef CONFIG_FS_DAX
} else if (token == Opt_dax) {
+#ifdef CONFIG_FS_DAX
+ ext4_msg(sb, KERN_WARNING,
+ "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+ sbi->s_mount_opt |= m->mount_opt;
+#else
ext4_msg(sb, KERN_INFO, "dax option not supported");
return -1;
#endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3fe27f8d91f0..c0d2b7927c1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -794,6 +794,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
extern void blk_start_queue(struct request_queue *q);
extern void blk_stop_queue(struct request_queue *q);
extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/memory.c b/mm/memory.c
index deb679c31f2a..c387430f06c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3015,9 +3015,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
/*
* The fault handler has no page to lock, so it holds
- * i_mmap_lock for write to protect against truncate.
+ * i_mmap_lock for read to protect against truncate.
*/
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
goto uncharge_out;
}
@@ -3031,9 +3031,9 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
/*
* The fault handler has no page to lock, so it holds
- * i_mmap_lock for write to protect against truncate.
+ * i_mmap_lock for read to protect against truncate.
*/
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
return ret;
uncharge_out:
[PATCH] block: protect rw_page against device teardown
by Dan Williams
Fix use after free crashes like the following:
general protection fault: 0000 [#1] SMP
Call Trace:
[<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
[<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
[<ffffffff8128fd90>] bdev_read_page+0x50/0x60
[<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
[<ffffffff81297657>] mpage_readpages+0x107/0x170
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
[<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
[<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
[<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
[<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
[<ffffffff811c76f6>] filemap_fault+0x396/0x530
[<ffffffff811f816e>] __do_fault+0x4e/0xf0
[<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50
Cc: <stable(a)vger.kernel.org>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
fs/block_dev.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb0dfb1c7af1..cc0af12acf94 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -390,9 +390,17 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
struct page *page)
{
const struct block_device_operations *ops = bdev->bd_disk->fops;
+ int rc = -EOPNOTSUPP;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
- return -EOPNOTSUPP;
- return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ return rc;
+
+ rc = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (rc)
+ return rc;
+ rc = ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+ blk_queue_exit(bdev->bd_queue);
+ return rc;
}
EXPORT_SYMBOL_GPL(bdev_read_page);
@@ -421,14 +429,20 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
int result;
int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
const struct block_device_operations *ops = bdev->bd_disk->fops;
+
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
+ result = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
+ if (result)
+ return result;
+
set_page_writeback(page);
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
if (result)
end_page_writeback(page);
else
unlock_page(page);
+ blk_queue_exit(bdev->bd_queue);
return result;
}
EXPORT_SYMBOL_GPL(bdev_write_page);
[RFC PATCH] restrict /dev/mem to idle io memory ranges
by Dan Williams
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default. If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver. This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.
Persistent memory presents a large "mistake surface" to /dev/mem, since
accidental writes can now corrupt a filesystem.
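A hedged userspace sketch of the new semantics (PHYS_ADDR is a
hypothetical driver-claimed address; pick a real one from /proc/iomem):
with CONFIG_IO_STRICT_DEVMEM=y, mapping a range a driver holds
IORESOURCE_BUSY on is expected to fail where it previously succeeded.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PHYS_ADDR 0xfed00000UL	/* hypothetical busy range, see /proc/iomem */

int main(void)
{
	int fd = open("/dev/mem", O_RDWR);
	void *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, PHYS_ADDR);
	if (p == MAP_FAILED)
		perror("mmap");		/* EPERM expected on a claimed range */
	else
		munmap(p, 4096);
	close(fd);
	return 0;
}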
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Russell King <linux(a)arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/arm/Kconfig.debug | 14 --------------
arch/arm64/Kconfig.debug | 14 --------------
arch/powerpc/Kconfig.debug | 12 ------------
arch/s390/Kconfig.debug | 12 ------------
arch/tile/Kconfig | 3 ---
arch/unicore32/Kconfig.debug | 14 --------------
arch/x86/Kconfig.debug | 17 -----------------
kernel/resource.c | 3 +++
lib/Kconfig.debug | 36 ++++++++++++++++++++++++++++++++++++
9 files changed, 39 insertions(+), 86 deletions(-)
diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index 259c0ca9c99a..e356357d86bb 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -15,20 +15,6 @@ config ARM_PTDUMP
kernel.
If in doubt, say "N"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
# RMK wants arm kernels compiled with frame pointers or stack unwinding.
# If you know what you are doing and are willing to live without stack
# traces, you can get a slightly smaller kernel by setting this option to
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index 04fb73b973f1..e13c4bf84d9e 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -14,20 +14,6 @@ config ARM64_PTDUMP
kernel.
If in doubt, say "N"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- help
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
config PID_IN_CONTEXTIDR
bool "Write the current PID to the CONTEXTIDR register"
help
diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 3a510f4a6b68..a0e44a9c456f 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -335,18 +335,6 @@ config PPC_EARLY_DEBUG_CPM_ADDR
platform probing is done, all platforms selected must
share the same address.
-config STRICT_DEVMEM
- def_bool y
- prompt "Filter access to /dev/mem"
- help
- This option restricts access to /dev/mem. If this option is
- disabled, you allow userspace access to all memory, including
- kernel and userspace memory. Accidental memory access is likely
- to be disastrous.
- Memory access is required for experts who want to debug the kernel.
-
- If you are unsure, say Y.
-
config FAIL_IOMMU
bool "Fault-injection capability for IOMMU"
depends on FAULT_INJECTION
diff --git a/arch/s390/Kconfig.debug b/arch/s390/Kconfig.debug
index c56878e1245f..26c5d5beb4be 100644
--- a/arch/s390/Kconfig.debug
+++ b/arch/s390/Kconfig.debug
@@ -5,18 +5,6 @@ config TRACE_IRQFLAGS_SUPPORT
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- def_bool y
- prompt "Filter access to /dev/mem"
- ---help---
- This option restricts access to /dev/mem. If this option is
- disabled, you allow userspace access to all memory, including
- kernel and userspace memory. Accidental memory access is likely
- to be disastrous.
- Memory access is required for experts who want to debug the kernel.
-
- If you are unsure, say Y.
-
config S390_PTDUMP
bool "Export kernel pagetable layout to userspace via debugfs"
depends on DEBUG_KERNEL
diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 106c21bd7f44..7b2d40db11fa 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -116,9 +116,6 @@ config ARCH_DISCONTIGMEM_DEFAULT
config TRACE_IRQFLAGS_SUPPORT
def_bool y
-config STRICT_DEVMEM
- def_bool y
-
# SMP is required for Tilera Linux.
config SMP
def_bool y
diff --git a/arch/unicore32/Kconfig.debug b/arch/unicore32/Kconfig.debug
index 1a3626239843..f075bbe1d46f 100644
--- a/arch/unicore32/Kconfig.debug
+++ b/arch/unicore32/Kconfig.debug
@@ -2,20 +2,6 @@ menu "Kernel hacking"
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- depends on MMU
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to memory mapped peripherals.
-
- If in doubt, say Y.
-
config EARLY_PRINTK
def_bool DEBUG_OCD
help
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 137dfa96aa14..1116452fcfc2 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -5,23 +5,6 @@ config TRACE_IRQFLAGS_SUPPORT
source "lib/Kconfig.debug"
-config STRICT_DEVMEM
- bool "Filter access to /dev/mem"
- ---help---
- If this option is disabled, you allow userspace (root) access to all
- of memory, including kernel and userspace memory. Accidental
- access to this is obviously disastrous, but specific access can
- be used by people debugging the kernel. Note that with PAT support
- enabled, even in this case there are restrictions on /dev/mem
- use due to the cache aliasing requirements.
-
- If this option is switched on, the /dev/mem file only allows
- userspace access to PCI space and the BIOS code and data regions.
- This is sufficient for dosemu and X and all common users of
- /dev/mem.
-
- If in doubt, say Y.
-
config X86_VERBOSE_BOOTUP
bool "Enable verbose x86 bootup info messages"
default y
diff --git a/kernel/resource.c b/kernel/resource.c
index f150dbbe6f62..03a8b09f68a8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1498,6 +1498,9 @@ int iomem_is_exclusive(u64 addr)
break;
if (p->end < addr)
continue;
+ if (IS_ENABLED(CONFIG_IO_STRICT_DEVMEM)
+ && p->flags & IORESOURCE_BUSY)
+ break;
if (p->flags & IORESOURCE_BUSY &&
p->flags & IORESOURCE_EXCLUSIVE) {
err = 1;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8c15b29d5adc..a188d7757e26 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1853,3 +1853,39 @@ source "samples/Kconfig"
source "lib/Kconfig.kgdb"
+config STRICT_DEVMEM
+ bool "Filter access to /dev/mem"
+ depends on MMU
+ default y if TILE || PPC || S390
+ ---help---
+ If this option is disabled, you allow userspace (root) access to all
+ of memory, including kernel and userspace memory. Accidental
+ access to this is obviously disastrous, but specific access can
+ be used by people debugging the kernel. Note that with PAT support
+ enabled, even in this case there are restrictions on /dev/mem
+ use due to the cache aliasing requirements.
+
+ If this option is switched on, the /dev/mem file only allows
+ userspace access to PCI space and the BIOS code and data regions.
+ This is sufficient for dosemu and X and all common users of
+ /dev/mem.
+
+ If in doubt, say Y.
+
+config IO_STRICT_DEVMEM
+ bool "Filter I/O access to /dev/mem"
+ depends on STRICT_DEVMEM
+ ---help---
+ If this option is disabled, you allow userspace (root) access
+ to all io memory regardless of whether a driver is actively
+ using that range. Accidental access to this is obviously
+ disastrous, but specific access can be used by people
+ debugging the kernel.
+
+ If this option is switched on, the /dev/mem file only allows
+ userspace access to *idle* io memory ranges (any non "System
+ RAM" range listed in /proc/iomem). This may break
+ traditional users of /dev/mem if the driver using a given
+ range cannot be disabled.
+
+ If in doubt, say N.
[PATCH 0/8] dax fixes / cleanups: pmd vs thp, lifetime, and locking
by Dan Williams
Changes since last posting [1]:
1/ Further cleanups to dax_clear_blocks(): Dropped increments of 'addr'
since we call bdev_direct_access() before the next use, and dropped the
BUG_ON for sector unaligned return values from bdev_direct_access().
2/ In [PATCH 8/8] introduce blk_dax_ctl to remove the need to have
separate dax_map_atomic and __dax_map_atomic routines. Note,
blk_dax_ctl is not passed through to drivers, it gets unpacked in
bdev_direct_access. (Willy)
3/ New [PATCH 2/8]: Disable huge page dax mappings while we resolve
various crash scenarios in this development cycle.
4/ New [PATCH 4/8]: Unmap all dax mappings at block device shutdown
I have kept the reviewed-by's received to date, let me know if these
incremental updates invalidate that review.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-November/002733.html
---
The first 4 patches in this series I consider 4.4-rc / -stable material.
The rest are for 4.5. [PATCH 4/8] needs scrutiny. It is yet another
example of where DAX behavior necessarily differs from page cache
behavior. I still maintain that we should not be surprising unaware
applications with DAX semantics, i.e. that DAX should be per-inode
opt-in, not globally enabled for all inodes at fs mount time.
The largest patch in the set, [PATCH 8/8], addresses the lifetime of the
'addr' returned by bdev_direct_access. That address is only valid while
the device driver is enabled. The new dax_map_atomic() /
dax_unmap_atomic() pairing guarantees that 'addr' stays valid for the
duration of that mapping.
While dax_map_atomic() protects against 'addr' going invalid, the new
calls to truncate_pagecache() via invalidate_inodes() protect against
the 'pfn' returned from bdev_direct_access() going invalid. Otherwise,
the storage media can be directly accessed after the driver has been
disabled.
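The pairing can be sketched as follows -- hedged: blk_dax_ctl's fields
and the exact signatures here are assumptions drawn from this cover
letter, not the final merged API. The point is the bracketing: 'addr'
is only valid between map and unmap, which pins the driver against
teardown for the duration.

#include <linux/blkdev.h>
#include <linux/string.h>

/* kernel-context sketch, not a verbatim excerpt of the series */
static int copy_from_dax(struct block_device *bdev, sector_t sector,
			 void *dst, size_t len)
{
	struct blk_dax_ctl dax = {
		.sector = sector,
		.size = len,
	};
	long rc;

	rc = dax_map_atomic(bdev, &dax);	/* takes a reference on the queue */
	if (rc < 0)
		return rc;
	memcpy(dst, dax.addr, len);		/* 'addr' is valid only in here */
	dax_unmap_atomic(bdev, &dax);		/* reference dropped, 'addr' dead */
	return 0;
}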
---
[PATCH 1/8] ext2, ext4: warn when mounting with dax enabled
[PATCH 2/8] dax: disable pmd mappings
[PATCH 3/8] mm, dax: fix DAX deadlocks (COW fault)
[PATCH 4/8] mm, dax: truncate dax mappings at bdev or fs shutdown
[PATCH 5/8] pmem, dax: clean up clear_pmem()
[PATCH 6/8] dax: increase granularity of dax_clear_blocks() operations
[PATCH 7/8] dax: guarantee page aligned results from bdev_direct_access()
[PATCH 8/8] dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
arch/x86/include/asm/pmem.h | 7 -
block/blk.h | 2
fs/Kconfig | 6 +
fs/block_dev.c | 15 +--
fs/dax.c | 228 ++++++++++++++++++++++++++-----------------
fs/ext2/super.c | 2
fs/ext4/super.c | 6 +
fs/inode.c | 27 +++++
include/linux/blkdev.h | 19 +++-
mm/memory.c | 8 +-
mm/truncate.c | 13 ++
11 files changed, 217 insertions(+), 116 deletions(-)
[PATCH v2 00/11] DAX fsync/msync support
by Ross Zwisler
This patch series adds support for fsync/msync to DAX.
Patches 1 through 7 add various utilities that the DAX code will eventually
need, and the DAX code itself is added by patch 8. Patches 9-11 update the
three filesystems that currently support DAX, ext2, ext4 and XFS, to use
the new DAX fsync/msync code.
These patches build on the recent DAX locking changes from Dave Chinner,
Jan Kara and myself. Dave's changes for XFS and my changes for ext2 have
been merged in the v4.4 window, but Jan's are still unmerged. You can grab
them here:
http://www.spinics.net/lists/linux-ext4/msg49951.html
Ross Zwisler (11):
pmem: add wb_cache_pmem() to the PMEM API
mm: add pmd_mkclean()
pmem: enable REQ_FUA/REQ_FLUSH handling
dax: support dirty DAX entries in radix tree
mm: add follow_pte_pmd()
mm: add pgoff_mkclean()
mm: add find_get_entries_tag()
dax: add support for fsync/sync
ext2: add support for DAX fsync/msync
ext4: add support for DAX fsync/msync
xfs: add support for DAX fsync/msync
arch/x86/include/asm/pgtable.h | 5 ++
arch/x86/include/asm/pmem.h | 11 ++--
drivers/nvdimm/pmem.c | 3 +-
fs/block_dev.c | 3 +-
fs/dax.c | 140 +++++++++++++++++++++++++++++++++++++++--
fs/ext2/file.c | 14 ++++-
fs/ext4/file.c | 4 +-
fs/ext4/fsync.c | 12 +++-
fs/inode.c | 1 +
fs/xfs/xfs_file.c | 18 ++++--
include/linux/dax.h | 6 ++
include/linux/fs.h | 1 +
include/linux/mm.h | 2 +
include/linux/pagemap.h | 3 +
include/linux/pmem.h | 22 ++++++-
include/linux/radix-tree.h | 8 +++
include/linux/rmap.h | 5 ++
mm/filemap.c | 71 ++++++++++++++++++++-
mm/huge_memory.c | 14 ++---
mm/memory.c | 38 ++++++++---
mm/rmap.c | 51 +++++++++++++++
mm/truncate.c | 62 ++++++++++--------
22 files changed, 425 insertions(+), 69 deletions(-)
--
2.1.0
dax pmd fault handler never returns to userspace
by Jeff Moyer
Hi,
When running the nvml library's test suite against an ext4 file system
mounted with -o dax, I ran into an issue where many of the tests would
simply timeout. The problem appears to be that the pmd fault handler
never returns to userspace (the application is doing a memcpy of 512
bytes into pmem). Here's the 'perf report -g' output:
- 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back
- 88.30% __memmove_ssse3_back
- 66.63% page_fault
- 66.47% do_page_fault
- 66.16% __do_page_fault
- 63.38% handle_mm_fault
- 61.15% ext4_dax_pmd_fault
- 45.04% __dax_pmd_fault
- 37.05% vmf_insert_pfn_pmd
- track_pfn_insert
- 35.58% lookup_memtype
- 33.80% pat_pagerange_is_ram
- 33.40% walk_system_ram_range
- 31.63% find_next_iomem_res
21.78% strcmp
And here's 'perf top':
Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519
Overhead Shared Object Symbol
22.55% [kernel] [k] strcmp
20.33% [unknown] [k] 0x00007f9f549ef3f3
10.01% [kernel] [k] native_irq_return_iret
9.54% [kernel] [k] find_next_iomem_res
3.00% [jbd2] [k] start_this_handle
This is easily reproduced by doing the following:
git clone https://github.com/pmem/nvml.git
cd nvml
make
make test
cd src/test/blk_non_zero
./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0
I also ran the test suite against xfs, and the problem is not present
there. However, I did not verify that the xfs tests were getting pmd
faults.
I'm happy to help diagnose the problem further, if necessary.
Cheers,
Jeff
[RFC PATCH] block: introduce poison tracking for block devices
by Vishal Verma
This patch copies the badblock management code from md-raid to use it
for tracking bad/'poison' sectors on a per-block device level.
NVDIMM devices, which behave more like DRAM, may develop bad cache
lines, or 'poison'. A block device exposed by the pmem driver can
then consume poison via a read (or write), and cause a machine check.
On platforms without machine check recovery features, this would
mean a crash.
The block device maintaining a runtime list of all known poison can
directly avoid this, and also provide a path forward to enable proper
handling/recovery for DAX faults on such a device.
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
This really is a copy-paste + a few modifications of the badblock management
code + sysfs representation from md.
In this RFC, I want to make sure this path sounds acceptable for the use case
described above, for NVDIMMs. Eventually, I think the md badblock management
and this should be refactored to use the same code - I think this should be
easy to do:
- move the badblocks struct and associated functions into a header file (along
the lines of include/linux/list.h)
- embed the structure into whatever needs to use this list (in case of md, this
would be 'rdev', in the nvdimm case, the gendisk)
- call the functions from badblocks.h as needed to manipulate the list.
- The sysfs show/store functions in badblocks.h would be generic variants, with
wrappers being present in md and gendisk to fit into their respective sysfs
layouts
If this looks generally reasonable, I'll post a v2 with this refactoring done.
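To make that concrete, a hedged sketch of what such a shared header
(a hypothetical include/linux/badblocks.h) could look like -- the names
are illustrative, modeled on the md code this RFC copies, not an
existing interface:

#include <linux/seqlock.h>
#include <linux/genhd.h>

struct badblocks {
	int count;		/* count of bad blocks */
	int unacked_exist;	/* unacknowledged bad blocks may exist */
	int shift;		/* sector -> block shift; negative = disabled */
	u64 *page;		/* the bad-block table itself */
	int changed;
	seqlock_t lock;
};

/* generic operations, embedded by each user (md's rdev, the gendisk) */
int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
		    sector_t *first_bad, int *bad_sectors);
int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
		  int acknowledged);
int badblocks_clear(struct badblocks *bb, sector_t s, int sectors);

/* the gendisk wrappers would then reduce to one-liners, e.g.: */
static inline int disk_check_poison(struct gendisk *disk, sector_t s,
		int sectors, sector_t *first_bad, int *bad_sectors)
{
	/* assumes disk->plist becomes a struct badblocks * */
	return badblocks_check(disk->plist, s, sectors,
			       first_bad, bad_sectors);
}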
block/genhd.c | 502 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/genhd.h | 26 +++
2 files changed, 528 insertions(+)
diff --git a/block/genhd.c b/block/genhd.c
index 0c706f3..de99d28 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -23,6 +23,15 @@
#include "blk.h"
+#define BB_LEN_MASK (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
static DEFINE_MUTEX(block_class_lock);
struct kobject *block_depr;
@@ -670,6 +679,496 @@ void del_gendisk(struct gendisk *disk)
}
EXPORT_SYMBOL(del_gendisk);
+int disk_poison_list_init(struct gendisk *disk)
+{
+ disk->plist = kmalloc(sizeof(*disk->plist), GFP_KERNEL);
+ if (!disk->plist)
+ return -ENOMEM;
+ disk->plist->count = 0;
+ disk->plist->shift = 0;
+ disk->plist->page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ seqlock_init(&disk->plist->lock);
+ if (disk->plist->page == NULL)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL(disk_poison_list_init);
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ * A 'shift' can be set so that larger blocks are tracked and
+ * consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so md_is_badblock
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ * 0 if there are no known bad blocks in the range
+ * 1 if there are known bad block which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int disk_check_poison(struct gendisk *disk, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ struct disk_poison *bb = disk->plist;
+ int hi;
+ int lo;
+ u64 *p = bb->page;
+ int rv;
+ sector_t target = s + sectors;
+ unsigned seq;
+
+ if (bb->shift > 0) {
+ /* round the start down, and the end up */
+ s >>= bb->shift;
+ target += (1<<bb->shift) - 1;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+ /* 'target' is now the first block after the bad range */
+
+retry:
+ seq = read_seqbegin(&bb->lock);
+ lo = 0;
+ rv = 0;
+ hi = bb->count;
+
+ /* Binary search between lo and hi for 'target'
+ * i.e. for the last range that starts before 'target'
+ */
+ /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+ * are known not to be the last range before target.
+ * VARIANT: hi-lo is the number of possible
+ * ranges, and decreases until it reaches 1
+ */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ /* This could still be the one, earlier ranges
+ * could not. */
+ lo = mid;
+ else
+ /* This and later ranges are definitely out. */
+ hi = mid;
+ }
+ /* 'lo' might be the last that started before target, but 'hi' isn't */
+ if (hi > lo) {
+ /* need to check all range that end after 's' to see if
+ * any are unacknowledged.
+ */
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ if (BB_OFFSET(p[lo]) < target) {
+ /* starts before the end, and finishes after
+ * the start, so they must overlap
+ */
+ if (rv != -1 && BB_ACK(p[lo]))
+ rv = 1;
+ else
+ rv = -1;
+ *first_bad = BB_OFFSET(p[lo]);
+ *bad_sectors = BB_LEN(p[lo]);
+ }
+ lo--;
+ }
+ }
+
+ if (read_seqretry(&bb->lock, seq))
+ goto retry;
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(disk_check_poison);
+
+/*
+ * Add a range of bad blocks to the table.
+ * This might extend the table, or might contract it
+ * if two adjacent ranges can be merged.
+ * We binary-search to find the 'insertion' point, then
+ * decide how best to handle it.
+ */
+int disk_add_poison(struct gendisk *disk, sector_t s, int sectors,
+ int acknowledged)
+{
+ struct disk_poison *bb = disk->plist;
+ u64 *p;
+ int lo, hi;
+ int rv = 1;
+ unsigned long flags;
+
+ if (bb->shift < 0)
+ /* badblocks are disabled */
+ return 0;
+
+ if (bb->shift) {
+ /* round the start down, and the end up */
+ sector_t next = s + sectors;
+ s >>= bb->shift;
+ next += (1<<bb->shift) - 1;
+ next >>= bb->shift;
+ sectors = next - s;
+ }
+
+ write_seqlock_irqsave(&bb->lock, flags);
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts at-or-before 's' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a <= s)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo && BB_OFFSET(p[lo]) > s)
+ hi = lo;
+
+ if (hi > lo) {
+ /* we found a range that might merge with the start
+ * of our new range
+ */
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t e = a + BB_LEN(p[lo]);
+ int ack = BB_ACK(p[lo]);
+ if (e >= s) {
+ /* Yes, we can merge with a previous range */
+ if (s == a && s + sectors >= e)
+ /* new range covers old */
+ ack = acknowledged;
+ else
+ ack = ack && acknowledged;
+
+ if (e < s + sectors)
+ e = s + sectors;
+ if (e - a <= BB_MAX_LEN) {
+ p[lo] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ /* does not all fit in one range,
+ * make p[lo] maximal
+ */
+ if (BB_LEN(p[lo]) != BB_MAX_LEN)
+ p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ }
+ }
+ if (sectors && hi < bb->count) {
+ /* 'hi' points to the first range that starts after 's'.
+ * Maybe we can merge with the start of that range */
+ sector_t a = BB_OFFSET(p[hi]);
+ sector_t e = a + BB_LEN(p[hi]);
+ int ack = BB_ACK(p[hi]);
+ if (a <= s + sectors) {
+ /* merging is possible */
+ if (e <= s + sectors) {
+ /* full overlap */
+ e = s + sectors;
+ ack = acknowledged;
+ } else
+ ack = ack && acknowledged;
+
+ a = s;
+ if (e - a <= BB_MAX_LEN) {
+ p[hi] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ lo = hi;
+ hi++;
+ }
+ }
+ if (sectors == 0 && hi < bb->count) {
+ /* we might be able to combine lo and hi */
+ /* Note: 's' is at the end of 'lo' */
+ sector_t a = BB_OFFSET(p[hi]);
+ int lolen = BB_LEN(p[lo]);
+ int hilen = BB_LEN(p[hi]);
+ int newlen = lolen + hilen - (s - a);
+ if (s >= a && newlen < BB_MAX_LEN) {
+ /* yes, we can combine them */
+ int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
+ p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
+ memmove(p + hi, p + hi + 1,
+ (bb->count - hi - 1) * 8);
+ bb->count--;
+ }
+ }
+ while (sectors) {
+ /* didn't merge (it all).
+ * Need to add a range just before 'hi' */
+ if (bb->count >= DISK_MAX_POISON) {
+ /* No room for more */
+ rv = 0;
+ break;
+ } else {
+ int this_sectors = sectors;
+ memmove(p + hi + 1, p + hi,
+ (bb->count - hi) * 8);
+ bb->count++;
+
+ if (this_sectors > BB_MAX_LEN)
+ this_sectors = BB_MAX_LEN;
+ p[hi] = BB_MAKE(s, this_sectors, acknowledged);
+ sectors -= this_sectors;
+ s += this_sectors;
+ }
+ }
+
+ bb->changed = 1;
+ if (!acknowledged)
+ bb->unacked_exist = 1;
+ write_sequnlock_irqrestore(&bb->lock, flags);
+
+ /* Make sure they get written out promptly */
+ /* TODO sysfs_notify_dirent_safe(disk->sysfs_state); */
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(disk_add_poison);
+
+/*
+ * Remove a range of bad blocks from the table.
+ * This may involve extending the table if we spilt a region,
+ * but it must not fail. So if the table becomes full, we just
+ * drop the remove request.
+ */
+int disk_clear_poison(struct gendisk *disk, sector_t s, int sectors)
+{
+ struct disk_poison *bb = disk->plist;
+ u64 *p;
+ int lo, hi;
+ sector_t target = s + sectors;
+ int rv = 0;
+
+ if (bb->shift > 0) {
+ /* When clearing we round the start up and the end down.
+ * This should not matter as the shift should align with
+ * the block size and no rounding should ever be needed.
+ * However it is better the think a block is bad when it
+ * isn't than to think a block is not bad when it is.
+ */
+ s += (1<<bb->shift) - 1;
+ s >>= bb->shift;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+
+ write_seqlock_irq(&bb->lock);
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts before 'target' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo) {
+ /* p[lo] is the last range that could overlap the
+ * current range. Earlier ranges could also overlap,
+ * but only this one can overlap the end of the range.
+ */
+ if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
+ /* Partial overlap, leave the tail of this range */
+ int ack = BB_ACK(p[lo]);
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t end = a + BB_LEN(p[lo]);
+
+ if (a < s) {
+ /* we need to split this range */
+ if (bb->count >= DISK_MAX_POISON) {
+ rv = -ENOSPC;
+ goto out;
+ }
+ memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
+ bb->count++;
+ p[lo] = BB_MAKE(a, s-a, ack);
+ lo++;
+ }
+ p[lo] = BB_MAKE(target, end - target, ack);
+ /* there is no longer an overlap */
+ hi = lo;
+ lo--;
+ }
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ /* This range does overlap */
+ if (BB_OFFSET(p[lo]) < s) {
+ /* Keep the early parts of this range. */
+ int ack = BB_ACK(p[lo]);
+ sector_t start = BB_OFFSET(p[lo]);
+ p[lo] = BB_MAKE(start, s - start, ack);
+ /* now low doesn't overlap, so.. */
+ break;
+ }
+ lo--;
+ }
+ /* 'lo' is strictly before, 'hi' is strictly after,
+ * anything between needs to be discarded
+ */
+ if (hi - lo > 1) {
+ memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
+ bb->count -= (hi - lo - 1);
+ }
+ }
+
+ bb->changed = 1;
+out:
+ write_sequnlock_irq(&bb->lock);
+ return rv;
+}
+EXPORT_SYMBOL_GPL(disk_clear_poison);
+
+/*
+ * Acknowledge all bad blocks in a list.
+ * This only succeeds if ->changed is clear. It is used by
+ * in-kernel metadata updates
+ */
+void disk_ack_all_poison(struct gendisk *disk)
+{
+ struct disk_poison *bb = disk->plist;
+
+ if (bb->page == NULL || bb->changed)
+ /* no point even trying */
+ return;
+ write_seqlock_irq(&bb->lock);
+
+ if (bb->changed == 0 && bb->unacked_exist) {
+ u64 *p = bb->page;
+ int i;
+ for (i = 0; i < bb->count ; i++) {
+ if (!BB_ACK(p[i])) {
+ sector_t start = BB_OFFSET(p[i]);
+ int len = BB_LEN(p[i]);
+ p[i] = BB_MAKE(start, len, 1);
+ }
+ }
+ bb->unacked_exist = 0;
+ }
+ write_sequnlock_irq(&bb->lock);
+}
+EXPORT_SYMBOL_GPL(disk_ack_all_poison);
+
+/* sysfs access to bad-blocks list.
+ * We present two files.
+ * 'bad-blocks' lists sector numbers and lengths of ranges that
+ * are recorded as bad. The list is truncated to fit within
+ * the one-page limit of sysfs.
+ * Writing "sector length" to this file adds an acknowledged
+ * bad block list.
+ * 'unacknowledged-bad-blocks' lists bad blocks that have not yet
+ * been acknowledged. Writing to this file adds bad blocks
+ * without acknowledging them. This is largely for testing.
+ */
+
+static ssize_t poison_list_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct disk_poison *bb = disk->plist;
+ size_t len;
+ int i;
+ u64 *p = bb->page;
+ unsigned seq;
+
+ if (bb->shift < 0)
+ return 0;
+
+retry:
+ seq = read_seqbegin(&bb->lock);
+
+ len = 0;
+ i = 0;
+
+ while (len < PAGE_SIZE && i < bb->count) {
+ sector_t s = BB_OFFSET(p[i]);
+ unsigned int length = BB_LEN(p[i]);
+
+ i++;
+ len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
+ (unsigned long long)s << bb->shift,
+ length << bb->shift);
+ }
+
+ if (read_seqretry(&bb->lock, seq))
+ goto retry;
+
+ return len;
+}
+
+#define DO_DEBUG 1
+
+static ssize_t poison_list_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *page, size_t len)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ unsigned long long sector;
+ int length;
+ char newline;
+#ifdef DO_DEBUG
+ /* Allow clearing via sysfs *only* for testing/debugging.
+ * Normally only a successful write may clear a badblock
+ */
+ int clear = 0;
+ if (page[0] == '-') {
+ clear = 1;
+ page++;
+ }
+#endif /* DO_DEBUG */
+
+ switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
+ case 3:
+ if (newline != '\n')
+ return -EINVAL;
+ case 2:
+ if (length <= 0)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+#ifdef DO_DEBUG
+ if (clear) {
+ disk_clear_poison(disk, sector, length);
+ return len;
+ }
+#endif /* DO_DEBUG */
+ if (disk_add_poison(disk, sector, length, 1))
+ return len;
+ else
+ return -ENOSPC;
+}
+
/**
* get_gendisk - get partitioning information for a given device
* @devt: device to get partitioning information for
@@ -988,6 +1487,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+static DEVICE_ATTR(poison_list, S_IRUGO | S_IWUSR, poison_list_show,
+ poison_list_store);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -1009,6 +1510,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_poison_list.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 2adbfa6..9acfe1b 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -163,6 +163,24 @@ struct disk_part_tbl {
struct disk_events;
+#define DISK_MAX_POISON (PAGE_SIZE/8)
+
+struct disk_poison {
+ int count; /* count of bad blocks */
+ int unacked_exist; /* there probably are unacknowledged
+ * bad blocks. This is only cleared
+ * when a read discovers none
+ */
+ int shift; /* shift from sectors to block size
+ * a -ve shift means badblocks are
+ * disabled.*/
+ u64 *page; /* badblock list */
+ int changed;
+ seqlock_t lock;
+ sector_t sector;
+ sector_t size; /* in sectors */
+};
+
struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
@@ -201,6 +219,7 @@ struct gendisk {
struct blk_integrity *integrity;
#endif
int node_id;
+ struct disk_poison *plist;
};
static inline struct gendisk *part_to_disk(struct hd_struct *part)
@@ -434,6 +453,13 @@ extern void disk_block_events(struct gendisk *disk);
extern void disk_unblock_events(struct gendisk *disk);
extern void disk_flush_events(struct gendisk *disk, unsigned int mask);
extern unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask);
+extern int disk_poison_list_init(struct gendisk *disk);
+extern int disk_check_poison(struct gendisk *disk, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors);
+extern int disk_add_poison(struct gendisk *disk, sector_t s, int sectors,
+ int acknowledged);
+extern int disk_clear_poison(struct gendisk *disk, sector_t s, int sectors);
+extern void disk_ack_all_poison(struct gendisk *disk);
/* drivers/char/random.c */
extern void add_disk_randomness(struct gendisk *disk);
--
2.5.0
[RFC PATCH] Fix _FIT vs. NFIT processing breakage
by Linda Knippers
Since commit 209851649dc4f7900a6bfe1de5e2640ab2c7d931, we no longer
see NVDIMM devices on our systems. The NFIT/_FIT processing at
initialization gets a table from _FIT but doesn't like it.
When support for _FIT was added, the code presumed that the data
returned by the _FIT method is identical to the NFIT table, which
starts with an acpi_table_header. However, the _FIT is defined
to return data in the format of a series of NFIT-type structure
entries and, as a method, has an acpi_object header rather than
an acpi_table_header.
To address the differences, explicitly save the acpi_table_header
from the NFIT, since it is accessible through /sys, and change
the nfit pointer in the acpi_desc structure to point to the
table entries rather than the headers.
This is an RFC patch for several reasons.
1) I've only tested the boot path, not the code path gets
gets a _FIT later.
2) There is some debug information that we probably don't
want to keep in there.
3) I'm not even sure we should be checking _FIT at boot time
4) While this fixes my platform, it probably breaks the tests
that were used to test the original commit.
If we need to have a long discussion about whether our firmware
is correct, then perhaps we can remove the _FIT code from acpi_nfit_add()
while we sort it out.
Reported-by: Jeff Moyer <jmoyer(a)redhat.com>
Signed-off-by: Linda Knippers <linda.knippers(a)hp.com>
---
drivers/acpi/nfit.c | 55 +++++++++++++++++++++++++++++++++++++++++------------
drivers/acpi/nfit.h | 3 ++-
2 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index f7dab53..ad95113 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -655,7 +655,7 @@ static ssize_t revision_show(struct device *dev,
struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
- return sprintf(buf, "%d\n", acpi_desc->nfit->header.revision);
+ return sprintf(buf, "%d\n", acpi_desc->acpi_header.revision);
}
static DEVICE_ATTR_RO(revision);
@@ -1652,7 +1652,6 @@ int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
data = (u8 *) acpi_desc->nfit;
end = data + sz;
- data += sizeof(struct acpi_table_nfit);
while (!IS_ERR_OR_NULL(data))
data = add_table(acpi_desc, &prev, data, end);
@@ -1748,13 +1747,34 @@ static int acpi_nfit_add(struct acpi_device *adev)
return PTR_ERR(acpi_desc);
}
- acpi_desc->nfit = (struct acpi_table_nfit *) tbl;
+ /*
+ * Save the acpi header for later and then skip it, make
+ * nfit point to the first nfit table header.
+ */
+ acpi_desc->acpi_header = *tbl;
+ acpi_desc->nfit = (void *) tbl + sizeof(struct acpi_table_nfit);
+ sz -= sizeof(struct acpi_table_nfit);
/* Evaluate _FIT and override with that if present */
status = acpi_evaluate_object(adev->handle, "_FIT", NULL, &buf);
if (ACPI_SUCCESS(status) && buf.length > 0) {
- acpi_desc->nfit = (struct acpi_table_nfit *)buf.pointer;
- sz = buf.length;
+ union acpi_object *obj;
+
+ dev_dbg(dev, "%s _FIT ptr %p, length: %d\n",
+ __func__, buf.pointer, (int)buf.length);
+ print_hex_dump_debug("_FIT: ", DUMP_PREFIX_OFFSET, 16, 1,
+ buf.pointer, buf.length, true);
+
+ /*
+ * Adjust for the acpi_object header of the _FIT
+ */
+ obj = buf.pointer;
+ if (obj->type == ACPI_TYPE_BUFFER) {
+ acpi_desc->nfit = (struct acpi_nfit_header *)obj->buffer.pointer;
+ sz = obj->buffer.length;
+ } else
+ dev_dbg(dev, "%s invalid type %d, ignoring _FIT\n",
+ __func__, (int) obj->type);
}
rc = acpi_nfit_init(acpi_desc, sz);
@@ -1777,8 +1796,9 @@ static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
{
struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(&adev->dev);
struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL };
- struct acpi_table_nfit *nfit_saved;
+ struct acpi_nfit_header *nfit_saved;
struct device *dev = &adev->dev;
+ union acpi_object *obj;
acpi_status status;
int ret;
@@ -1807,13 +1827,24 @@ static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
goto out_unlock;
}
+ dev_dbg(dev, "%s _FIT ptr %p, length: %d\n",
+ __func__, buf.pointer, (int)buf.length);
+ print_hex_dump_debug("_FIT: ", DUMP_PREFIX_OFFSET, 16, 1,
+ buf.pointer, buf.length, true);
+
nfit_saved = acpi_desc->nfit;
- acpi_desc->nfit = (struct acpi_table_nfit *)buf.pointer;
- ret = acpi_nfit_init(acpi_desc, buf.length);
- if (!ret) {
- /* Merge failed, restore old nfit, and exit */
- acpi_desc->nfit = nfit_saved;
- dev_err(dev, "failed to merge updated NFIT\n");
+ obj = buf.pointer;
+ if (obj->type == ACPI_TYPE_BUFFER) {
+ acpi_desc->nfit = (struct acpi_nfit_header *)obj->buffer.pointer;
+ ret = acpi_nfit_init(acpi_desc, obj->buffer.length);
+ if (!ret) {
+ /* Merge failed, restore old nfit, and exit */
+ acpi_desc->nfit = nfit_saved;
+ dev_err(dev, "failed to merge updated NFIT\n");
+ }
+ } else {
+ /* Bad _FIT, restore old nfit */
+ dev_err(dev, "Invalid _FIT\n");
}
kfree(buf.pointer);
diff --git a/drivers/acpi/nfit.h b/drivers/acpi/nfit.h
index 2ea5c07..3d549a3 100644
--- a/drivers/acpi/nfit.h
+++ b/drivers/acpi/nfit.h
@@ -96,7 +96,8 @@ struct nfit_mem {
struct acpi_nfit_desc {
struct nvdimm_bus_descriptor nd_desc;
- struct acpi_table_nfit *nfit;
+ struct acpi_table_header acpi_header;
+ struct acpi_nfit_header *nfit;
struct mutex spa_map_mutex;
struct mutex init_mutex;
struct list_head spa_maps;
--
1.8.3.1
Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
by Dan Williams
On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
> On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
>>
>> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
>> <ross.zwisler(a)linux.intel.com> wrote:
>>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios. These
>>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
>>> and are used by filesystems to order their metadata, among other things.
>>>
>>> When we get an msync() or fsync() it is the responsibility of the DAX code
>>> to flush all dirty pages to media. The PMEM driver then just has to issue a
>>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
>>> the flushed data has been durably stored on the media.
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
>>
>> Hmm, I'm not seeing why we need this patch. If the actual flushing of
>> the cache is done by the core why does the driver need support
>> REQ_FLUSH? Especially since it's just a couple instructions. REQ_FUA
>> only makes sense if individual writes can bypass the "drive" cache,
>> but no I/O submitted to the driver proper is ever cached we always
>> flush it through to media.
>
> If the upper level filesystem gets an error when submitting a flush
> request, then it assumes the underlying hardware is broken and cannot
> be as aggressive in IO submission, but instead has to wait for in-flight
> IO to complete.
Upper level filesystems won't get errors when the driver does not
support flush. Those requests are ended cleanly in
generic_make_request_checks(). Yes, the fs still needs to wait for
outstanding I/O to complete but in the case of pmem all I/O is
synchronous. There's never anything to await when flushing at the
pmem driver level.
> Since FUA/FLUSH is basically a no-op for pmem devices,
> it doesn't make sense _not_ to support this functionality.
Seems to be a nop either way. Given that DAX may lead to dirty data
pending to the device in the cpu cache that a REQ_FLUSH request will
not touch, it's better to leave it all to the mm core to handle. I.e.
it doesn't make sense to call the driver just for two instructions
(sfence + pcommit) when the mm core is taking on the cache flushing.
Either handle it all in the mm or the driver, not a mixture.
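For readers following the thread, a hedged reconstruction of what
REQ_FLUSH support in pmem amounts to -- the "couple instructions" point
above. This is a sketch of the approach under discussion, not Ross's
actual patch; wmb_pmem() in this era expanded to sfence + pcommit.

/* at setup: blk_queue_flush(q, REQ_FLUSH); advertises flush support */
static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
{
	struct pmem_device *pmem = q->queuedata;
	struct bio_vec bvec;
	struct bvec_iter iter;

	bio_for_each_segment(bvec, bio, iter)
		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
			     bvec.bv_offset, bio_data_dir(bio),
			     iter.bi_sector);
	if (bio->bi_rw & REQ_FLUSH)
		wmb_pmem();	/* fence: flushed data is durable on media */
	bio_endio(bio);
	return BLK_QC_T_NONE;
}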