question about ext4 block allocation
by Ross Zwisler
I recently hit an issue in my DAX testing where I was unable to get ext4 to
give me 2 MiB sized and aligned block allocations in a situation where I
thought I should be able to. I'm using a PMEM ramdisk of size 16 GiB, created
using the memmap kernel command line parameter.
# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
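For reference, a pmem ramdisk like this is carved out of ordinary RAM with
the memmap=nn[KMG]!ss[KMG] kernel command line option, e.g. something like
this (the physical offset here is illustrative, not necessarily the one I
used):
memmap=16G!16G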
The very simple test program I used to reproduce this can be found at the
bottom of this mail. Here is the quick function that I used to recreate my
filesystem each run (note the stride=512 option: 512 * 4 KiB blocks = 2 MiB):
# type go_ext4
go_ext4 is a function
go_ext4 ()
{
    umount /dev/pmem0 2> /dev/null;
    mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0;
    mount -o dax /dev/pmem0 ~/dax;
    cd ~/fsync
}
To easily see whether DAX is using PMDs instead of PTEs, you can run the
mmots tree (git://git.cmpxchg.org/linux-mmots.git), tag
v4.10-rc4-mmots-2017-01-17-16-32, which has the DAX fault tracepoints.
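With that kernel the DAX fault tracepoints can be enabled through the usual
tracefs interface, e.g. (the fs_dax event group name is taken from the
tracepoint output below):
# echo 1 > /sys/kernel/debug/tracing/events/fs_dax/enable
# cat /sys/kernel/debug/tracing/trace_pipe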
Okay, so here's the interesting part. If I create a filesystem and run the
test so it creates a file of size 32 MiB or 128 MiB, I get a PMD fault.
Here's the corresponding tracepoint output:
test-1429 [008] .... 10573.026699: dax_pmd_fault: dev 259:0 ino 0xc shared
WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
0x40400000 pgoff 0x280 max_pgoff 0x7fff
test-1429 [008] .... 10573.026912: dax_pmd_insert_mapping: dev 259:0 ino 0xc
shared write address 0x40280000 length 0x200000 pfn 0x108a00 DEV|MAP
radix_entry 0x114000e
test-1429 [008] .... 10573.026917: dax_pmd_fault_done: dev 259:0 ino 0xc
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
vm_end 0x40400000 pgoff 0x280 max_pgoff 0x7fff NOPAGE
Great. That's what I want. But if I create the filesystem and use the test
to create a file that is 64 MiB in size, the PMD fault fails because the PFN
I get from the filesystem isn't 2 MiB aligned:
test-1475 [006] .... 11809.982188: dax_pmd_fault: dev 259:0 ino 0xc shared
WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
0x40400000 pgoff 0x280 max_pgoff 0x3fff
test-1475 [006] .... 11809.982398: dax_pmd_insert_mapping_fallback: dev 259:0
ino 0xc shared write address 0x40280000 length 0x200000 pfn 0x108601 DEV|MAP
radix_entry 0x0
test-1475 [006] .... 11809.982399: dax_pmd_fault_done: dev 259:0 ino 0xc
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
vm_end 0x40400000 pgoff 0x280 max_pgoff 0x3fff FALLBACK
The PFN for the block allocation I get from ext4 is 0x108601, which isn't
2 MiB aligned, so we fail the PG_PMD_COLOUR alignment check in
dax_iomap_pmd_fault() and fall back to PTEs.
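To make the arithmetic concrete, here is a tiny userspace sketch of that
colour check (PG_PMD_COLOUR is paraphrased from fs/dax.c; the PFNs are the
ones from the traces above):
#include <stdio.h>

#define PAGE_SHIFT      12
#define PMD_SIZE        (1UL << 21)                     /* 2 MiB */
#define PG_PMD_COLOUR   ((PMD_SIZE >> PAGE_SHIFT) - 1)  /* 0x1ff */

int main(void)
{
    unsigned long pfns[] = { 0x108a00, 0x108601 };
    int i;

    for (i = 0; i < 2; i++)
        printf("pfn 0x%lx: colour 0x%lx -> %s\n", pfns[i],
               pfns[i] & PG_PMD_COLOUR,
               (pfns[i] & PG_PMD_COLOUR) ? "FALLBACK to PTEs" : "PMD ok");
    return 0;
}
Running this shows pfn 0x108a00 has colour 0 (512-page aligned) while
0x108601 has colour 0x1, hence the FALLBACK in the second trace.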
I initially saw this in a test from Xiong:
https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02615.html
and created the attached test to have a simpler reproducer. With Xiong's
test, a test on a 128 MiB sized file will use all PMDs, and on a 64 MiB file
we'll use all PTEs.
This question is important because eventually we'd like to say to customers
"do X and you should get PMDs when you use DAX", but right now I'm not sure
what X is. :)
Thanks,
- Ross
--- >8 ---
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#define GiB(a) ((a)*1024ULL*1024*1024)
#define MiB(a) ((a)*1024ULL*1024)
#define PAGE(a) ((a)*0x1000)

void usage(char *prog)
{
    fprintf(stderr, "usage: %s <size in MiB>\n", prog);
    exit(1);
}

void err_exit(char *op, unsigned long len)
{
    fprintf(stderr, "%s(%s) len %lu\n", op, strerror(errno), len);
    exit(1);
}

int main(int argc, char *argv[])
{
    char *data_array = (char *) GiB(1); /* request a 2MiB aligned address with mmap() */
    unsigned long len;
    int fd;

    if (argc < 2)
        usage(basename(argv[0]));

    len = strtoul(argv[1], NULL, 10);
    if (errno == ERANGE)
        err_exit("strtoul", 0);

    fd = open("/root/dax/data", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
    if (fd < 0) {
        perror("fd");
        return 1;
    }

    ftruncate(fd, 0);
    fallocate(fd, 0, 0, MiB(len));

    /* map 4 MiB and write one byte at offset 2.5 MiB (pgoff 0x280) */
    data_array = mmap(data_array, PAGE(0x400), PROT_READ|PROT_WRITE,
            MAP_SHARED, fd, PAGE(0));
    data_array[PAGE(0x280)] = 142;

    fsync(fd);
    close(fd);
    return 0;
}
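To build and run it against the DAX mount (the source file name is whatever
you saved it as; the program hardcodes /root/dax/data, so adjust the path
for your setup):
# gcc -o dax-test test.c
# ./dax-test 64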
dax: enable DAX PMD support for NVDIMM device
by Yoshimi Ichiyanagi
Hello.
I use HPE 8G NVDIMM modules in an HPE DL360G9 server. Currently DAX PMD
(2 MiB page) support is disabled for these NVDIMM modules in kernel 4.10.0-rc5.
DAX PMD would be enabled if the PFN_DEV and PFN_MAP pmem device flags were
set at dax_pmd_insert_mapping(). But PFN_DEV and PFN_MAP are not set at
pmem_attach_disk() with the HPE NVDIMM modules, because
pmem_should_map_pages() does not return true at pmem_attach_disk().
pmem_should_map_pages() would return true, and DAX PMD would be enabled, if
the ND_REGION_PAGEMAP flag were set in the nd_region flags.
In this case the nd_region was initialized with acpi_nfit_register_region(),
which does not set ND_REGION_PAGEMAP in the nd_region flags, so DAX PMD was
disabled.
Is it OK to set ND_REGION_PAGEMAP in the flags of PM and VOLATILE type
nd_regions?
Here is the fio-2.16 script (mmap.fio) I used for my testing:
[global]
bs=4k
size=2G
directory=/mnt/pmem1
ioengine=mmap
rw=write
I did the following:
# mkfs.ext4 /dev/pmem1
# mount -t ext4 -o dax /dev/pmem1 /mnt/pmem1
# fio mmap.fio
Here are the performance results (ND_REGION_PAGEMAP flag off):
Run status group 0 (all jobs):
WRITE: bw=1228MiB/s (1287MB/s), 1228MiB/s-1228MiB/s (1287MB/s-1287MB/s),
io=2048MiB (2147MB), run=1668-1668msec
Here are the performance results (ND_REGION_PAGEMAP flag on, with the
following patch applied):
Run status group 0 (all jobs):
WRITE: bw=3459MiB/s (3628MB/s), 3459MiB/s-3459MiB/s (3628MB/s-3628MB/s),
io=2048MiB (2147MB), run=592-592msec
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 7361d00..1d3bd5a 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2096,7 +2096,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
struct acpi_nfit_system_address *spa = nfit_spa->spa;
struct nd_blk_region_desc *ndbr_desc;
struct nfit_mem *nfit_mem;
- int blk_valid = 0;
+ int blk_valid = -1;
if (!nvdimm) {
dev_err(acpi_desc->dev, "spa%d dimm: %#x not found\n",
@@ -2116,6 +2116,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
if (!nfit_mem || !nfit_mem->bdw) {
dev_dbg(acpi_desc->dev, "spa%d %s missing bdw\n",
spa->range_index, nvdimm_name(nvdimm));
+ blk_valid = 0;
} else {
mapping->size = nfit_mem->bdw->capacity;
mapping->start = nfit_mem->bdw->start_address;
@@ -2135,6 +2136,9 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
break;
}
+ if ( blk_valid < 0 )
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc->flags);
+
return 0;
}
[PATCH] acpi, nfit: skip ARS on machine-check-recovery capable platforms
by Dan Williams
If the platform supports machine-check-recovery then there is little
reason to kick off opportunistic scrubs to collect a media error list.
That initial scrub is only useful when it might prevent a kernel panic
from consuming poison (a media error from memory).
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/acpi/nfit/core.c | 6 ++++--
drivers/acpi/nfit/mce.c | 7 +++++++
drivers/acpi/nfit/nfit.h | 5 +++++
3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 7361d00818e2..bbefd9516939 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2500,10 +2500,12 @@ static void acpi_nfit_scrub(struct work_struct *work)
list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
/*
* Flag all the ranges that still need scrubbing, but
- * register them now to make data available.
+ * register them now to make data available. If the
+ * platform supports machine-check recovery then we skip
+ * these opportunistic scans.
*/
if (!nfit_spa->nd_region) {
- nfit_spa->ars_required = 1;
+ nfit_spa->ars_required = is_ars_required();
acpi_nfit_register_region(acpi_desc, nfit_spa);
}
}
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index e5ce81c38eed..1e6f1e7100f9 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -92,6 +92,13 @@ static struct notifier_block nfit_mce_dec = {
.notifier_call = nfit_handle_mce,
};
+bool is_ars_required(void)
+{
+ if (static_branch_unlikely(&mcsafe_key))
+ return false;
+ return true;
+}
+
void nfit_mce_register(void)
{
mce_register_decode_chain(&nfit_mce_dec);
diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h
index fc29c2e9832e..925f2a3d896e 100644
--- a/drivers/acpi/nfit/nfit.h
+++ b/drivers/acpi/nfit/nfit.h
@@ -211,6 +211,7 @@ int acpi_nfit_ars_rescan(struct acpi_nfit_desc *acpi_desc);
#ifdef CONFIG_X86_MCE
void nfit_mce_register(void);
void nfit_mce_unregister(void);
+bool is_ars_required(void);
#else
static inline void nfit_mce_register(void)
{
@@ -218,6 +219,10 @@ static inline void nfit_mce_register(void)
static inline void nfit_mce_unregister(void)
{
}
+static inline bool is_ars_required(void)
+{
+ return true;
+}
#endif
int nfit_spa_type(struct acpi_nfit_system_address *spa);
WARN_ON_ONCE() during generic/270
by Ross Zwisler
I hit the following WARN_ON_ONCE() during generic/270 with xfs (passed through
kasan_symbolize.py):
run fstests generic/270 at 2017-02-08 10:56:07
XFS (pmem0p2): Unmounting Filesystem
XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0p2): Mounting V5 Filesystem
XFS (pmem0p2): Ending clean mount
XFS (pmem0p2): Quotacheck needed: Please wait.
XFS (pmem0p2): Quotacheck: Done.
XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
------------[ cut here ]------------
WARNING: CPU: 7 PID: 23652 at fs/xfs/libxfs/xfs_bmap.c:5981 [< none >] xfs_bmse_shift_one+0x3da/0x4c0 fs/xfs/libxfs/xfs_bmap.c:5981
Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm
CPU: 4 PID: 23652 Comm: 23288.fsstress. Not tainted 4.10.0-rc7-00065-g926af627 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[< none >] dump_stack+0x86/0xc3 lib/dump_stack.c:51
[< none >] __warn+0xcb/0xf0 kernel/panic.c:547
[< none >] warn_slowpath_null+0x1d/0x20 kernel/panic.c:582
[< none >] xfs_bmse_shift_one+0x3da/0x4c0 fs/xfs/libxfs/xfs_bmap.c:5981
[< none >] xfs_bmap_shift_extents+0x305/0x490 fs/xfs/libxfs/xfs_bmap.c:6144
[< none >] xfs_shift_file_space+0x25f/0x320 fs/xfs/xfs_bmap_util.c:1475
[< none >] xfs_insert_file_space+0x5a/0x180 fs/xfs/xfs_bmap_util.c:1548
[< none >] xfs_file_fallocate+0x34c/0x3b0 fs/xfs/xfs_file.c:844
?[< none >] rcu_sync_lockdep_assert+0x2f/0x60 kernel/rcu/sync.c:68
[< none >] vfs_fallocate+0x15a/0x230 fs/open.c:320
[< inline >] SYSC_fallocate fs/open.c:343
[< none >] SyS_fallocate+0x48/0x80 fs/open.c:337
[< none >] entry_SYSCALL_64_fastpath+0x1f/0xc2 /home/rzwisler/project/linux/arch/x86/entry/entry_64.S:204
RIP: 0033:0x7f34dc4ff0ca
RSP: 002b:00007ffcffa58058 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
RAX: ffffffffffffffda RBX: 0000000000000166 RCX: 00007f34dc4ff0ca
RDX: 00000000000ba000 RSI: 0000000000000020 RDI: 0000000000000003
RBP: 0000000000000003 R08: 000000000000007b R09: 00007ffcffa5807c
R10: 00000000000bc000 R11: 0000000000000246 R12: 00007f34d8000de0
R13: 00000000ffffffff R14: 000000000000af4a R15: 0000000000000000
---[ end trace e24f5d4cbfc216f6 ]---
This trace is with the current linux/master:
commit 926af6273fc6 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
though I initially hit this issue with a v4.9 kernel. My test setup is a pair
of PMEM ramdisks made with the memmap command line parameter, and my
test filesystem is mounted with DAX.
This can be reproduced pretty easily by just running generic/270 in a
loop.
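For example, from an xfstests checkout with TEST_DEV/SCRATCH_DEV pointed at
the two pmem partitions (config names assumed), something like:
# while ./check generic/270; do :; done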
Thanks,
- Ross
[PATCH] mm: replace FAULT_FLAG_SIZE with parameter to huge_fault
by Dave Jiang
Since the introduction of the FAULT_FLAG_SIZE flags to vm_fault, it has
been somewhat painful to get the flags set and cleared at the correct
locations; more than one kernel oops was introduced by getting the
placement wrong. Remove the flag values and instead pass huge_fault an
input parameter that indicates the size of the page entry. This makes the
code easier to trace and should avoid the issues we saw with the fault
flags, where the flag had to be removed in the fallback paths.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
drivers/dax/dax.c | 18 ++++++++++++------
fs/dax.c | 9 +++++----
fs/ext2/file.c | 2 +-
fs/ext4/file.c | 12 +++++++++---
fs/xfs/xfs_file.c | 9 +++++----
include/linux/dax.h | 3 ++-
include/linux/mm.h | 14 ++++++++------
mm/memory.c | 17 ++++-------------
8 files changed, 46 insertions(+), 38 deletions(-)
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index b90bb30..b75c772 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -538,7 +538,8 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
}
#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static int dax_dev_fault(struct vm_fault *vmf)
+static int dax_dev_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size)
{
int rc;
struct file *filp = vmf->vma->vm_file;
@@ -550,14 +551,14 @@ static int dax_dev_fault(struct vm_fault *vmf)
vmf->vma->vm_start, vmf->vma->vm_end);
rcu_read_lock();
- switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
- case FAULT_FLAG_SIZE_PTE:
+ switch (pe_size) {
+ case PE_SIZE_PTE:
rc = __dax_dev_pte_fault(dax_dev, vmf);
break;
- case FAULT_FLAG_SIZE_PMD:
+ case PE_SIZE_PMD:
rc = __dax_dev_pmd_fault(dax_dev, vmf);
break;
- case FAULT_FLAG_SIZE_PUD:
+ case PE_SIZE_PUD:
rc = __dax_dev_pud_fault(dax_dev, vmf);
break;
default:
@@ -568,9 +569,14 @@ static int dax_dev_fault(struct vm_fault *vmf)
return rc;
}
+static int dax_dev_fault(struct vm_fault *vmf)
+{
+ return dax_dev_huge_fault(vmf, PE_SIZE_PTE);
+}
+
static const struct vm_operations_struct dax_dev_vm_ops = {
.fault = dax_dev_fault,
- .huge_fault = dax_dev_fault,
+ .huge_fault = dax_dev_huge_fault,
};
static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/fs/dax.c b/fs/dax.c
index 25f791d..97b8ecb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1446,12 +1446,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, struct iomap_ops *ops)
* has done all the necessary locking for page fault to proceed
* successfully.
*/
-int dax_iomap_fault(struct vm_fault *vmf, struct iomap_ops *ops)
+int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ struct iomap_ops *ops)
{
- switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
- case FAULT_FLAG_SIZE_PTE:
+ switch (pe_size) {
+ case PE_SIZE_PTE:
return dax_iomap_pte_fault(vmf, ops);
- case FAULT_FLAG_SIZE_PMD:
+ case PE_SIZE_PMD:
return dax_iomap_pmd_fault(vmf, ops);
default:
return VM_FAULT_FALLBACK;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 6873883..b21891a 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_fault *vmf)
}
down_read(&ei->dax_sem);
- ret = dax_iomap_fault(vmf, &ext2_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &ext2_iomap_ops);
up_read(&ei->dax_sem);
if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 51d7155..e8ab46e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -255,7 +255,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
}
#ifdef CONFIG_FS_DAX
-static int ext4_dax_fault(struct vm_fault *vmf)
+static int ext4_dax_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size)
{
int result;
struct inode *inode = file_inode(vmf->vma->vm_file);
@@ -267,7 +268,7 @@ static int ext4_dax_fault(struct vm_fault *vmf)
file_update_time(vmf->vma->vm_file);
}
down_read(&EXT4_I(inode)->i_mmap_sem);
- result = dax_iomap_fault(vmf, &ext4_iomap_ops);
+ result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
up_read(&EXT4_I(inode)->i_mmap_sem);
if (write)
sb_end_pagefault(sb);
@@ -275,6 +276,11 @@ static int ext4_dax_fault(struct vm_fault *vmf)
return result;
}
+static int ext4_dax_fault(struct vm_fault *vmf)
+{
+ return ext4_dax_huge_fault(vmf, PE_SIZE_PTE);
+}
+
/*
* Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_fault()
* handler we check for races agaist truncate. Note that since we cycle through
@@ -307,7 +313,7 @@ static int ext4_dax_pfn_mkwrite(struct vm_fault *vmf)
static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
- .huge_fault = ext4_dax_fault,
+ .huge_fault = ext4_dax_huge_fault,
.page_mkwrite = ext4_dax_fault,
.pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4fe261..c37f435 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1385,7 +1385,7 @@ xfs_filemap_page_mkwrite(
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
- ret = dax_iomap_fault(vmf, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
} else {
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
ret = block_page_mkwrite_return(ret);
@@ -1412,7 +1412,7 @@ xfs_filemap_fault(
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode))
- ret = dax_iomap_fault(vmf, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
else
ret = filemap_fault(vmf);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1429,7 +1429,8 @@ xfs_filemap_fault(
*/
STATIC int
xfs_filemap_huge_fault(
- struct vm_fault *vmf)
+ struct vm_fault *vmf,
+ enum page_entry_size pe_size)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
struct xfs_inode *ip = XFS_I(inode);
@@ -1446,7 +1447,7 @@ xfs_filemap_huge_fault(
}
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- ret = dax_iomap_fault(vmf, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3bfa26..df63730 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,7 +38,8 @@ static inline void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
struct iomap_ops *ops);
-int dax_iomap_fault(struct vm_fault *vmf, struct iomap_ops *ops);
+int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ struct iomap_ops *ops);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7646ae5..7b11431 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -285,11 +285,6 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
-#define FAULT_FLAG_SIZE_MASK 0x7000 /* Support up to 8-level page tables */
-#define FAULT_FLAG_SIZE_PTE 0x0000 /* First level (eg 4k) */
-#define FAULT_FLAG_SIZE_PMD 0x1000 /* Second level (eg 2MB) */
-#define FAULT_FLAG_SIZE_PUD 0x2000 /* Third level (eg 1GB) */
-
#define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
{ FAULT_FLAG_MKWRITE, "MKWRITE" }, \
@@ -349,6 +344,13 @@ struct vm_fault {
*/
};
+/* page entry size for vm->huge_fault() */
+enum page_entry_size {
+ PE_SIZE_PTE = 0,
+ PE_SIZE_PMD,
+ PE_SIZE_PUD,
+};
+
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -359,7 +361,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
int (*mremap)(struct vm_area_struct * area);
int (*fault)(struct vm_fault *vmf);
- int (*huge_fault)(struct vm_fault *vmf);
+ int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size);
void (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
diff --git a/mm/memory.c b/mm/memory.c
index 41e2a2d..6040b74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3489,7 +3489,7 @@ static int create_huge_pmd(struct vm_fault *vmf)
if (vma_is_anonymous(vmf->vma))
return do_huge_pmd_anonymous_page(vmf);
if (vmf->vma->vm_ops->huge_fault)
- return vmf->vma->vm_ops->huge_fault(vmf);
+ return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
return VM_FAULT_FALLBACK;
}
@@ -3498,7 +3498,7 @@ static int wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
if (vma_is_anonymous(vmf->vma))
return do_huge_pmd_wp_page(vmf, orig_pmd);
if (vmf->vma->vm_ops->huge_fault)
- return vmf->vma->vm_ops->huge_fault(vmf);
+ return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
/* COW handled on pte level: split pmd */
VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
@@ -3519,7 +3519,7 @@ static int create_huge_pud(struct vm_fault *vmf)
if (vma_is_anonymous(vmf->vma))
return VM_FAULT_FALLBACK;
if (vmf->vma->vm_ops->huge_fault)
- return vmf->vma->vm_ops->huge_fault(vmf);
+ return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
return VM_FAULT_FALLBACK;
}
@@ -3531,7 +3531,7 @@ static int wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
if (vma_is_anonymous(vmf->vma))
return VM_FAULT_FALLBACK;
if (vmf->vma->vm_ops->huge_fault)
- return vmf->vma->vm_ops->huge_fault(vmf);
+ return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
return VM_FAULT_FALLBACK;
}
@@ -3659,7 +3659,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (!vmf.pud)
return VM_FAULT_OOM;
if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
- vmf.flags |= FAULT_FLAG_SIZE_PUD;
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -3670,8 +3669,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
unsigned int dirty = flags & FAULT_FLAG_WRITE;
- vmf.flags |= FAULT_FLAG_SIZE_PUD;
-
/* NUMA case for anonymous PUDs would go here */
if (dirty && !pud_write(orig_pud)) {
@@ -3689,18 +3686,14 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (!vmf.pmd)
return VM_FAULT_OOM;
if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
- vmf.flags |= FAULT_FLAG_SIZE_PMD;
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
- /* fall through path, remove PMD flag */
- vmf.flags &= ~FAULT_FLAG_SIZE_PMD;
} else {
pmd_t orig_pmd = *vmf.pmd;
barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
- vmf.flags |= FAULT_FLAG_SIZE_PMD;
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);
@@ -3709,8 +3702,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
ret = wp_huge_pmd(&vmf, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
- /* fall through path, remove PUD flag */
- vmf.flags &= ~FAULT_FLAG_SIZE_PUD;
} else {
huge_pmd_set_accessed(&vmf, orig_pmd);
return 0;
fix write synchronization for DAX
by Christoph Hellwig
While I've fixed both ext4 and XFS to not incorrectly allow parallel
writers when mounting with -o dax, ext4 still has this issue after the
iomap conversion.
Patch 1 fixes it, and patch 2 adds a lockdep assert to catch any new
file systems copying and pasting from the direct I/O path.
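For reference, the assert is of this general shape (a sketch of the idea,
not necessarily the exact hunk):
/* in the DAX write path: writers must hold the inode lock exclusively */
if (iov_iter_rw(iter) == WRITE)
        lockdep_assert_held(&inode->i_rwsem);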
[-mm PATCH] mm: fix get_user_pages() vs device-dax pud mappings
by Dan Williams
A new unit test for the device-dax 1GB enabling currently fails with
this warning before hanging the test thread:
WARNING: CPU: 0 PID: 21 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1e3/0x1f0
percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
[..]
CPU: 0 PID: 21 Comm: rcuos/1 Tainted: G O 4.10.0-rc7-next-20170207+ #944
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
? rcu_nocb_kthread+0x27a/0x510
? dax_pmem_percpu_exit+0x50/0x50 [dax_pmem]
percpu_ref_switch_to_atomic_rcu+0x1e3/0x1f0
? percpu_ref_exit+0x60/0x60
rcu_nocb_kthread+0x339/0x510
? rcu_nocb_kthread+0x27a/0x510
kthread+0x101/0x140
The get_user_pages() path needs to arrange for references to be taken
against the dev_pagemap instance backing the pud mapping. Refactor the
existing __gup_device_huge_pmd() to also account for the pud case.
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Matthew Wilcox <mawilcox(a)microsoft.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury(a)oracle.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/x86/mm/gup.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 0d4fb3ebbbac..99c7805a9693 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -154,14 +154,12 @@ static inline void get_head_page_multiple(struct page *page, int nr)
SetPageReferenced(page);
}
-static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+static int __gup_device_huge(unsigned long pfn, unsigned long addr,
unsigned long end, struct page **pages, int *nr)
{
int nr_start = *nr;
- unsigned long pfn = pmd_pfn(pmd);
struct dev_pagemap *pgmap = NULL;
- pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
do {
struct page *page = pfn_to_page(pfn);
@@ -180,6 +178,24 @@ static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
return 1;
}
+static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+ unsigned long end, struct page **pages, int *nr)
+{
+ unsigned long fault_pfn;
+
+ fault_pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ return __gup_device_huge(fault_pfn, addr, end, pages, nr);
+}
+
+static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
+ unsigned long end, struct page **pages, int *nr)
+{
+ unsigned long fault_pfn;
+
+ fault_pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ return __gup_device_huge(fault_pfn, addr, end, pages, nr);
+}
+
static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
@@ -251,9 +267,13 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
if (!pte_allows_gup(pud_val(pud), write))
return 0;
+
+ VM_BUG_ON(!pfn_valid(pud_pfn(pud)));
+ if (pud_devmap(pud))
+ return __gup_device_huge_pud(pud, addr, end, pages, nr);
+
/* hugepages are never "special" */
VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);
- VM_BUG_ON(!pfn_valid(pud_pfn(pud)));
refs = 0;
head = pud_page(pud);
[GIT PULL] libnvdimm fixes for 4.10
by Dan Williams
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive:
* Fix a crash that can result when SIGINT is sent to a process that is
awaiting completion of an address range scrub command. We were not
properly cleaning up the workqueue after wait_event_interruptible().
* Fix a memory hotplug failure condition that results from not
reserving enough space out of persistent memory for the memmap. By
default we align to 2M allocations that the memory hotplug code
assumes, but if the administrator specifies a non-default 4K-alignment
then we can fail to correctly size the reservation.
* A one line fix to improve the predictability of libnvdimm block
device names. A common operation is to reconfigure /dev/pmem0 into a
different mode. For example, a reconfiguration might set a new mode
that reserves some of the capacity for a struct page memmap array. It
surprises users if the device name changes to "/dev/pmem0.1" after the
mode change and then back to /dev/pmem0 after a reboot.
* Add 'const' to some function pointer tables.
None of these are showstoppers for 4.10 and could wait for the 4.11 merge
window, but they are low enough risk for this late in the cycle, and the
fixes have waiting users. They have received a build success notification
from the 0day robot, pass the latest ndctl unit tests, and appeared in
next-20170206.
The following changes since commit 7a308bb3016f57e5be11a677d15b821536419d36:
Linux 4.10-rc5 (2017-01-22 12:54:15 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to bfb34527a32a1a576d9bfb7026d3ab0369a6cd60:
libnvdimm, pfn: fix memmap reservation size versus 4K alignment (2017-02-04 14:47:31 -0800)
----------------------------------------------------------------
Bhumika Goyal (1):
nvdimm: constify device_type structures
Dan Williams (3):
libnvdimm, namespace: do not delete namespace-id 0
acpi, nfit: fix acpi_nfit_flush_probe() crash
libnvdimm, pfn: fix memmap reservation size versus 4K alignment
drivers/acpi/nfit/core.c | 6 +++++-
drivers/nvdimm/namespace_devs.c | 17 ++++++++++-------
drivers/nvdimm/pfn_devs.c | 7 ++-----
3 files changed, 17 insertions(+), 13 deletions(-)
commit 970d14e3989160ee9e97c7d75ecbc893fd29dab9
Author: Bhumika Goyal <bhumirks(a)gmail.com>
Date: Wed Jan 25 00:54:07 2017 +0530
nvdimm: constify device_type structures
Declare device_type structure as const as it is only stored in the
type field of a device structure. This field is of type const, so add
const to declaration of device_type structure.
File size before:
text data bss dec hex filename
19278 3199 16 22493 57dd nvdimm/namespace_devs.o
File size after:
text data bss dec hex filename
19929 3160 16 23105 5a41 nvdimm/namespace_devs.o
Signed-off-by: Bhumika Goyal <bhumirks(a)gmail.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 9d032f4201d39e5cf43a8709a047e481f5723fdc
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed Jan 25 00:54:07 2017 +0530
libnvdimm, namespace: do not delete namespace-id 0
Given that the naming of pmem devices changes from the pmemX form to the
pmemX.Y form when namespace id is greater than 0, arrange for namespaces
with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
of an existing namespace to a new mode results in a name change of the
resulting block device:
# ndctl list --namespace=namespace1.0
{
"dev":"namespace1.0",
"mode":"raw",
"size":2147483648,
"uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
"blockdev":"pmem1"
}
# ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
{
"dev":"namespace1.1",
"mode":"memory",
"size":2111832064,
"uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
"blockdev":"pmem1.1"
}
This change does require tooling changes to explicitly look for
namespaceX.0 if the seed has already advanced to another namespace.
Cc: <stable(a)vger.kernel.org>
Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Reviewed-by: Johannes Thumshirn <jthumshirn(a)suse.de>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit e471486c13b82b1338d49c798f78bb62b1ed0a9e
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Thu Feb 2 10:31:00 2017 -0800
acpi, nfit: fix acpi_nfit_flush_probe() crash
We queue an on-stack work item to 'nfit_wq' and wait for it to complete
as part of a 'flush_probe' request. However, if the user cancels the
wait we need to make sure the item is flushed from the queue otherwise
we are leaving an out-of-scope stack address on the work list.
BUG: unable to handle kernel paging request at ffffbcb3c72f7cd0
IP: [<ffffffffa9413a7b>] __list_add+0x1b/0xb0
[..]
RIP: 0010:[<ffffffffa9413a7b>] [<ffffffffa9413a7b>] __list_add+0x1b/0xb0
RSP: 0018:ffffbcb3c7ba7c00 EFLAGS: 00010046
[..]
Call Trace:
[<ffffffffa90bb11a>] insert_work+0x3a/0xc0
[<ffffffffa927fdda>] ? seq_open+0x5a/0xa0
[<ffffffffa90bb30a>] __queue_work+0x16a/0x460
[<ffffffffa90bbb08>] queue_work_on+0x38/0x40
[<ffffffffc0cf2685>] acpi_nfit_flush_probe+0x95/0xc0 [nfit]
[<ffffffffc0cf25d0>] ? nfit_visible+0x40/0x40 [nfit]
[<ffffffffa9571495>] wait_probe_show+0x25/0x60
[<ffffffffa9546b30>] dev_attr_show+0x20/0x50
Fixes: 7ae0fa439faf ("nfit, libnvdimm: async region scrub workqueue")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit bfb34527a32a1a576d9bfb7026d3ab0369a6cd60
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Sat Feb 4 14:47:31 2017 -0800
libnvdimm, pfn: fix memmap reservation size versus 4K alignment
When vmemmap_populate() allocates space for the memmap it does so in 2MB
sized chunks. The libnvdimm-pfn driver incorrectly accounts for this
when the alignment of the device is set to 4K. When this happens we
trigger memory allocation failures in altmap_alloc_block_buf() and
trigger warnings of the form:
WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
arch_add_memory+0xe4/0xf0
devm_memremap_pages+0x29b/0x4e0
Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>