[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates the NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from the reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch set applies on top of the linux-pm.git acpica branch.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
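To illustrate the byte-array handling, the new "id" attribute can be
formatted byte-wise along these lines (a sketch based on the
post-138a95547ab0 struct acpi_nfit_control_region; to_nfit_dcr() is a
driver helper, and this is not the verbatim patch):

static ssize_t id_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);

	/* SPD values are byte arrays, so print them byte-wise
	 * (big-endian) rather than byte-swapping to CPU order. */
	if (dcr->valid_fields & ACPI_NFIT_CONTROL_MFG_INFO_VALID)
		return sprintf(buf, "%04x-%02x-%04x-%08x\n",
				be16_to_cpu(dcr->vendor_id),
				dcr->manufacturing_location,
				be16_to_cpu(dcr->manufacturing_date),
				be32_to_cpu(dcr->serial_number));

	return sprintf(buf, "%04x-%08x\n",
			be16_to_cpu(dcr->vendor_id),
			be32_to_cpu(dcr->serial_number));
}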
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try to get some consensus on the best way to move this forward. Allowing devices to talk directly to each other improves performance and reduces latency by avoiding the use of staging buffers in system memory, and in cases where both devices are behind a switch it avoids the CPU entirely.

Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and get to a physical address, taking IOMMUs etc. into account. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that want to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution; a rough code sketch follows the list.
- Device memory will be registered with the Linux memory subsystem by creating corresponding struct page structures for it
- get_user_pages_fast() will return the corresponding struct pages when a CPU address points to device memory
- put_page() will deal with struct pages for device memory
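A minimal sketch of the above, assuming ZONE_DEVICE-style struct page
backing and the current (4.x-era) devm_memremap_pages() signature; the
helper and its name are hypothetical:

#include <linux/memremap.h>
#include <linux/pci.h>
#include <linux/percpu-refcount.h>

/* Hypothetical helper: back a PCIe BAR with struct pages so the core
 * mm can treat device memory like system RAM.  Once this mapping
 * exists, CPU virtual addresses covering the BAR resolve to real
 * struct pages, so get_user_pages_fast() and put_page() work on
 * device memory unmodified. */
static void *p2p_register_bar(struct pci_dev *pdev, int bar,
			      struct percpu_ref *ref)
{
	struct resource *res = &pdev->resource[bar];

	return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}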
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: Requires explicit support in all drivers that want to support it; doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata.
Cons: CPU access to ZONE metadata is slow, and it may be lost or corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based and requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
[PATCH 0/6] introduce DAX tracepoint support
by Ross Zwisler
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. This series creates a tracepoint header for FS DAX and adds
the first few DAX tracepoints to the PMD fault handler. This allows the
tracing for DAX to be done in the same way as the filesystem tracing so
that developers can look at them together and get a coherent idea of what
the system is doing.
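For readers unfamiliar with the machinery, a DAX tracepoint ends up
looking roughly like this (a minimal, hypothetical event; the real
definitions are the ones added in patch 3):

#undef TRACE_SYSTEM
#define TRACE_SYSTEM fs_dax

#if !defined(_TRACE_FS_DAX_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_FS_DAX_H

#include <linux/tracepoint.h>

/* Hypothetical example event, not one of the events from this series */
TRACE_EVENT(dax_example_fault,
	TP_PROTO(struct inode *inode, unsigned long address),
	TP_ARGS(inode, address),
	TP_STRUCT__entry(
		__field(unsigned long, ino)
		__field(unsigned long, address)
	),
	TP_fast_assign(
		__entry->ino = inode->i_ino;
		__entry->address = address;
	),
	TP_printk("ino %#lx address %#lx", __entry->ino, __entry->address)
);

#endif /* _TRACE_FS_DAX_H */
#include <trace/define_trace.h>

Once defined, a call site just invokes trace_dax_example_fault(inode,
address), and the event can be enabled at runtime through tracefs like
any other tracepoint.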
I do intend to add tracepoints to the normal 4k DAX fault path and to the
DAX I/O path, but I wanted to get feedback on the PMD tracepoints before I
went any further.
This series is based on Jan Kara's "dax: Clear dirty bits after flushing
caches" series:
https://lists.01.org/pipermail/linux-nvdimm/2016-November/007864.html
I've pushed a git tree with this work here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax...
Ross Zwisler (6):
dax: fix build breakage with ext4, dax and !iomap
dax: remove leading space from labels
dax: add tracepoint infrastructure, PMD tracing
dax: update MAINTAINERS entries for FS DAX
dax: add tracepoints to dax_pmd_load_hole()
dax: add tracepoints to dax_pmd_insert_mapping()
MAINTAINERS | 4 +-
fs/Kconfig | 1 +
fs/dax.c | 78 ++++++++++++++----------
fs/ext2/Kconfig | 1 -
include/linux/mm.h | 14 +++++
include/linux/pfn_t.h | 6 ++
include/trace/events/fs_dax.h | 135 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 206 insertions(+), 33 deletions(-)
create mode 100644 include/trace/events/fs_dax.h
--
2.7.4
[PATCH v2 0/4] Write protect DAX PMDs in *sync path
by Ross Zwisler
Currently dax_mapping_entry_mkclean() fails to clean and write protect the
pmd_t of a DAX PMD entry during an *sync operation. This can result in
data loss, as detailed in patch 4.
You can find a working tree here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax...
This series applies cleanly to mmotm-2016-12-19-16-31.
Changes since v1:
- Included Dan's patch to kill DAX support for UML.
- Instead of wrapping the DAX PMD code in dax_mapping_entry_mkclean() in
an #ifdef, we now create a stub for pmdp_huge_clear_flush() for the case
when CONFIG_TRANSPARENT_HUGEPAGE isn't defined (see the sketch below).
(Dan & Jan)
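The stub in question is tiny; roughly (a sketch of the usual
asm-generic pattern, not a verbatim copy of patch 2):

/* include/asm-generic/pgtable.h */
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
					   unsigned long address,
					   pmd_t *pmdp)
{
	/* All callers are compiled out without THP; catch any
	 * misuse at build time. */
	BUILD_BUG();
	return *pmdp;
}
#endif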
Dan Williams (1):
dax: kill uml support
Ross Zwisler (3):
dax: add stub for pmdp_huge_clear_flush()
mm: add follow_pte_pmd()
dax: wrprotect pmd_t in dax_mapping_entry_mkclean
fs/Kconfig | 2 +-
fs/dax.c | 49 ++++++++++++++++++++++++++++++-------------
include/asm-generic/pgtable.h | 10 +++++++++
include/linux/mm.h | 4 ++--
mm/memory.c | 41 ++++++++++++++++++++++++++++--------
5 files changed, 79 insertions(+), 27 deletions(-)
--
2.7.4
[PATCH v5 1/2] mm, dax: make pmd_fault() and friends be the same as fault()
by Dave Jiang
Instead of passing in multiple parameters to the pmd_fault() handler,
a vmf can be passed in just as it is for a fault() handler. This
simplifies the code and removes the need for the pmd fault handlers to
allocate a vmf themselves. Related functions are modified to do the same.
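For reference, these are the vm_fault fields the converted handlers
rely on (abridged from include/linux/mm.h of this era; remaining
fields omitted):

struct vm_fault {
	struct vm_area_struct *vma;	/* Target VMA */
	unsigned int flags;		/* FAULT_FLAG_xxx flags */
	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
	pgoff_t pgoff;			/* Logical page offset based on vma */
	unsigned long address;		/* Faulting virtual address */
	pmd_t *pmd;			/* Pointer to pmd entry matching
					 * the 'address' */
	/* ... */
};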
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
---
drivers/dax/dax.c | 16 +++++++---------
fs/dax.c | 42 ++++++++++++++++++-----------------------
fs/ext4/file.c | 9 ++++-----
fs/xfs/xfs_file.c | 10 ++++------
include/linux/dax.h | 7 +++----
include/linux/mm.h | 3 +--
include/trace/events/fs_dax.h | 15 +++++++--------
mm/memory.c | 6 ++----
8 files changed, 46 insertions(+), 62 deletions(-)
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index c753a4c..947e49a 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -379,10 +379,9 @@ static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
- struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd,
- unsigned int flags)
+ struct vm_area_struct *vma, struct vm_fault *vmf)
{
- unsigned long pmd_addr = addr & PMD_MASK;
+ unsigned long pmd_addr = vmf->address & PMD_MASK;
struct device *dev = &dax_dev->dev;
struct dax_region *dax_region;
phys_addr_t phys;
@@ -414,23 +413,22 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
- return vmf_insert_pfn_pmd(vma, addr, pmd, pfn,
- flags & FAULT_FLAG_WRITE);
+ return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+ vmf->flags & FAULT_FLAG_WRITE);
}
-static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, unsigned int flags)
+static int dax_dev_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int rc;
struct file *filp = vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
dev_dbg(&dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
- current->comm, (flags & FAULT_FLAG_WRITE)
+ current->comm, (vmf->flags & FAULT_FLAG_WRITE)
? "write" : "read", vma->vm_start, vma->vm_end);
rcu_read_lock();
- rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
+ rc = __dax_dev_pmd_fault(dax_dev, vma, vmf);
rcu_read_unlock();
return rc;
diff --git a/fs/dax.c b/fs/dax.c
index d3fe880..446e861 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1310,18 +1310,17 @@ static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
return VM_FAULT_FALLBACK;
}
-int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ struct iomap_ops *ops)
{
struct address_space *mapping = vma->vm_file->f_mapping;
- unsigned long pmd_addr = address & PMD_MASK;
- bool write = flags & FAULT_FLAG_WRITE;
+ unsigned long pmd_addr = vmf->address & PMD_MASK;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
unsigned int iomap_flags = (write ? IOMAP_WRITE : 0) | IOMAP_FAULT;
struct inode *inode = mapping->host;
int result = VM_FAULT_FALLBACK;
struct iomap iomap = { 0 };
- pgoff_t max_pgoff, pgoff;
- struct vm_fault vmf;
+ pgoff_t max_pgoff;
void *entry;
loff_t pos;
int error;
@@ -1331,10 +1330,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
- pgoff = linear_page_index(vma, pmd_addr);
+ vmf->pgoff = linear_page_index(vma, pmd_addr);
max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
- trace_dax_pmd_fault(inode, vma, address, flags, pgoff, max_pgoff, 0);
+ trace_dax_pmd_fault(inode, vma, vmf, max_pgoff, 0);
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
@@ -1346,13 +1345,13 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if ((pmd_addr + PMD_SIZE) > vma->vm_end)
goto fallback;
- if (pgoff > max_pgoff) {
+ if (vmf->pgoff > max_pgoff) {
result = VM_FAULT_SIGBUS;
goto out;
}
/* If the PMD would extend beyond the file size */
- if ((pgoff | PG_PMD_COLOUR) > max_pgoff)
+ if ((vmf->pgoff | PG_PMD_COLOUR) > max_pgoff)
goto fallback;
/*
@@ -1360,7 +1359,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
* setting up a mapping, so really we're using iomap_begin() as a way
* to look up our filesystem block.
*/
- pos = (loff_t)pgoff << PAGE_SHIFT;
+ pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
if (error)
goto fallback;
@@ -1370,28 +1369,24 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
* the tree, for instance), it will return -EEXIST and we just fall
* back to 4k entries.
*/
- entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+ entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PMD);
if (IS_ERR(entry))
goto finish_iomap;
if (iomap.offset + iomap.length < pos + PMD_SIZE)
goto unlock_entry;
- vmf.pgoff = pgoff;
- vmf.flags = flags;
- vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
-
switch (iomap.type) {
case IOMAP_MAPPED:
- result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
- &iomap, pos, write, &entry);
+ result = dax_pmd_insert_mapping(vma, vmf->pmd, vmf,
+ vmf->address, &iomap, pos, write, &entry);
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (WARN_ON_ONCE(write))
goto unlock_entry;
- result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
- &entry);
+ result = dax_pmd_load_hole(vma, vmf->pmd, vmf, vmf->address,
+ &iomap, &entry);
break;
default:
WARN_ON_ONCE(1);
@@ -1399,7 +1394,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}
unlock_entry:
- put_locked_mapping_entry(mapping, pgoff, entry);
+ put_locked_mapping_entry(mapping, vmf->pgoff, entry);
finish_iomap:
if (ops->iomap_end) {
int copied = PMD_SIZE;
@@ -1417,12 +1412,11 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}
fallback:
if (result == VM_FAULT_FALLBACK) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, vmf->pmd, vmf->address);
count_vm_event(THP_FAULT_FALLBACK);
}
out:
- trace_dax_pmd_fault_done(inode, vma, address, flags, pgoff, max_pgoff,
- result);
+ trace_dax_pmd_fault_done(inode, vma, vmf, max_pgoff, result);
return result;
}
EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d663d3d..10b64ba 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -275,21 +275,20 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return result;
}
-static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, unsigned int flags)
+static int
+ext4_dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
struct inode *inode = file_inode(vma->vm_file);
struct super_block *sb = inode->i_sb;
- bool write = flags & FAULT_FLAG_WRITE;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
}
down_read(&EXT4_I(inode)->i_mmap_sem);
- result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
- &ext4_iomap_ops);
+ result = dax_iomap_pmd_fault(vma, vmf, &ext4_iomap_ops);
up_read(&EXT4_I(inode)->i_mmap_sem);
if (write)
sb_end_pagefault(sb);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d818c16..4f65a9d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1526,9 +1526,7 @@ xfs_filemap_fault(
STATIC int
xfs_filemap_pmd_fault(
struct vm_area_struct *vma,
- unsigned long addr,
- pmd_t *pmd,
- unsigned int flags)
+ struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
struct xfs_inode *ip = XFS_I(inode);
@@ -1539,16 +1537,16 @@ xfs_filemap_pmd_fault(
trace_xfs_filemap_pmd_fault(ip);
- if (flags & FAULT_FLAG_WRITE) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
}
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
+ ret = dax_iomap_pmd_fault(vma, vmf, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- if (flags & FAULT_FLAG_WRITE)
+ if (vmf->flags & FAULT_FLAG_WRITE)
sb_end_pagefault(inode->i_sb);
return ret;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6e36b11..9761c90 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -71,16 +71,15 @@ static inline unsigned int dax_radix_order(void *entry)
return PMD_SHIFT - PAGE_SHIFT;
return 0;
}
-int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ struct iomap_ops *ops);
#else
static inline unsigned int dax_radix_order(void *entry)
{
return 0;
}
static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags,
- struct iomap_ops *ops)
+ struct vm_fault *vmf, struct iomap_ops *ops)
{
return VM_FAULT_FALLBACK;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 30f416a..aef645b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -347,8 +347,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
int (*mremap)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
- int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
- pmd_t *, unsigned int flags);
+ int (*pmd_fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
void (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index c3b0aae..a98665b 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -8,9 +8,8 @@
DECLARE_EVENT_CLASS(dax_pmd_fault_class,
TP_PROTO(struct inode *inode, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags, pgoff_t pgoff,
- pgoff_t max_pgoff, int result),
- TP_ARGS(inode, vma, address, flags, pgoff, max_pgoff, result),
+ struct vm_fault *vmf, pgoff_t max_pgoff, int result),
+ TP_ARGS(inode, vma, vmf, max_pgoff, result),
TP_STRUCT__entry(
__field(unsigned long, ino)
__field(unsigned long, vm_start)
@@ -29,9 +28,9 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
__entry->vm_start = vma->vm_start;
__entry->vm_end = vma->vm_end;
__entry->vm_flags = vma->vm_flags;
- __entry->address = address;
- __entry->flags = flags;
- __entry->pgoff = pgoff;
+ __entry->address = vmf->address;
+ __entry->flags = vmf->flags;
+ __entry->pgoff = vmf->pgoff;
__entry->max_pgoff = max_pgoff;
__entry->result = result;
),
@@ -54,9 +53,9 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
#define DEFINE_PMD_FAULT_EVENT(name) \
DEFINE_EVENT(dax_pmd_fault_class, name, \
TP_PROTO(struct inode *inode, struct vm_area_struct *vma, \
- unsigned long address, unsigned int flags, pgoff_t pgoff, \
+ struct vm_fault *vmf, \
pgoff_t max_pgoff, int result), \
- TP_ARGS(inode, vma, address, flags, pgoff, max_pgoff, result))
+ TP_ARGS(inode, vma, vmf, max_pgoff, result))
DEFINE_PMD_FAULT_EVENT(dax_pmd_fault);
DEFINE_PMD_FAULT_EVENT(dax_pmd_fault_done);
diff --git a/mm/memory.c b/mm/memory.c
index e37250f..8ec36cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3447,8 +3447,7 @@ static int create_huge_pmd(struct vm_fault *vmf)
if (vma_is_anonymous(vma))
return do_huge_pmd_anonymous_page(vmf);
if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, vmf->address, vmf->pmd,
- vmf->flags);
+ return vma->vm_ops->pmd_fault(vma, vmf);
return VM_FAULT_FALLBACK;
}
@@ -3457,8 +3456,7 @@ static int wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
if (vma_is_anonymous(vmf->vma))
return do_huge_pmd_wp_page(vmf, orig_pmd);
if (vmf->vma->vm_ops->pmd_fault)
- return vmf->vma->vm_ops->pmd_fault(vmf->vma, vmf->address,
- vmf->pmd, vmf->flags);
+ return vmf->vma->vm_ops->pmd_fault(vmf->vma, vmf);
/* COW handled on pte level: split pmd */
VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
LTP rwtest01 blocks on DAX mountpoint
by Xiong Zhou
Hi lists,
Since around the 20161129 tag, LTP rwtest01 on a DAX mountpoint blocks
on the linux-next tree, and now on the Linus tree.
Normally, the rwtest01 subcase ends in a few minutes; now it keeps
running for hours on a DAX mountpoint, on both ext4 and xfs. Ctrl-C
can interrupt it.
It is always reproducible and blocks the tests that follow.
It does not happen when mounting without the dax option.
It does not happen on v4.9.
Bisect points to:
commit 4b4bb46d00b386e1c972890dc5785a7966eaa9c0
Author: Jan Kara <jack(a)suse.cz>
Date: Wed Dec 14 15:07:53 2016 -0800
dax: clear dirty entry tags on cache flush
Reverting this commit on top of the Linus tree "fixes" this issue.
Reproducer:
sh-4.2# cat rwt
rwtest01 export LTPROOT; rwtest -N rwtest01 -c -q -i 60s -f sync 10%25000:$TMPDIR/rw-sync-$$
sh-4.2#
mkfs.xfs /dev/pmem0p1
mount -o dax /dev/pmem0p1 /daxmnt && \
/opt/ltp/runltp -q -d /daxmnt -f rwt -p -b /dev/pmem0p2 -B xfs
umount /daxmnt
Bisect log is attached.
Thanks,
Xiong
multi-threads libvmmalloc fork test hang
by Xiong Zhou
# description
The nvml test suite vmmalloc_fork test hangs.
$ ps -eo stat,comm | grep vmma
S+ vmmalloc_fork
Sl+ vmmalloc_fork
Z+ vmmalloc_fork <defunct>
Sl+ vmmalloc_fork
Z+ vmmalloc_fork <defunct>
Z+ vmmalloc_fork <defunct>
Sl+ vmmalloc_fork
Z+ vmmalloc_fork <defunct>
Z+ vmmalloc_fork <defunct>
Z+ vmmalloc_fork <defunct>
dmesg:
[ 250.499097] INFO: task vmmalloc_fork:9805 blocked for more than 120 seconds.
[ 250.530667] Not tainted 4.9.09fe68ca+ #27
[ 250.550901] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 250.585752] vmmalloc_fork D
[ 250.598362] ffffffff8171813c 0 9805 9765 0x00000080
[ 250.623445] ffff88075dc68f80
[ 250.636052] 0000000000000000 ffff88076058db00 ffff88017c5b0000 ffff880763b19340
[ 250.668510] ffffc9000fe1bbb0 ffffffff8171813c ffffc9000fe1bc20 ffffc9000fe1bbe0
[ 250.704220] ffffffff82248898 ffff88076058db00 ffffffff82248898
Call Trace:
[ 250.738382] [<ffffffff8171813c>] ? __schedule+0x21c/0x6a0
[ 250.763404] [<ffffffff817185f6>] schedule+0x36/0x80
[ 250.786177] [<ffffffff81284471>] get_unlocked_mapping_entry+0xc1/0x120
[ 250.815869] [<ffffffff81283810>] ? iomap_dax_rw+0x110/0x110
[ 250.841350] [<ffffffff81284c0a>] grab_mapping_entry+0x4a/0x220
[ 250.868442] [<ffffffff812851e9>] iomap_dax_fault+0xa9/0x3b0
[ 250.894437] [<ffffffffa02b15fe>] xfs_filemap_fault+0xce/0xf0 [xfs]
[ 250.922805] [<ffffffff811d3159>] __do_fault+0x79/0x100
[ 250.947035] [<ffffffff811d7a2b>] do_fault+0x49b/0x690
[ 250.970964] [<ffffffffa02b146c>] ? xfs_filemap_pmd_fault+0x9c/0x160 [xfs]
[ 251.001812] [<ffffffff811d94ba>] handle_mm_fault+0x61a/0xa50
[ 251.027736] [<ffffffff8106c3da>] __do_page_fault+0x22a/0x4a0
[ 251.053700] [<ffffffff8106c680>] do_page_fault+0x30/0x80
[ 251.077962] [<ffffffff81003b55>] ? do_syscall_64+0x175/0x180
[ 251.103835] [<ffffffff8171e208>] page_fault+0x28/0x30
# kernel versions:
v4.6: pass (in seconds)
v4.7: hang
v4.9-rc1: hang
Linus' tree up to commit 9fe68ca: hang
bisect points to
first bad commit: [ac401cc782429cc8560ce4840b1405d603740917] dax: New fault locking
v4.7 passes with these 3 commits reverted:
4d9a2c8 - Jan Kara, 6 months ago : dax: Remove i_mmap_lock protection
bc2466e - Jan Kara, 6 months ago : dax: Use radix tree entry lock to protect cow faults
ac401cc - Jan Kara, 6 months ago : dax: New fault locking
# nvml version:
https://github.com/pmem/nvml.git
to commit:
feab4d6f65102139ce460890c898fcad09ce20ae
# How reproducible:
always
# Test steps:
<git clone and pmem0 setup>
$cd nvml
$make install -j64
$cat > src/test/testconfig.sh <<EOF
PMEM_FS_DIR=/daxmnt
NON_PMEM_FS_DIR=/tmp
EOF
$mkfs.xfs /dev/pmem0
$mkdir -p /daxmnt/
$mount -o dax /dev/pmem0 /daxmnt/
$make -C src/test/vmmalloc_fork/ TEST_TIME=60m clean
$make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
$umount /daxmnt
[PATCH v2 0/3] use nocache copy in copy_from_iter_nocache()
by Brian Boylston
Currently, copy_from_iter_nocache() uses "nocache" copies only for
iovecs; bvecs and kvecs use normal copies. This requires
x86's arch_copy_from_iter_pmem() to issue flushes for bvecs and kvecs,
which has a negative impact on performance when splice()ing from a pipe
to a pmem-backed file on a DAX-mounted file system.
This patch set enables nocache copies in copy_from_iter_nocache() for
bvecs and kvecs for arches that support it (x86 initially). This provides
a 2-3X improvement in splice() pipe-to-DAX-file throughput.
The first patch introduces memcpy_nocache(), which defaults to just
memcpy(), but for which an x86-specific implementation is provided.
For this patch, I sought to use a static inline function for x86, but
I could not find an obvious header file to put it in.
The build seemed to work when I put it in arch/x86/include/asm/uaccess.h,
but that didn't feel completely right. I also tried
arch/x86/include/asm/pmem.h, but that doesn't feel right either and it
didn't build. So, I offer it here in arch/x86/lib/misc.c for discussion.
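Roughly, the proposed shape is as follows (a sketch of this series, not
the exact diff):

/* include/linux/string.h: generic fallback */
#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
static inline void *memcpy_nocache(void *dst, const void *src, size_t cnt)
{
	return memcpy(dst, src, cnt);
}
#endif

/* arch/x86/lib/misc.c: x86 override (string_64.h would define
 * __HAVE_ARCH_MEMCPY_NOCACHE and declare the prototype).  The
 * non-temporal copy bypasses the cache, so no flush is needed
 * afterwards. */
void *memcpy_nocache(void *dst, const void *src, size_t cnt)
{
	__copy_from_user_inatomic_nocache(dst,
			(__force const void __user *)src, cnt);
	return dst;
}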
The second patch updates copy_from_iter_nocache() to use the new
memcpy_nocache().
The third patch removes the flushes from x86's arch_copy_from_iter_pmem().
For testing, I ran fio with the posixaio, mmap, sync, psync, vsync, pvsync,
and splice engines, against both ext4 and xfs. Only the splice engine
showed any change in performance. For example, for xfs:
Unpatched 4.8:
Run status group 2 (all jobs):
WRITE: io=37602MB, aggrb=641724KB/s, minb=641724KB/s, maxb=641724KB/s, mint=60001msec, maxt=60001msec
Run status group 3 (all jobs):
WRITE: io=36244MB, aggrb=618553KB/s, minb=618553KB/s, maxb=618553KB/s, mint=60001msec, maxt=60001msec
With this patch set:
Run status group 2 (all jobs):
WRITE: io=128055MB, aggrb=2134.3MB/s, minb=2134.3MB/s, maxb=2134.3MB/s, mint=60001msec, maxt=60001msec
Run status group 3 (all jobs):
WRITE: io=122586MB, aggrb=2043.8MB/s, minb=2043.8MB/s, maxb=2043.8MB/s, mint=60001msec, maxt=60001msec
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: <x86(a)kernel.org>
Cc: Al Viro <viro(a)ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Brian Boylston <brian.boylston(a)hpe.com>
Reviewed-by: Toshi Kani <toshi.kani(a)hpe.com>
Reported-by: Oliver Moreno <oliver.moreno(a)hpe.com>
Changes in v2:
- Split into multiple patches (Toshi Kani)
- Introduce memcpy_nocache() (Al Viro)
- Use nocache for kvecs as well
Brian Boylston (3):
introduce memcpy_nocache()
use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
x86: remove unneeded flush in arch_copy_from_iter_pmem()
arch/x86/include/asm/pmem.h | 19 +------------------
arch/x86/include/asm/string_32.h | 3 +++
arch/x86/include/asm/string_64.h | 3 +++
arch/x86/lib/misc.c | 12 ++++++++++++
include/linux/string.h | 15 +++++++++++++++
lib/iov_iter.c | 14 +++++++++++---
6 files changed, 45 insertions(+), 21 deletions(-)
--
2.8.3
[PATCH] x86: fix kaslr and memmap collision
by Dave Jiang
CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
However, it does not take into account the memmap= parameter passed in
on the kernel command line. This results in the kernel sometimes being
placed in the middle of a user-specified memmap region. A check has been
added to the KASLR code so that it avoids the region marked by memmap=.
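For example (hypothetical values): booting with memmap=4G!8G marks the
range [8G, 12G) as persistent memory. With this patch, parse_memmap()
extracts size=4G and start=8G from the command line, the range is
recorded in mem_avoid[], and the randomized kernel base will no longer
be chosen inside it.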
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
arch/x86/boot/boot.h | 2 ++
arch/x86/boot/compressed/kaslr.c | 45 ++++++++++++++++++++++++++++++++++++++
arch/x86/boot/string.c | 25 +++++++++++++++++++++
3 files changed, 72 insertions(+)
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index e5612f3..0d5fe5b 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -332,6 +332,8 @@ int strncmp(const char *cs, const char *ct, size_t count);
size_t strnlen(const char *s, size_t maxlen);
unsigned int atou(const char *s);
unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
+unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
+long simple_strtol(const char *cp, char **endp, unsigned int base);
size_t strlen(const char *s);
/* tty.c */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index a66854d..6fb8f1ec 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -11,6 +11,7 @@
*/
#include "misc.h"
#include "error.h"
+#include "../boot.h"
#include <generated/compile.h>
#include <linux/module.h>
@@ -61,6 +62,7 @@ enum mem_avoid_index {
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
+ MEM_AVOID_MEMMAP,
MEM_AVOID_MAX,
};
@@ -77,6 +79,37 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
return true;
}
+#include "../../../../lib/cmdline.c"
+
+static int
+parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
+{
+ char *oldp;
+
+ if (!p)
+ return -EINVAL;
+
+ /* we don't care about this option here */
+ if (!strncmp(p, "exactmap", 8))
+ return -EINVAL;
+
+ oldp = p;
+ *size = memparse(p, &p);
+ if (p == oldp)
+ return -EINVAL;
+
+ switch (*p) {
+ case '@':
+ case '#':
+ case '$':
+ case '!':
+ *start = memparse(p+1, &p);
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
/*
* In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
* The mem_avoid array is used to store the ranges that need to be avoided
@@ -158,6 +191,8 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
u64 initrd_start, initrd_size;
u64 cmd_line, cmd_line_size;
char *ptr;
+ char arg[38];
+ unsigned long long memmap_start, memmap_size;
/*
* Avoid the region that is unsafe to overlap during
@@ -195,6 +230,16 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
mem_avoid[MEM_AVOID_BOOTPARAMS].size);
+ /* see if we have any memmap areas */
+ if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
+ int rc = parse_memmap(arg, &memmap_start, &memmap_size);
+
+ if (!rc) {
+ mem_avoid[MEM_AVOID_MEMMAP].start = memmap_start;
+ mem_avoid[MEM_AVOID_MEMMAP].size = memmap_size;
+ }
+ }
+
/* We don't need to set a mapping for setup_data. */
#ifdef CONFIG_X86_VERBOSE_BOOTUP
diff --git a/arch/x86/boot/string.c b/arch/x86/boot/string.c
index cc3bd58..7a376c1 100644
--- a/arch/x86/boot/string.c
+++ b/arch/x86/boot/string.c
@@ -122,6 +122,31 @@ unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int bas
}
/**
+ * simple_strtoul - convert a string to an unsigned long
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base)
+{
+ return simple_strtoull(cp, endp, base);
+}
+
+/**
+ * simple_strtol - convert a string to a signed long
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+long simple_strtol(const char *cp, char **endp, unsigned int base)
+{
+ if (*cp == '-')
+ return -simple_strtoul(cp + 1, endp, base);
+
+ return simple_strtoul(cp, endp, base);
+}
+
+/**
* strlen - Find the length of a string
* @s: The string to be sized
*/