[PATCH V2 1/1] device-dax: check for vma range while dax_mmap.
by Zhang Yi
This patch prevents a user mapping an illegal vma range that is larger
than a dax device physical resource.
When qemu maps the dax device for virtual nvdimm's backend device, the
v-nvdimm label area is defined at the end of mapped range. By using an
illegal size that exceeds the range of the device dax, it will trigger a
fault with qemu.
Signed-off-by: Zhang Yi <yi.z.zhang(a)linux.intel.com>
---
drivers/dax/device.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 108c37f..6fe8c30 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -177,6 +177,33 @@ static const struct attribute_group *dax_attribute_groups[] = {
NULL,
};
+static int check_vma_range(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+ const char *func)
+{
+ struct device *dev = &dev_dax->dev;
+ struct resource *res;
+ unsigned long size;
+ int ret, i;
+
+ if (!dax_alive(dev_dax->dax_dev))
+ return -ENXIO;
+
+ size = vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT);
+ ret = -EINVAL;
+ for (i = 0; i < dev_dax->num_resources; i++) {
+ res = &dev_dax->res[i];
+ if (size > resource_size(res)) {
+ dev_info_ratelimited(dev,
+ "%s: %s: fail, vma range overflow\n",
+ current->comm, func);
+ ret = -EINVAL;
+ continue;
+ } else
+ return 0;
+ }
+ return ret;
+}
+
static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
const char *func)
{
@@ -469,6 +496,8 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
*/
id = dax_read_lock();
rc = check_vma(dev_dax, vma, __func__);
+ if (!rc)
+ rc = check_vma_range(dev_dax, vma, __func__);
dax_read_unlock(id);
if (rc)
return rc;
--
2.7.4
3 years, 6 months
Snapshot target and DAX-capable devices
by Jan Kara
Hi,
I've been analyzing why fstest generic/081 fails when the backing device is
capable of DAX. The problem boils down to the failure of:
lvm vgcreate -f vg0 /dev/pmem0
lvm lvcreate -L 128M -n lv0 vg0
lvm lvcreate -s -L 4M -n snap0 vg0/lv0
The last command fails like:
device-mapper: reload ioctl on (253:0) failed: Invalid argument
Failed to lock logical volume vg0/lv0.
Aborting. Manual intervention required.
And the core of the problem is that volume vg0/lv0 is originally of
DM_TYPE_DAX_BIO_BASED type but when the snapshot gets created, we try to
switch it to DM_TYPE_BIO_BASED because now the device stops supporting DAX.
The problem seems to be introduced by Ross' commit dbc626597 "dm: prevent
DAX mounts if not supported".
The question is whether / how this should be fixed. The current inability
to create snapshots of DAX-capable devices looks weird and the cryptic
failure makes it even worse (it took me quite a while to understand what is
failing and why). OTOH I see the rationale behind Ross' change as well.
Honza
--
Jan Kara <jack(a)suse.com>
SUSE Labs, CR
3 years, 6 months
[ndctl PATCH v13 0/5] ndctl, monitor: add ndctl monitor daemon
by QI Fuli
This is the v13 patch for ndctl monitor, a tiny daemon to monitor
the smart events of nvdimm DIMMs. Since NVDIMM does not have a
feature like mirroring, if it breaks down, the data will be
impossible to restore. Ndctl monitor daemon will catch the smart
events notify from firmware and outputs notification to logfile,
therefore users can replace NVDIMM before it is completely broken.
Signed-off-by: QI Fuli <qi.fuli(a)jp.fujitsu.com>
---
Change log since v12:
- Fixing log_fn() for removing output new line
- Fixing hard code default configuration file path
- Fixing RPM spec file for configuration file and systemd unit file
- Fixing man page
Change log since v11:
- Adding log_standard()
- Adding [-u | --human] option
- Fixing man page
- Refactoring unit test
- Updating configuration file and systemd unit file to RPM spec file
Change log since v10:
- Adding unit test
- Adding fflush to log_file()
Change log since v9:
- Replacing ndctl_cmd_smart_get_event_flags() with
ndctl_dimm_get_event_flags()
- Adding ndctl_dimm_get_health() api
- Adding ndctl_dimm_get_flags() api
- Adding ndctl_dimm_is_flag_supported api
- Adding manpage
Change log since v8:
- Adding ndctl_cmd_smart_get_event_flags() api
- Adding monitor_filter_arg to the union in util_filter_ctx
- Removing is_dir()
- Replacing malloc + vsprintf with vasprintf() in log_file() and log_syslog()
- Adding parse_monitor_event()
- Refactoring util_dimm_event_filter()
- Adding event_flags to monitor
- Refactoring dimm_event_to_json()
- Adding check_dimm_supported_threshold_alarms()
- Fixing fail token
Change log since v7:
- Replacing logreport() with log_file() and log_syslog()
- Refactoring read_config_file()
- Replacing set_confile() with parse_config()
- Fixing the ndctl/ndct.conf file
Change log since v6:
- Changing License to GPL-2.0
- Adding event object to output notification
- Adding [--dimm-event] option to filter notification by event type
- Rewriting read_config_file()
- Replacing monitor_dimm_event() with monitor_event()
- Renaming some variables
Change log since v5:
- Fixing systemd unit file cannot be installed bug
- Adding license to ./util/abspath.c
Change log since v4:
- Adding OPTION_FILENAME to make sure filename is correct
- Adding configuration file
- Adding [--config-file] option to override the default configuration
- Making some options support multiple space-seperated arguments
- Making systemctl enable ndctl-monitor.service command work
- Making systemctl restart ndctl-monitor.service command work
- Making the directory of systemd unit file to be configurable
- Changing log_file() and log_syslog() to logreport()
- Changing date format in notification to nanoseconds since epoch
- Changing select() to epoll()
- Adding filter_bus() and filter_region()
Change log since v3:
- Removing create-monitor, show-monitor, list-monitor, destroy-monitor
- Adding [--daemon] option to run ndctl monitor as a daemon
- Using systemd to manage ndctl monitor daemon
- Replacing filter_monitor_dimm() with filter_dimm()
Change log since v2:
- Changing the interface of daemon to the ndctl command line
- Changing the name of daemon form "nvdimmd" to "monitor"
- Removing the config file, unit_file, nvdimmd dir
- Removing nvdimmd_test program
- Adding ndctl/monitor.c
Change log since v1:
- Adding a config file(/etc/nvdimmd/nvdimmd.conf)
- Using struct log_ctx instead of syslog()
- Using log_syslog() to save the notify messages to syslog
- Using log_file() to save the notify messages to special file
- Adding LOG_NOTICE level to log_priority
- Using automake instead of Makefile
- Adding a new util file(nvdimmd/util.c) including helper functions
needed for nvdimm daemon
- Adding nvdimmd_test program
QI Fuli (5):
ndctl, monitor: add a new command - monitor
ndctl, monitor: add main ndctl monitor configuration file
ndctl, monitor: add the unit file of systemd for ndctl-monitor service
ndctl, documentation: add man page for monitor
ndctl, test: add a new unit test for monitor
.gitignore | 1 +
Documentation/ndctl/Makefile.am | 3 +-
Documentation/ndctl/ndctl-monitor.txt | 108 +++++
autogen.sh | 3 +-
builtin.h | 1 +
configure.ac | 23 +
ndctl.spec.in | 3 +
ndctl/Makefile.am | 12 +-
ndctl/lib/libndctl.c | 82 ++++
ndctl/lib/libndctl.sym | 4 +
ndctl/libndctl.h | 10 +
ndctl/monitor.c | 650 ++++++++++++++++++++++++++
ndctl/monitor.conf | 41 ++
ndctl/ndctl-monitor.service | 7 +
ndctl/ndctl.c | 1 +
test/Makefile.am | 14 +-
test/list-smart-dimm.c | 117 +++++
test/monitor.sh | 176 +++++++
util/filter.h | 9 +
19 files changed, 1260 insertions(+), 5 deletions(-)
create mode 100644 Documentation/ndctl/ndctl-monitor.txt
create mode 100644 ndctl/monitor.c
create mode 100644 ndctl/monitor.conf
create mode 100644 ndctl/ndctl-monitor.service
create mode 100644 test/list-smart-dimm.c
create mode 100755 test/monitor.sh
--
2.18.0
3 years, 7 months
[PATCH 0/3] kvm "fake DAX" device
by Pankaj Gupta
This patch series has implementation for "fake DAX".
"fake DAX" is fake persistent memory(nvdimm) in guest
which allows to bypass the guest page cache. This also
implements a VIRTIO based asynchronous flush mechanism.
Sharing guest driver and qemu device changes in separate
patch sets for easy review and it has been tested together.
Details of project idea for 'fake DAX' flushing interface
is shared [2] & [3].
Implementation is divided into two parts:
New virtio pmem guest driver and qemu code changes for new
virtio pmem paravirtualized device.
1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- 'nvdimm/pmem' driver uses this information to allocate
persistent memory region and setup filesystem operations
to the allocated memory.
- virtio pmem driver implements asynchronous flushing
interface to flush from guest to host.
2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- At host side this is file backed memory which acts as
persistent memory.
- Qemu side flush uses aio thread pool API's and virtio
for asynchronous guest multi request handling.
David Hildenbrand CCed also posted a modified version[4] of
qemu virtio-pmem code based on updated Qemu memory device API.
Virtio-pmem errors handling:
----------------------------------------
Checked behaviour of virtio-pmem for below types of errors
Need suggestions on expected behaviour for handling these errors?
- Hardware Errors: Uncorrectable recoverable Errors:
a] virtio-pmem:
- As per current logic if error page belongs to Qemu process,
host MCE handler isolates(hwpoison) that page and send SIGBUS.
Qemu SIGBUS handler injects exception to KVM guest.
- KVM guest then isolates the page and send SIGBUS to guest
userspace process which has mapped the page.
b] Existing implementation for ACPI pmem driver:
- Handles such errors with MCE notifier and creates a list
of bad blocks. Read/direct access DAX operation return EIO
if accessed memory page fall in bad block list.
- It also starts backgound scrubbing.
- Similar functionality can be reused in virtio-pmem with MCE
notifier but without scrubbing(no ACPI/ARS)? Need inputs to
confirm if this behaviour is ok or needs any change?
Changes from RFC v3: [1]
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush- Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak,correct sizeof req- Stefan
- Move declaration to virtio_pmem.c
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Pankaj Gupta (3):
nd: move nd_region to common header
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
[1] https://lkml.org/lkml/2018/7/13/102
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html
[4] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
drivers/acpi/nfit/core.c | 7 -
drivers/nvdimm/claim.c | 3
drivers/nvdimm/nd.h | 39 -----
drivers/nvdimm/pmem.c | 12 +
drivers/nvdimm/region_devs.c | 12 +
drivers/virtio/Kconfig | 9 +
drivers/virtio/Makefile | 1
drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 4
include/linux/nd.h | 40 ++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 40 ++++++
12 files changed, 374 insertions(+), 49 deletions(-)
3 years, 9 months
[PATCH v5 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
by Dan Williams
Changes since v4 [1]:
* Rework dax_lock_page() to reuse get_unlocked_mapping_entry() (Jan)
* Change the calling convention to take a 'struct page *' and return
success / failure instead of performing the pfn_to_page() internal to
the api (Jan, Ross).
* Rename dax_lock_page() to dax_lock_mapping_entry() (Jan)
* Account for the case that a given pfn can be fsdax mapped with
different sizes in different vmas (Jan)
* Update collect_procs() to determine the mapping size of the pfn for
each page given it can be variable in the dax case.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016279.html
---
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable for
dax mappings, however the primary motivation is to allow the system to
survive userspace consumption of hardware-poison via dax. Specifically
the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks.
---
Dan Williams (11):
device-dax: Convert to vmf_insert_mixed and vm_fault_t
device-dax: Enable page_mapping()
device-dax: Set page->index
filesystem-dax: Set page->index
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, memory_failure: Collect mapping size in collect_procs()
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses
x86/memory_failure: Introduce {set,clear}_mce_nospec()
libnvdimm, pmem: Restore page attributes when clearing errors
arch/x86/include/asm/set_memory.h | 42 ++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 --
arch/x86/kernel/cpu/mcheck/mce.c | 38 -----
arch/x86/mm/pat.c | 16 ++
drivers/dax/device.c | 75 +++++++----
drivers/nvdimm/pmem.c | 26 ++++
drivers/nvdimm/pmem.h | 13 ++
fs/dax.c | 125 +++++++++++++++++-
include/linux/dax.h | 24 +++
include/linux/huge_mm.h | 5 -
include/linux/mm.h | 1
include/linux/set_memory.h | 14 ++
mm/huge_memory.c | 4 -
mm/madvise.c | 18 ++-
mm/memory-failure.c | 201 +++++++++++++++++++++++------
15 files changed, 483 insertions(+), 134 deletions(-)
3 years, 9 months
[PATCH v8 00/12] Adding security support for nvdimm
by Dave Jiang
The following series implements security support for nvdimm. Mostly adding
new security DSM support from the Intel NVDIMM DSM spec v1.7, but also
adding generic support libnvdimm for other vendors. The most important
security features are unlocking locked nvdimms, and updating/setting security
passphrase to nvdimms.
Security folks, thanks in advance for taking a look at my key management
implementation and making sure that I'm doing something sane. Mainly you'll
want to review patches 2, 4, and 5 as most relevant ones that need scrutiny.
v8:
- Make the keys retained by the kernel user searchable in order to find the
key that needs to be updated for key update.
v7:
- Add CONFIG_KEYS depenency for libnvdimm. (Alison)
- Export lookup_user_key(). (David)
- Modified "update" to take two key ids and and use lookup_user_key() in
order to improve security. (David)
- Use key ptrs and key_validate() for cached keys. (David)
v6:
- Fix intel DSM data structures to use defined size for passphrase (Robert)
- Fix memcpy size to use sizeof data structure member (Robert)
- Fix defined dimm id length (Robert)
- Making intel_security_ops const (Eric)
- Remove unused var in nvdimm_key_search() (Eric)
- Added wbinvd before secure erase is issued (Robert)
- Removed key_put_sync() usage (David)
- Use init_cred instead of creating own cred (David)
- Exported init_cred symbol
- Move keyring to dedicated (David)
- Use logon_key_type and friends instead of creating custom (David)
- Use key_lookup() with stored key serial (David)
- Exported key_lookup() symbol
- Mark passed in key data as const (David)
- Added comment for change_pass_phrase to explain how it works (David)
- Unlink key when it's being removed from keyring. (David)
- Removed request_key() from all security ops except update and unlock.
- Update will now update the existing key's payload with the new key's
retrieved from userspace when the new payload is accepted by nvdimm.
v5:
- Moved dimm_id initialization (Dan)
- Added a key_put_sync() in order to run key_gc_work and cleanup old key. (Dan)
- Added check to block security state changes while DIMM is active. (Dan)
v4:
- flip payload layout for update passphrase to make it easier on userland.
v3:
- Set x86 wrappers for x86 only bits. (Dan)
- Fixed up some verbiage in commit headers.
- Put in usage of sysfs_streq() for sysfs inputs.
- 0-day build fixes for non-x86 archs.
v2:
- Move inclusion of intel.h to relevant source files and not in nfit.h. (Dan)
- Moved security ring relevant code to dimm_devs.c. (Dan)
- Added dimm_id to nfit_mem to avoid recreate per sysfs show call. (Dan)
- Added routine to return security_ops based on family supplied. (Dan)
- Added nvdimm_key_data struct to wrap raw passphrase string. (Dan)
- Allocate firmware package on stack. (Dan)
- Added missing frozen state detection when retrieving security state.
---
Dave Jiang (12):
nfit: add support for Intel DSM 1.7 commands
libnvdimm: create keyring to store security keys
nfit/libnvdimm: store dimm id as a member to struct nvdimm
keys: export lookup_user_key to external users
nfit/libnvdimm: add unlock of nvdimm support for Intel DIMMs
nfit/libnvdimm: add set passphrase support for Intel nvdimms
nfit/libnvdimm: add disable passphrase support to Intel nvdimm.
nfit/libnvdimm: add freeze security support to Intel nvdimm
nfit/libnvdimm: add support for issue secure erase DSM to Intel nvdimm
nfit_test: add context to dimm_dev for nfit_test
nfit_test: add test support for Intel nvdimm security DSMs
libnvdimm: add documentation for nvdimm security support
Documentation/nvdimm/security.txt | 82 ++++++
drivers/acpi/nfit/Makefile | 1
drivers/acpi/nfit/core.c | 58 +++-
drivers/acpi/nfit/intel.c | 382 +++++++++++++++++++++++++++
drivers/acpi/nfit/intel.h | 82 ++++++
drivers/acpi/nfit/nfit.h | 20 +
drivers/nvdimm/Kconfig | 1
drivers/nvdimm/bus.c | 2
drivers/nvdimm/core.c | 7
drivers/nvdimm/dimm.c | 7
drivers/nvdimm/dimm_devs.c | 529 +++++++++++++++++++++++++++++++++++++
drivers/nvdimm/nd-core.h | 6
drivers/nvdimm/nd.h | 2
include/linux/key.h | 3
include/linux/libnvdimm.h | 42 +++
kernel/cred.c | 1
security/keys/internal.h | 2
security/keys/process_keys.c | 1
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/nfit.c | 227 +++++++++++++++-
20 files changed, 1420 insertions(+), 36 deletions(-)
create mode 100644 Documentation/nvdimm/security.txt
create mode 100644 drivers/acpi/nfit/intel.c
create mode 100644 drivers/acpi/nfit/intel.h
--
3 years, 9 months
[PATCH V4 0/4] Fix kvm misconceives NVDIMM pages as reserved mmio
by Zhang Yi
For device specific memory space, when we move these area of pfn to
memory zone, we will set the page reserved flag at that time, some of
these reserved for device mmio, and some of these are not, such as
NVDIMM pmem.
Now, we map these dev_dax or fs_dax pages to kvm for DIMM/NVDIMM
backend, since these pages are reserved. the check of
kvm_is_reserved_pfn() misconceives those pages as MMIO. Therefor, we
introduce 2 page map types, MEMORY_DEVICE_FS_DAX/MEMORY_DEVICE_DEV_DAX,
to indentify these pages are from NVDIMM pmem. and let kvm treat these
as normal pages.
Without this patch, Many operations will be missed due to this
mistreatment to pmem pages. For example, a page may not have chance to
be unpinned for KVM guest(in kvm_release_pfn_clean); not able to be
marked as dirty/accessed(in kvm_set_pfn_dirty/accessed) etc.
V1:
https://lkml.org/lkml/2018/7/4/91
V2:
https://lkml.org/lkml/2018/7/10/135
V3:
https://lkml.org/lkml/2018/8/9/17
V4:
[PATCH V3 1/4] Added "Reviewed-by: David / Acked-by: Pankaj"
[PATCH V3 2/4] Added "Reviewed-by: Jan"
[PATCH V3 3/4] Added "Acked-by: Jan"
[PATCH V3 4/4] Fix several typos
Zhang Yi (4):
kvm: remove redundant reserved page check
mm: introduce memory type MEMORY_DEVICE_DEV_DAX
mm: add a function to differentiate the pages is from DAX device
memory
kvm: add a check if pfn is from NVDIMM pmem.
drivers/dax/pmem.c | 1 +
include/linux/memremap.h | 8 ++++++++
include/linux/mm.h | 12 ++++++++++++
virt/kvm/kvm_main.c | 16 ++++++++--------
4 files changed, 29 insertions(+), 8 deletions(-)
--
2.7.4
3 years, 9 months
[PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
by Ross Zwisler
Changes since v3:
* Added an ext4_break_layouts() call to ext4_insert_range() to ensure
that the {ext4,xfs}_break_layouts() calls have the same meaning.
(Dave, Darrick and Jan)
---
This series from Dan:
https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.
I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.
The timings in XFS are such that it's difficult to hit this race. Dan
was able to show the race by manually introducing delays in the direct
I/O path.
For ext4, though, its trivial to hit this race, and a hit will result in
a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(). Each of the four paths easily hits this race many
times in my test setup with the xfstest. You can find that test here:
https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
continue looking at these more rare hits.
Ross Zwisler (2):
dax: dax_layout_busy_page() warn on !exceptional
ext4: handle layout changes to pinned DAX mappings
fs/dax.c | 10 +++++++++-
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 17 +++++++++++++++++
fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/truncate.h | 4 ++++
5 files changed, 77 insertions(+), 1 deletion(-)
--
2.14.4
3 years, 9 months
[PATCH v2 1/2] ext4: Close race between direct IO and ext4_break_layouts()
by Dave Jiang
From: Ross Zwisler <zwisler(a)kernel.org>
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called. This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.
Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
---
v2:
- remove verbiage in comment header (Jan)
fs/ext4/inode.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8f6ad7667974..d2663a1e3ec2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4191,9 +4191,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
return 0;
}
-static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+static void ext4_wait_dax_page(struct ext4_inode_info *ei)
{
- *did_unlock = true;
up_write(&ei->i_mmap_sem);
schedule();
down_write(&ei->i_mmap_sem);
@@ -4203,14 +4202,12 @@ int ext4_break_layouts(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
struct page *page;
- bool retry;
int error;
if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
return -EINVAL;
do {
- retry = false;
page = dax_layout_busy_page(inode->i_mapping);
if (!page)
return 0;
@@ -4218,8 +4215,8 @@ int ext4_break_layouts(struct inode *inode)
error = ___wait_var_event(&page->_refcount,
atomic_read(&page->_refcount) == 1,
TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(ei, &retry));
- } while (error == 0 && retry);
+ ext4_wait_dax_page(ei));
+ } while (error == 0);
return error;
}
3 years, 9 months
[PATCH v5 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory
by Logan Gunthorpe
Hi Everyone,
Now that the patchset which creates a command line option to disable
ACS redirection has landed it's time to revisit the P2P patchset for
copy offoad in NVMe fabrics.
I present version 5 wihch no longer does any magic with the ACS bits and
instead will reject P2P transactions between devices that would be affected
by them. A few other cleanups were done which are described in the
changelog below.
This version is based on v4.19-rc1 and a git repo is here:
https://github.com/sbates130272/linux-p2pmem pci-p2p-v5
Thanks,
Logan
--
Changes in v5:
* Rebased on v4.19-rc1
* Drop changing ACS settings in this patchset. Now, the code
will only allow P2P transactions between devices whos
downstream ports do not restrict P2P TLPs.
* Drop the REQ_PCI_P2PDMA block flag and instead use
is_pci_p2pdma_page() to tell if a request is P2P or not. In that
case we check for queue support and enforce using REQ_NOMERGE.
Per feedback from Christoph.
* Drop the pci_p2pdma_unmap_sg() function as it was empty and only
there for symmetry and compatibility with dma_unmap_sg. Per feedback
from Christoph.
* Split off the logic to handle enabling P2P in NVMe fabrics' configfs
into specific helpers in the p2pdma code. Per feedback from Christoph.
* A number of other minor cleanups and fixes as pointed out by
Christoph and others.
Changes in v4:
* Change the original upstream_bridges_match() function to
upstream_bridge_distance() which calculates the distance between two
devices as long as they are behind the same root port. This should
address Bjorn's concerns that the code was to focused on
being behind a single switch.
* The disable ACS function now disables ACS for all bridge ports instead
of switch ports (ie. those that had two upstream_bridge ports).
* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
API to be more like sgl_alloc() in that the alloc function returns
the allocated scatterlist and nents is not required bythe free
function.
* Moved the new documentation into the driver-api tree as requested
by Jonathan
* Add SGL alloc and free helpers in the nvmet code so that the
individual drivers can share the code that allocates P2P memory.
As requested by Christoph.
* Cleanup the nvmet_p2pmem_store() function as Christoph
thought my first attempt was ugly.
* Numerous commit message and comment fix-ups
Changes in v3:
* Many more fixes and minor cleanups that were spotted by Bjorn
* Additional explanation of the ACS change in both the commit message
and Kconfig doc. Also, the code that disables the ACS bits is surrounded
explicitly by an #ifdef
* Removed the flag we added to rdma_rw_ctx() in favour of using
is_pci_p2pdma_page(), as suggested by Sagi.
* Adjust pci_p2pmem_find() so that it prefers P2P providers that
are closest to (or the same as) the clients using them. In cases
of ties, the provider is randomly chosen.
* Modify the NVMe Target code so that the PCI device name of the provider
may be explicitly specified, bypassing the logic in pci_p2pmem_find().
(Note: it's still enforced that the provider must be behind the
same switch as the clients).
* As requested by Bjorn, added documentation for driver writers.
Changes in v2:
* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
as a bunch of cleanup and spelling fixes he pointed out in the last
series.
* To address Alex's ACS concerns, we change to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA.
* We also reject using devices that employ 'dma_virt_ops' which should
fairly simply handle Jason's concerns that this work might break with
the HFI, QIB and rxe drivers that use the virtual ops to implement
their own special DMA operations.
--
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU).
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This will mean many setups that
could likely work well will not be supported so that we can be more
confident it will work and not place any responsibility on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF). Future work may enable these transfers
using a white list of known good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
When the PCI P2PDMA config option is selected the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bit must be disabled
at boot so the IOMMU subsystem can correctly create the groups, though
this could be addressed in the future. There is no way to dynamically
disable the bit and alter the groups.
Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transactions and then
use that list to find any P2P memory that is supported by all the
client devices.
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged seeing a non-homogenous request would
complicate the DMA mapping requirements.
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function which applies the necessary transformation
to get the corrent pci_bus_addr_t for the DMA transactions.
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
that don't use the proper DMA infrastructure this code rejects using
any device that employs the virt_dma_ops implementation.
Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
These patches have been tested on a number of Intel based systems and
for a variety of RDMA NICs (Mellanox, Broadcomm, Chelsio) and NVMe
SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
Microsemi, Chelsio and Everspin) using switches from both Microsemi
and Broadcomm.
Logan Gunthorpe (13):
PCI/P2PDMA: Support peer-to-peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
docs-rst: Add a new directory for PCI documentation
PCI/P2PDMA: Add P2P DMA driver writer's documentation
block: Add PCI P2P flag for request queue and check support for
requests
IB/core: Ensure we map P2P memory correctly in
rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Introduce helper functions to allocate and free request SGLs
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 +
Documentation/driver-api/index.rst | 2 +-
Documentation/driver-api/pci/index.rst | 21 +
Documentation/driver-api/pci/p2pdma.rst | 170 ++++++
Documentation/driver-api/{ => pci}/pci.rst | 0
block/blk-core.c | 14 +
drivers/infiniband/core/rw.c | 11 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 121 ++--
drivers/nvme/target/configfs.c | 36 ++
drivers/nvme/target/core.c | 149 +++++
drivers/nvme/target/nvmet.h | 15 +
drivers/nvme/target/rdma.c | 22 +-
drivers/pci/Kconfig | 17 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 941 +++++++++++++++++++++++++++++
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 6 +
include/linux/mm.h | 18 +
include/linux/pci-p2pdma.h | 124 ++++
include/linux/pci.h | 4 +
22 files changed, 1658 insertions(+), 54 deletions(-)
create mode 100644 Documentation/driver-api/pci/index.rst
create mode 100644 Documentation/driver-api/pci/p2pdma.rst
rename Documentation/driver-api/{ => pci}/pci.rst (100%)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
--
2.11.0
3 years, 9 months