[PATCH v2 0/3] Add support for memcpy_mcsafe
by Balbir Singh
memcpy_mcsafe() is an API currently used by the pmem subsystem to convert
machine check exceptions raised while doing a memcpy into a return
value. This patchset consists of three patches:
1. The first patch is a bug fix to correctly handle machine check errors
while walking the page tables in kernel mode, taking huge pmd/pud sizes
into account
2. The second patch adds memcpy_mcsafe() support; this is largely derived
from existing code
3. The third patch registers for callbacks on machine check exceptions and,
in the callback, uses specialized knowledge of the type of page to decide whether
to handle the MCE as is or to return to a fixup address present in
memcpy_mcsafe(). If a fixup address is used, then we return an error
value of -EFAULT to the caller.
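To illustrate the calling convention (a hedged sketch only; the exact
signature comes from patch 2, and real callers map the error to their own
status codes):

    /* Sketch of a memcpy_mcsafe() caller: per this cover letter, a
     * machine check during the copy yields -EFAULT instead of taking
     * the kernel down. dst/src/len are placeholders. */
    if (memcpy_mcsafe(dst, src, len))
            return -EIO;    /* media error: fail the I/O, don't panic */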
Testing
A large part of the testing was done under a simulator by selectively
inserting machine check exceptions in a test driver doing memcpy_mcsafe
via ioctls.
Changelog v2
- Fix the logic of shifting in addr_to_pfn
- Use shift consistently instead of PAGE_SHIFT
- Fix a typo in patch1
Balbir Singh (3):
powerpc/mce: Bug fixes for MCE handling in kernel space
powerpc/memcpy: Add memcpy_mcsafe for pmem
powerpc/mce: Handle memcpy_mcsafe
arch/powerpc/include/asm/mce.h | 3 +-
arch/powerpc/include/asm/string.h | 2 +
arch/powerpc/kernel/mce.c | 77 ++++++++++++-
arch/powerpc/kernel/mce_power.c | 26 +++--
arch/powerpc/lib/Makefile | 2 +-
arch/powerpc/lib/memcpy_mcsafe_64.S | 212 ++++++++++++++++++++++++++++++++++++
6 files changed, 308 insertions(+), 14 deletions(-)
create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S
--
2.13.6
Re: Detecting NUMA per pmem
by Oren Berman
Hi Ross
Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.
We have tried to dump the SPA table and this is what we get:
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20160108-64
* Copyright (c) 2000 - 2016 Intel Corporation
*
* Disassembly of NFIT, Sun Oct 22 10:46:19 2017
*
* ACPI Data Table [NFIT]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/
[000h 0000 4] Signature : "NFIT" [NVDIMM Firmware Interface Table]
[004h 0004 4] Table Length : 00000028
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : B2
[00Ah 0010 6] Oem ID : "SUPERM"
[010h 0016 8] Oem Table ID : "SMCI--MB"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : " "
[020h 0032 4] Asl Compiler Revision : 00000001
[024h 0036 4] Reserved : 00000000
Raw Table Data: Length 40 (0x28)
0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D // NFIT(.....SUPERM
0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00 // SMCI--MB........
0020: 01 00 00 00 00 00 00 00
As you can see the memory region info is missing.
This specific check was done on a supermicro server.
We also performed a bios update but the results were the same.
As said before, the pmem devices are detected correctly and we verified
that they correspond to different NUMA nodes using the PCM utility. However,
Linux still reports both pmem devices to be on the same node, NUMA 0.
If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?
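For reference, this is how we read the kernel-assigned node from sysfs (a
sketch; the region paths follow the libnvdimm sysfs layout and may vary by
kernel version):

    # NUMA node the kernel associated with each libnvdimm region
    cat /sys/bus/nd/devices/region*/numa_node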
I also ran dmidecode and the NVDIMMs are being listed (we tested with
Netlist NVDIMMs). I can also see the bank locator showing P0 and P1, which I
think indicates the NUMA node. Here is an example:
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Did you encounter such a case? We would appreciate any insight you might
have.
BR
Oren Berman
On 20 October 2017 at 19:22, Ross Zwisler <ross.zwisler(a)linux.intel.com>
wrote:
> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> > Hi Ross
> > My name is Oren Berman and I am a senior developer at lightbitslabs.
> > We are working with NVDIMMs but we encountered a problem that the
> > kernel does not seem to detect the numa id per PMEM device.
> > It always reports numa 0 although we have NVDIMM devices on both
> > nodes.
> > We checked that it always returns 0 from sysfs and also from
> > retrieving the device of pmem in the kernel and calling dev_to_node.
> > The result is always 0 for both pmem0 and pmem1.
> > In order to make sure that indeed both numa sockets are used we ran
> > Intel's pcm utility. We verified that writing to pmem0 increases
> > socket 0 utilization and writing to pmem1 increases socket 1
> > utilization so the hw works properly.
> > Only the detection seems to be invalid.
> > Did you encounter such a problem?
> > We are using kernel version 4.9 - are you aware of any fix for this
> > issue or workaround that we can use.
> > Are we missing something?
> > Thanks for any help you can give us.
> > BR
> > Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
> # cd /tmp
> # cp /sys/firmware/acpi/tables/NFIT .
> # iasl NFIT
>
> Intel ACPI Component Architecture
> ASL+ Optimizing Compiler version 20160831-64
> Copyright (c) 2000 - 2016 Intel Corporation
>
> Binary file appears to be a valid ACPI table, disassembling
> Input file NFIT, Length 0xE0 (224) bytes
> ACPI: NFIT 0x0000000000000000 0000E0 (v01 BOCHS BXPCNFIT 00000001 BXPC 00000001)
> Acpi Data Table [NFIT] decoded
> Formatted output: NFIT.dsl - 5191 bytes
>
> This will give you an NFIT.dsl file which you can look at. Here is what my
> SPA table looks like for an emulated QEMU NVDIMM:
>
> [028h 0040 2] Subtable Type : 0000 [System Physical Address Range]
> [02Ah 0042 2] Length : 0038
>
> [02Ch 0044 2] Range Index : 0002
> [02Eh 0046 2] Flags (decoded below) : 0003
> Add/Online Operation Only : 1
> Proximity Domain Valid : 1
> [030h 0048 4] Reserved : 00000000
> [034h 0052 4] Proximity Domain : 00000000
> [038h 0056 16] Address Range GUID : 66F0D379-B4F3-4074-AC43-0D3318B78CDB
> [048h 0072 8] Address Range Base : 0000000240000000
> [050h 0080 8] Address Range Length : 0000000440000000
> [058h 0088 8] Memory Map Attribute : 0000000000008008
>
> So, the "Proximity Domain" field is 0, and this lets the system know which
> NUMA node to associate with this memory region.
>
> BTW, in the future it's best to CC our public list,
> linux-nvdimm(a)lists.01.org,
> as a) someone else might have the same question and b) someone else
> might know the answer.
>
> Thanks,
> - Ross
>
[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
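As a rough sketch of what the new "id" attribute contains (field names
follow struct acpi_nfit_control_region; the exact format string here is an
assumption, not a quote from the patch):

    /* Sketch: build the ACPI 6.1 NVDIMM ID from the byte-array
     * (big-endian) SPD fields of the NFIT control region. */
    snprintf(buf, PAGE_SIZE, "%04x-%02x-%04x-%08x",
             be16_to_cpu(dcr->vendor_id),
             dcr->manufacturing_location,
             be16_to_cpu(dcr->manufacturing_date),
             be32_to_cpu(dcr->serial_number));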
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
[RFC v2 0/2] kvm "fake DAX" device flushing
by Pankaj Gupta
This is RFC v2 of the 'fake DAX' flushing interface, shared
for review. This patchset has two main parts:
- Guest virtio-pmem driver
The guest driver reads the persistent memory range from the paravirt
device and registers it with 'nvdimm_bus'. The 'nvdimm/pmem'
driver uses this information to allocate the persistent
memory range. We have also implemented the guest side of the
VIRTIO flushing interface.
- Qemu virtio-pmem device
It exposes a persistent memory range to the KVM guest which,
on the host side, is file-backed memory and works as a persistent
memory device. In addition to this it provides virtio
device handling of the flushing interface: the KVM guest uses it
to request an asynchronous sync on the Qemu side (sketched below).
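A minimal sketch of that guest-side flush (names, request layout and
completion handling are placeholders, not the posted driver's API):

    /* Sketch: submit a flush request on the virtio-pmem virtqueue and
     * sleep until the host acknowledges the fsync of the backing file. */
    static int virtio_pmem_flush(struct virtio_pmem *vpmem)
    {
            struct scatterlist sg;
            int err;

            sg_init_one(&sg, vpmem->req, sizeof(*vpmem->req));
            err = virtqueue_add_outbuf(vpmem->req_vq, &sg, 1, vpmem->req,
                                       GFP_KERNEL);
            if (err)
                    return err;
            virtqueue_kick(vpmem->req_vq);
            wait_for_completion(&vpmem->host_ack);
            return vpmem->req->ret;
    }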
Changes from previous RFC[1]:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Details of the project idea for the 'fake DAX' flushing interface are
shared in [2] & [3].
Pankaj Gupta (2):
Add virtio-pmem guest driver
pmem: device flush over VIRTIO
[1] https://marc.info/?l=linux-mm&m=150782346802290&w=2
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html
drivers/nvdimm/region_devs.c | 7 ++
drivers/virtio/Kconfig | 12 +++
drivers/virtio/Makefile | 1
drivers/virtio/virtio_pmem.c | 118 +++++++++++++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 4 +
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 58 +++++++++++++++++++
7 files changed, 201 insertions(+)
[PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
by Logan Gunthorpe
Hi Everyone,
Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:
https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
Thanks,
Logan
Changes in v4:
* Change the original upstream_bridges_match() function to
upstream_bridge_distance() which calculates the distance between two
devices as long as they are behind the same root port. This should
address Bjorn's concerns that the code was too focused on
being behind a single switch.
* The disable ACS function now disables ACS for all bridge ports instead
of switch ports (i.e. those that had two upstream_bridge ports).
* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
API to be more like sgl_alloc() in that the alloc function returns
the allocated scatterlist and nents is not required by the free
function.
* Moved the new documentation into the driver-api tree as requested
by Jonathan
* Add SGL alloc and free helpers in the nvmet code so that the
individual drivers can share the code that allocates P2P memory.
As requested by Christoph.
* Cleanup the nvmet_p2pmem_store() function as Christoph
thought my first attempt was ugly.
* Numerous commit message and comment fix-ups
Changes in v3:
* Many more fixes and minor cleanups that were spotted by Bjorn
* Additional explanation of the ACS change in both the commit message
and Kconfig doc. Also, the code that disables the ACS bits is surrounded
explicitly by an #ifdef
* Removed the flag we added to rdma_rw_ctx() in favour of using
is_pci_p2pdma_page(), as suggested by Sagi.
* Adjust pci_p2pmem_find() so that it prefers P2P providers that
are closest to (or the same as) the clients using them. In cases
of ties, the provider is randomly chosen.
* Modify the NVMe Target code so that the PCI device name of the provider
may be explicitly specified, bypassing the logic in pci_p2pmem_find().
(Note: it's still enforced that the provider must be behind the
same switch as the clients).
* As requested by Bjorn, added documentation for driver writers.
Changes in v2:
* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
as a bunch of cleanup and spelling fixes he pointed out in the last
series.
* To address Alex's ACS concerns, we change to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA.
* We also reject using devices that employ 'dma_virt_ops' which should
fairly simply handle Jason's concerns that this work might break with
the HFI, QIB and rxe drivers that use the virtual ops to implement
their own special DMA operations.
--
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU).
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This will mean many setups that
could likely work well will not be supported so that we can be more
confident it will work and not place any responsibility on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF). Future work may enable these transfers
using a white list of known good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
When the PCI P2PDMA config option is selected the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bit must be disabled
at boot so the IOMMU subsystem can correctly create the groups, though
this could be addressed in the future. There is no way to dynamically
disable the bit and alter the groups.
Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transactions and then
use that list to find any P2P memory that is supported by all the
client devices.
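Putting those pieces together, the client-side flow looks roughly like
this (a sketch against the function names used in this series; exact
signatures may differ):

    /* Sketch: find one P2P provider usable by every client device,
     * then allocate P2P memory from its BAR as a scatterlist. */
    p2p_dev = pci_p2pmem_find(&clients);
    if (p2p_dev)
            sgl = pci_p2pmem_alloc_sgl(p2p_dev, &nents, length);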
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged since a non-homogeneous request would
complicate the DMA mapping requirements.
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
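In other words (hedged; the argument lists below are illustrative), the
driver picks the mapping path per request:

    /* Sketch: P2P-backed requests take the P2P mapping helper, which
     * applies the PCI bus offset; everything else maps as usual. */
    if (is_pci_p2pdma_page(sg_page(sg)))
            nents = pci_p2pmem_map_sg(dev, sg, sg_count);
    else
            nents = dma_map_sg(dev, sg, sg_count, dma_dir);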
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
that don't use the proper DMA infrastructure this code rejects using
any device that employs the virt_dma_ops implementation.
Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
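For example, enabling it on a port would look something like this (a
sketch; the configfs path is illustrative of the boolean described above):

    # opt an nvmet port in to using P2P memory for its IO transfers
    echo 1 > /sys/kernel/config/nvmet/ports/1/allow_p2pmem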
These patches have been tested on a number of Intel based systems and
for a variety of RDMA NICs (Mellanox, Broadcom, Chelsio) and NVMe
SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
Microsemi, Chelsio and Everspin) using switches from both Microsemi
and Broadcom.
Logan Gunthorpe (14):
PCI/P2PDMA: Support peer-to-peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
docs-rst: Add a new directory for PCI documentation
PCI/P2PDMA: Add P2P DMA driver writer's documentation
block: Introduce PCI P2P flags for request and request queue
IB/core: Ensure we map P2P memory correctly in
rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Introduce helper functions to allocate and free request SGLs
nvmet-rdma: Use new SGL alloc/free helper for requests
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 +
Documentation/PCI/index.rst | 14 +
Documentation/driver-api/index.rst | 2 +-
Documentation/driver-api/pci/index.rst | 20 +
Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
Documentation/driver-api/{ => pci}/pci.rst | 0
Documentation/index.rst | 3 +-
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 13 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 118 +++--
drivers/nvme/target/configfs.c | 67 +++
drivers/nvme/target/core.c | 143 ++++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 15 +
drivers/nvme/target/rdma.c | 22 +-
drivers/pci/Kconfig | 26 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
drivers/pci/pci.c | 6 +
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 19 +
include/linux/pci-p2pdma.h | 118 +++++
include/linux/pci.h | 4 +
26 files changed, 1579 insertions(+), 56 deletions(-)
create mode 100644 Documentation/PCI/index.rst
create mode 100644 Documentation/driver-api/pci/index.rst
create mode 100644 Documentation/driver-api/pci/p2pdma.rst
rename Documentation/driver-api/{ => pci}/pci.rst (100%)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
--
2.11.0
Re: [dm-devel] [PATCH] dm-writecache
by Christoph Hellwig
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/drivers/md/dm-writecache.c 2018-03-08 14:23:31.059999000 +0100
> @@ -0,0 +1,2417 @@
> +#include <linux/device-mapper.h>
missing copyright statement or, for the new-fashioned, an SPDX statement.
> +#define WRITEBACK_FUA true
no business having this around.
> +#ifndef bio_set_dev
> +#define bio_set_dev(bio, dev) ((bio)->bi_bdev = (dev))
> +#endif
> +#ifndef timer_setup
> +#define timer_setup(t, c, f) setup_timer(t, c, (unsigned long)(t))
> +#endif
no business in mainline.
> +/*
> + * On X86, non-temporal stores are more efficient than cache flushing.
> + * On ARM64, cache flushing is more efficient.
> + */
> +#if defined(CONFIG_X86_64)
> +#define NT_STORE(dest, src) \
> +do { \
> + typeof(src) val = (src); \
> + memcpy_flushcache(&(dest), &val, sizeof(src)); \
> +} while (0)
> +#define COMMIT_FLUSHED() wmb()
> +#else
> +#define NT_STORE(dest, src) WRITE_ONCE(dest, src)
> +#define FLUSH_RANGE dax_flush
> +#define COMMIT_FLUSHED() do { } while (0)
> +#endif
Please use proper APIs for this; it has no business in a driver.
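(For reference, memcpy_flushcache() is the portable helper that already
hides this architecture difference; a minimal sketch:)

    /* memcpy_flushcache() resolves to non-temporal stores on x86 and
     * to copy-plus-cache-flush on architectures without them. */
    memcpy_flushcache(&dest, &val, sizeof(val));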
And that's it for now. This is clearly not submission ready, and I
should get back to my backlog of other things.
[PATCH 1/3] nvdimm: fix typo in label-size definition
by Ross Zwisler
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Fixes: commit da6789c27c2e ("nvdimm: add a macro for property "label-size"")
Cc: Haozhong Zhang <haozhong.zhang(a)intel.com>
Cc: Michael S. Tsirkin <mst(a)redhat.com>
Cc: Stefan Hajnoczi <stefanha(a)redhat.com>
---
hw/mem/nvdimm.c | 2 +-
include/hw/mem/nvdimm.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index acb656b672..4087aca25e 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -89,7 +89,7 @@ static void nvdimm_set_unarmed(Object *obj, bool value, Error **errp)
static void nvdimm_init(Object *obj)
{
- object_property_add(obj, NVDIMM_LABLE_SIZE_PROP, "int",
+ object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
nvdimm_get_label_size, nvdimm_set_label_size, NULL,
NULL, NULL);
object_property_add_bool(obj, NVDIMM_UNARMED_PROP,
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index 7fd87c4e1c..74c60332e1 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -48,7 +48,7 @@
#define NVDIMM_GET_CLASS(obj) OBJECT_GET_CLASS(NVDIMMClass, (obj), \
TYPE_NVDIMM)
-#define NVDIMM_LABLE_SIZE_PROP "label-size"
+#define NVDIMM_LABEL_SIZE_PROP "label-size"
#define NVDIMM_UNARMED_PROP "unarmed"
struct NVDIMMDevice {
--
2.14.3
[PATCH v9 0/9] dax: fix dma vs truncate/hole-punch
by Dan Williams
Changes since v8 [1]:
* Rebase on v4.17-rc2
* Fix get_user_pages_fast() for ZONE_DEVICE pages to revalidate the pte,
pmd, pud after taking references (Jan)
* Kill dax_layout_lock(). With get_user_pages_fast() for ZONE_DEVICE
fixed we can then rely on the {pte,pmd}_lock to synchronize
dax_layout_busy_page() vs new page references (Jan)
* Hold the iolock over repeated invocations of dax_layout_busy_page() to
enable truncate/hole-punch to make forward progress in the presence of
a constant stream of new direct-I/O requests (Jan).
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/015058.html
---
Background:
get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.
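A hedged sketch of the filesystem-side loop this describes (modeled
loosely on this series' xfs_break_dax_layouts(); not the literal patch
code):

    /* Sketch: find a busy DAX page and wait for its reference count
     * to return to idle (1) before looking for the next one. */
    static int break_dax_layouts(struct inode *inode)
    {
            struct page *page;

            while ((page = dax_layout_busy_page(inode->i_mapping))) {
                    int error = ___wait_var_event(&page->_refcount,
                                    atomic_read(&page->_refcount) == 1,
                                    TASK_INTERRUPTIBLE, 0, 0, schedule());
                    if (error)
                            return error;
            }
            return 0;
    }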
---
Dan Williams (9):
dax, dm: introduce ->fs_{claim,release}() dax_device infrastructure
mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
memremap: split devm_memremap_pages() and memremap() infrastructure
mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
mm: fix __gup_device_huge vs unmap
mm, fs, dax: handle layout changes to pinned dax mappings
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
xfs: prepare xfs_break_layouts() for another layout type
xfs, dax: introduce xfs_break_dax_layouts()
drivers/dax/super.c | 99 ++++++++++++++++++++--
drivers/md/dm.c | 57 +++++++++++++
drivers/nvdimm/pmem.c | 3 -
fs/Kconfig | 2
fs/dax.c | 97 +++++++++++++++++++++
fs/ext2/super.c | 6 +
fs/ext4/super.c | 6 +
fs/xfs/xfs_file.c | 72 +++++++++++++++-
fs/xfs/xfs_inode.h | 16 ++++
fs/xfs/xfs_ioctl.c | 8 --
fs/xfs/xfs_iops.c | 16 ++--
fs/xfs/xfs_pnfs.c | 16 ++--
fs/xfs/xfs_pnfs.h | 6 +
fs/xfs/xfs_super.c | 20 ++--
include/linux/dax.h | 71 +++++++++++++++-
include/linux/memremap.h | 25 ++----
include/linux/mm.h | 71 ++++++++++++----
kernel/Makefile | 3 -
kernel/iomem.c | 167 +++++++++++++++++++++++++++++++++++++
kernel/memremap.c | 208 ++++++----------------------------------------
mm/Kconfig | 5 +
mm/gup.c | 37 ++++++--
mm/hmm.c | 13 ---
mm/swap.c | 3 -
24 files changed, 730 insertions(+), 297 deletions(-)
create mode 100644 kernel/iomem.c
[PATCH v5 0/4] ndctl: convert actions to use util_filter_walk
by Dave Jiang
util_filter_walk() does the looping through bus/dimm/region/namespace
that a lot of the operations in ndctl use. Converting them to common
code reduces maintenance on individual versions of the same code.
In this series we are converting the namespace, region, and dimm actions.
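For readers unfamiliar with the helper, the consuming pattern is roughly
(a sketch; the callback wiring follows util/filter.h but may not match the
final patches exactly):

    /* Sketch: an action supplies per-object callbacks and lets
     * util_filter_walk() do the bus/region/dimm/namespace iteration. */
    struct util_filter_ctx fctx = {
            .filter_namespace = filter_namespace, /* act on each match */
    };
    rc = util_filter_walk(ctx, &fctx, param);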
---
v5:
- fix behavior regression in filter_namespace (Dan)
- fix segfault when no namespace exists for create_namespace actions.
v4:
- change struct names to be less confusing. (Dan)
v3:
- fixed some corner cases in namespace patch.
- changed param renaming to reduce change for util_filter_params. (Dan)
- Adding conversion to region
- Adding conversion to dimm
v2:
- split out the conversion of util_filter_params to make things more
readable (Dan).
- Do not pass in mode via util_filter_params and put back the mode check in
util_filter_walk() (Dan).
Dave Jiang (4):
ndctl: convert namespace actions to use util_filter_params
ndctl: convert namespace actions to use util_filter_walk()
ndctl: convert region actions to use util_filter_walk()
ndctl: convert dimm actions to use util_filter_walk()
ndctl/dimm.c | 83 ++++++++++++++---------
ndctl/namespace.c | 194 +++++++++++++++++++++++++++++------------------------
ndctl/region.c | 59 ++++++++++------
test/btt-check.sh | 2 -
util/filter.c | 5 +
util/filter.h | 23 ++++++
6 files changed, 220 insertions(+), 146 deletions(-)
--
Signature
[PATCH v2] acpi: nfit: document sysfs interface
by Aishwarya Pant
This is an attempt to document the nfit sysfs interface. The
descriptions have been collected from git commit logs and the ACPI
specification 6.2.
Signed-off-by: Aishwarya Pant <aishpant(a)gmail.com>
---
Changes in v2:
- Add descriptions for range_index and ecc_unit_size
- Edit various descriptions as suggested
Documentation/ABI/testing/sysfs-bus-nfit | 233 +++++++++++++++++++++++++++++++
1 file changed, 233 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-nfit
diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit
new file mode 100644
index 000000000000..619eb8ca0f99
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-nfit
@@ -0,0 +1,233 @@
+For all of the nmem device attributes under nfit/*, see the 'NVDIMM Firmware
+Interface Table (NFIT)' section in the ACPI specification
+(http://www.uefi.org/specifications) for more details.
+
+What: /sys/bus/nd/devices/nmemX/nfit/serial
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Serial number of the NVDIMM (non-volatile dual in-line
+ memory module), assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/handle
+Date: Apr, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The address (given by the _ADR object) of the device on its
+ parent bus of the NVDIMM device containing the NVDIMM region.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/device
+Date: Apr, 2015
+KernelVersion: v4.1
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Device id for the NVDIMM, assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/rev_id
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Revision of the NVDIMM, assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/phys_id
+Date: Apr, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Handle (i.e., instance number) for the SMBIOS (system
+ management BIOS) Memory Device structure describing the NVDIMM
+ containing the NVDIMM region.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/flags
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The flags in the NFIT memory device sub-structure indicate
+ the state of the data on the nvdimm relative to its energy
+ source or last "flush to persistence".
+
+ The attribute is a translation of the 'NVDIMM State Flags' field
+ in section 5.2.25.3 'NVDIMM Region Mapping' Structure of the
+ ACPI specification 6.2.
+
+ The health states are "save_fail", "restore_fail", "flush_fail",
+ "not_armed", "smart_event", "map_fail" and "smart_notify".
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/format
+What: /sys/bus/nd/devices/nmemX/nfit/format1
+What: /sys/bus/nd/devices/nmemX/nfit/formats
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The interface codes indicate support for persistent memory
+ mapped directly into system physical address space and / or a
+ block aperture access mechanism to the NVDIMM media.
+ The 'formats' attribute displays the number of supported
+ interfaces.
+
+ This layout is compatible with existing libndctl binaries that
+ only expect one code per-dimm as they will ignore
+ nmemX/nfit/formats and nmemX/nfit/formatN.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/vendor
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Vendor id of the NVDIMM.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/dsm_mask
+Date: May, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The bitmask indicates the supported device specific control
+ functions relative to the NVDIMM command family supported by the
+ device
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/family
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Displays the NVDIMM family command sets. Values
+ 0, 1, 2 and 3 correspond to NVDIMM_FAMILY_INTEL,
+ NVDIMM_FAMILY_HPE1, NVDIMM_FAMILY_HPE2 and NVDIMM_FAMILY_MSFT
+ respectively.
+
+ See the specifications for these command families here:
+ http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
+ https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/
+ https://msdn.microsoft.com/library/windows/hardware/mt604741
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/id
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) ACPI specification 6.2, section 5.2.25.9, defines an
+ identifier for an NVDIMM, which is reflected in this id attribute.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_vendor
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system vendor id of the NVDIMM non-volatile memory
+ subsystem controller.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_rev_id
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system revision id of the NVDIMM non-volatile memory subsystem
+ controller, assigned by the non-volatile memory subsystem
+ controller vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_device
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system device id for the NVDIMM non-volatile memory
+ subsystem controller, assigned by the non-volatile memory
+ subsystem controller vendor.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/revision
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) ACPI NFIT table revision number.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/scrub
+Date: Sep, 2016
+KernelVersion: v4.9
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RW) This shows the number of full Address Range Scrubs (ARS)
+ that have been completed since driver load time. Userspace can
+ wait on this using select/poll etc. A '+' at the end indicates
+ an ARS is in progress.
+
+ Writing a value of 1 triggers an ARS scan.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/hw_error_scrub
+Date: Sep, 2016
+KernelVersion: v4.9
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RW) Provides a way to toggle the behavior between just adding
+ the address (cache line) where the MCE happened to the poison
+ list and doing a full scrub. The former (selective insertion of
+ the address) is done unconditionally.
+
+ This attribute can have the following values written to it:
+
+ '0': Switch to the default mode where an exception will only
+ insert the address of the memory error into the poison and
+ badblocks lists.
+ '1': Enable a full scrub to happen if an exception for a memory
+ error is received.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/dsm_mask
+Date: Jun, 2017
+KernelVersion: v4.13
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The bitmask indicates the supported bus specific control
+ functions. See the section named 'NVDIMM Root Device _DSMs' in
+ the ACPI specification.
+
+
+What: /sys/bus/nd/devices/regionX/nfit/range_index
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) A unique number provided by the BIOS to identify an address
+ range. Used by NVDIMM Region Mapping Structure to uniquely refer
+ to this structure. Value of 0 is reserved and not used as an
+ index.
+
+
+What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
+Date: Aug, 2017
+KernelVersion: v4.14
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Size of a write request to a DIMM that will not incur a
+ read-modify-write cycle at the memory controller.
+
+ When the nfit driver initializes it runs an ARS (Address Range
+ Scrub) operation across every pmem range. Part of that process
+ involves determining the ARS capabilities of a given address
+ range. One of the capabilities that is reported is the 'Clear
+ Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
+ section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
+ This property indicates the boundary at which the NVDIMM may
+ need to perform read-modify-write cycles to maintain ECC (Error
+ Correcting Code) blocks.
--
2.16.2