[PATCH 00/12] device-dax: Support sub-dividing soft-reserved ranges
by Dan Williams
The device-dax facility allows an address range to be directly mapped
through a chardev, or turned around and hotplugged to the core kernel
page allocator as System-RAM. It is the baseline mechanism for
converting persistent memory (pmem) to be used as another volatile
memory pool, i.e. the current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [1]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [2] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [3]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges.
The baseline for this series is today's next/master + "[PATCH v2 0/6]
Manual definition of Soft Reserved memory devices" [4].
Big thanks to Joao for the early testing and feedback on this series!
Given the dependencies on the memremap_pages() reworks in Andrew's tree
and the proximity to v5.7 this is clearly v5.8 material. The patches in
most need of a second opinion are the memremap_pages() reworks to switch
from 'struct resource' to 'struct range' and allow for an array of
ranges to be mapped at once.
[1]: https://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit...
[2]: https://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit...
[3]: https://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com/
[4]: http://lore.kernel.org/r/158489354353.1457606.8327903161927980740.stgit@d...
---
Dan Williams (12):
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce seed devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
arch/powerpc/kvm/book3s_hv_uvmem.c | 14 -
drivers/base/core.c | 2
drivers/dax/bus.c | 877 ++++++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 +
drivers/dax/dax-private.h | 36 +
drivers/dax/device.c | 97 ++--
drivers/dax/hmem/hmem.c | 18 -
drivers/dax/kmem.c | 170 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 +
drivers/gpu/drm/nouveau/nouveau_dmem.c | 4
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 +
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/linux/memremap.h | 9
include/linux/range.h | 6
mm/memremap.c | 297 ++++++-----
tools/testing/nvdimm/dax-dev.c | 22 +
tools/testing/nvdimm/test/iomap.c | 2
23 files changed, 1318 insertions(+), 403 deletions(-)
[PATCH v2 0/3] Maintainer Entry Profiles
by Dan Williams
Changes since v1 [1]:
- Simplify the profile to a hopefully non-controversial set of
attributes that address the most common sources of contributor
confusion, or maintainer frustration.
- Rename "Subsystem Profile" to "Maintainer Entry Profile". Not every
entry in MAINTAINERS represents a full subsystem. There may be driver
local considerations to communicate to a submitter in addition to wider
subsystem guidelines.
- Delete the old P: tag in MAINTAINERS rather than convert to a new E:
tag (Joe Perches).
[1]: http://lore.kernel.org/r/154225759358.2499188.15268218778137905050.stgit@...
---
At last year's Plumbers Conference I proposed the Maintainer Entry
Profile as a document that a maintainer can provide to set contributor
expectations and provide fodder for a discussion between maintainers
about the merits of different maintainer policies.
For those that did not attend, the goal of the Maintainer Entry Profile,
and the Maintainer Handbook more generally, is to provide a desk
reference for maintainers both new and experienced. The session
introduction was:
The first rule of kernel maintenance is that there are no hard and
fast rules. That state of affairs is both a blessing and a curse. It
has served the community well to be adaptable to the different
people and different problem spaces that inhabit the kernel
community. However, that variability also leads to inconsistent
experiences for contributors, little to no guidance for new
contributors, and unnecessary stress on current maintainers. There
are quite a few people who have been around long enough to make
enough mistakes that they have gained some hard-earned proficiency.
However, if the kernel community expects to keep growing it needs to
be able both to scale the maintainers it has and to ramp up new ones
without necessarily letting them make a decade's worth of mistakes to
learn the ropes.
To be clear, the proposed document does not impose or suggest new
rules. Instead, it provides an outlet to document the unwritten rules
and policies in effect for each subsystem, rules that each subsystem
might decide differently for whatever reason.
---
Dan Williams (3):
MAINTAINERS: Reclaim the P: tag for Maintainer Entry Profile
Maintainer Handbook: Maintainer Entry Profile
libnvdimm, MAINTAINERS: Maintainer Entry Profile
Documentation/maintainer/index.rst | 1
.../maintainer/maintainer-entry-profile.rst | 99 ++++++++++++++++++++
Documentation/nvdimm/maintainer-entry-profile.rst | 64 +++++++++++++
MAINTAINERS | 20 ++--
4 files changed, 175 insertions(+), 9 deletions(-)
create mode 100644 Documentation/maintainer/maintainer-entry-profile.rst
create mode 100644 Documentation/nvdimm/maintainer-entry-profile.rst
[PATCH v5 0/4] powerpc/papr_scm: Add support for reporting nvdimm health
by Vaibhav Jain
The PAPR standard[1][3] provides mechanisms to query the health and
performance stats of an NVDIMM via various hcalls as described in Ref[2].
Until now these stats were neither available nor exposed to user-space
tools like 'ndctl'. This is partly due to the PAPR platform not having
support for ACPI and NFIT. Hence 'ndctl' is unable to query and report the
dimm health status, and a user has no way to determine the current health
status of an NVDIMM.
To overcome this limitation, this patch-set updates papr_scm kernel module
to query and fetch nvdimm health stats using hcalls described in Ref[2].
These health and performance stats are then exposed to userspace via sysfs
and Dimm-Specific-Methods (DSMs) issued by libndctl.
These changes, coupled with the proposed ndctl changes located at Ref[4],
should provide a way for the user to retrieve NVDIMM health status using ndctl.
Below is a sample output using the proposed kernel + ndctl for a PAPR NVDIMM
in an emulation environment:
# ndctl list -DH
[
{
"dev":"nmem0",
"health":{
"health_state":"fatal",
"shutdown_state":"dirty"
}
}
]
Dimm health report output on a pseries guest lpar with vPMEM or HMS
based nvdimms that are in perfectly healthy condition:
# ndctl list -d nmem0 -H
[
{
"dev":"nmem0",
"health":{
"health_state":"ok",
"shutdown_state":"clean"
}
}
]
PAPR Dimm-Specific-Methods(DSM)
================================
As the name suggests, DSMs are used by vendor-specific code in libndctl to
execute certain operations or fetch certain information for NVDIMMs. DSMs
can be sent to papr_scm module via libndctl (userspace) and libnvdimm
(kernel) using the ND_CMD_CALL ioctl which can be handled in the dimm
control function papr_scm_ndctl(). For PAPR this patchset proposes a single
DSM to retrieve DIMM health, defined in the newly introduced uapi header
named 'papr_scm_dsm.h'. Support for more DSMs will be added in future.
Structure of the patch-set
==========================
The patchset starts with implementing support for fetching nvdimm health
information from PHYP and partially exposing it to user-space via nvdimm
flags.
The second and third patches implement support for servicing DSM
commands in papr_scm.
Finally, the fourth patch implements support for servicing the DSM
'DSM_PAPR_SCM_HEALTH' that returns the nvdimm health information to
libndctl.
Changelog:
==========
v4..v5:
* Fixed a bug in new implementation of papr_scm_ndctl() that was triggering
a false error condition.
v3..v4:
* Restructured papr_scm_ndctl() to dispatch ND_CMD_CALL commands to a new
function named papr_scm_service_dsm() to service DSM requests. [Aneesh]
v2..v3:
* Updated the papr_scm_dsm.h header to be more conformant to general kernel
guidelines for UAPI headers. [Aneesh]
* Changed the definition of macro PAPR_SCM_DIMM_UNARMED_MASK to not
include the case when the nvdimm is unarmed because it's a vPMEM
nvdimm. [Aneesh]
v1..v2:
* Restructured the patch-set based on review comments on V1 patch-set to
simplify the patch review. Multiple small patches have been combined into
single patches to reduce cross referencing that was needed in earlier
patch-set. Hence most of the patches in this patch-set are now new. [Aneesh]
* Removed the initial work done for fetching nvdimm performance statistics.
These changes will be re-proposed in a separate patch-set. [Aneesh]
* Simplified handling of versioning of 'struct
nd_papr_scm_dimm_health_stat_v1' as only one version of the structure is
currently in existence.
References:
[1]: "Power Architecture Platform Reference"
https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
[2]: commit 58b278f568f0
("powerpc: Provide initial documentation for PAPR hcalls")
[3]: "Linux on Power Architecture Platform Reference"
https://members.openpowerfoundation.org/document/dl/469
[4]: https://patchwork.kernel.org/project/linux-nvdimm/list/?series=244625
Vaibhav Jain (4):
powerpc/papr_scm: Fetch nvdimm health information from PHYP
ndctl/uapi: Introduce NVDIMM_FAMILY_PAPR_SCM as a new NVDIMM DSM
family
powerpc/papr_scm,uapi: Add support for handling PAPR DSM commands
powerpc/papr_scm: Implement support for DSM_PAPR_SCM_HEALTH
arch/powerpc/include/asm/papr_scm.h | 48 ++++
arch/powerpc/include/uapi/asm/papr_scm_dsm.h | 201 ++++++++++++++
arch/powerpc/platforms/pseries/papr_scm.c | 277 ++++++++++++++++++-
include/uapi/linux/ndctl.h | 1 +
4 files changed, 519 insertions(+), 8 deletions(-)
create mode 100644 arch/powerpc/include/asm/papr_scm.h
create mode 100644 arch/powerpc/include/uapi/asm/papr_scm_dsm.h
--
2.25.1
[PATCH 00/20] virtiofs: Add DAX support
by Vivek Goyal
Hi,
This patch series adds DAX support to virtiofs filesystem. This allows
bypassing guest page cache and allows mapping host page cache directly
in guest address space.
When a page of a file is needed, the guest sends a request to map that
page (in the host page cache) into qemu's address space. Inside the guest
this appears as a physical memory range controlled by the virtiofs device.
The guest directly maps this physical address range using DAX and hence
gets access to file data on the host.
This can speed things up considerably in many situations. It can also
result in substantial memory savings, as file data does not have to be
copied into the guest and is instead accessed directly from the host page
cache.
Most of the changes are limited to fuse/virtiofs. There are a couple of
changes needed in the generic dax infrastructure and a couple of changes
in virtio to be able to access the shared memory region.
These patches apply on top of 5.6-rc4 and are also available here.
https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
Any review or feedback is welcome.
Performance
===========
I have basically run a bunch of fio jobs to get a sense of the speed of
various operations. I wrote a simple wrapper script to run each fio job
3 times and report the average. These scripts and fio jobs are available
here.
https://github.com/rhvgoyal/virtiofs-tests
I set up a directory on ramfs on the host, exported it to the guest using
virtio-fs, and ran tests inside the guest. Tests were run with cache=none,
both with dax enabled and disabled. The cache=none option ensures that no
caching happens in the guest, for either data or metadata.
Test Setup
-----------
- A fedora 29 host with 376Gi RAM, 2 sockets (20 cores per socket, 2
threads per core)
- Using ramfs on host as backing store. 4 fio files of 8G each.
- Created a VM with 64 VCPUs and 64GB memory, plus a 64GB cache window (for
dax mmap).
Test Results
------------
- Results in two configurations have been reported.
virtio-fs (cache=none) and virtio-fs (cache=none + dax).
There are other caching modes as well but to me cache=none seemed most
interesting for now because it does not cache anything in guest
and provides strong coherence. Other modes which provide less strong
coherence and hence are faster are yet to be benchmarked.
- Three fio ioengines psync, libaio and mmap have been used.
- I/O workloads of randread, randwrite, seqread and seqwrite have been run.
- Each file size is 8G. Block size 4K. iodepth=16
- "multi" means same operation was done with 4 jobs and each job is
operating on a file of size 8G.
- Some results are "0 (KiB/s)". That means that particular operation is
not supported in that configuration.
NAME I/O Operation BW(Read/Write)
virtiofs-cache-none seqread-psync 35(MiB/s)
virtiofs-cache-none-dax seqread-psync 643(MiB/s)
virtiofs-cache-none seqread-psync-multi 219(MiB/s)
virtiofs-cache-none-dax seqread-psync-multi 2132(MiB/s)
virtiofs-cache-none seqread-mmap 0(KiB/s)
virtiofs-cache-none-dax seqread-mmap 741(MiB/s)
virtiofs-cache-none seqread-mmap-multi 0(KiB/s)
virtiofs-cache-none-dax seqread-mmap-multi 2530(MiB/s)
virtiofs-cache-none seqread-libaio 293(MiB/s)
virtiofs-cache-none-dax seqread-libaio 425(MiB/s)
virtiofs-cache-none seqread-libaio-multi 207(MiB/s)
virtiofs-cache-none-dax seqread-libaio-multi 1543(MiB/s)
virtiofs-cache-none randread-psync 36(MiB/s)
virtiofs-cache-none-dax randread-psync 572(MiB/s)
virtiofs-cache-none randread-psync-multi 211(MiB/s)
virtiofs-cache-none-dax randread-psync-multi 1764(MiB/s)
virtiofs-cache-none randread-mmap 0(KiB/s)
virtiofs-cache-none-dax randread-mmap 719(MiB/s)
virtiofs-cache-none randread-mmap-multi 0(KiB/s)
virtiofs-cache-none-dax randread-mmap-multi 2005(MiB/s)
virtiofs-cache-none randread-libaio 300(MiB/s)
virtiofs-cache-none-dax randread-libaio 413(MiB/s)
virtiofs-cache-none randread-libaio-multi 327(MiB/s)
virtiofs-cache-none-dax randread-libaio-multi 1326(MiB/s)
virtiofs-cache-none seqwrite-psync 34(MiB/s)
virtiofs-cache-none-dax seqwrite-psync 494(MiB/s)
virtiofs-cache-none seqwrite-psync-multi 223(MiB/s)
virtiofs-cache-none-dax seqwrite-psync-multi 1680(MiB/s)
virtiofs-cache-none seqwrite-mmap 0(KiB/s)
virtiofs-cache-none-dax seqwrite-mmap 1217(MiB/s)
virtiofs-cache-none seqwrite-mmap-multi 0(KiB/s)
virtiofs-cache-none-dax seqwrite-mmap-multi 2359(MiB/s)
virtiofs-cache-none seqwrite-libaio 282(MiB/s)
virtiofs-cache-none-dax seqwrite-libaio 348(MiB/s)
virtiofs-cache-none seqwrite-libaio-multi 320(MiB/s)
virtiofs-cache-none-dax seqwrite-libaio-multi 1255(MiB/s)
virtiofs-cache-none randwrite-psync 32(MiB/s)
virtiofs-cache-none-dax randwrite-psync 458(MiB/s)
virtiofs-cache-none randwrite-psync-multi 213(MiB/s)
virtiofs-cache-none-dax randwrite-psync-multi 1343(MiB/s)
virtiofs-cache-none randwrite-mmap 0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap 663(MiB/s)
virtiofs-cache-none randwrite-mmap-multi 0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap-multi 1820(MiB/s)
virtiofs-cache-none randwrite-libaio 292(MiB/s)
virtiofs-cache-none-dax randwrite-libaio 341(MiB/s)
virtiofs-cache-none randwrite-libaio-multi 322(MiB/s)
virtiofs-cache-none-dax randwrite-libaio-multi 1094(MiB/s)
Conclusion
===========
- virtio-fs with dax enabled is significantly faster and more memory
efficient as compared to non-dax operation.
Note:
Right now the dax window is 64G and the max fio file size is 32G (4
files of 8G each). That means everything fits into the dax window and no
reclaim is needed. The dax window reclaim logic is slower, and if the
file size is bigger than the dax window size, performance slows down.
Thanks
Vivek
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (2):
virtio_fs, dax: Set up virtio_fs dax_device
fuse,dax: add DAX mmap support
Vivek Goyal (15):
dax: Modify bdev_dax_pgoff() to handle NULL bdev
dax: Create a range version of dax_layout_busy_page()
virtiofs: Provide a helper function for virtqueue initialization
fuse: Get rid of no_mount_options
fuse,virtiofs: Add a mount option to enable dax
fuse,virtiofs: Keep a list of free dax memory ranges
fuse: implement FUSE_INIT map_alignment field
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse,virtiofs: Define dax address space operations
fuse,virtiofs: Maintain a list of busy elements
fuse: Release file in process context
fuse: Take inode lock for dax inode truncation
fuse,virtiofs: Add logic to free up a memory range
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 107 +++
fs/dax.c | 66 +-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 109 ++-
fs/fuse/inode.c | 148 +++-
fs/fuse/virtio_fs.c | 250 +++++-
include/linux/dax.h | 6 +
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 42 +-
include/uapi/linux/virtio_fs.h | 3 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 11 +-
15 files changed, 1888 insertions(+), 81 deletions(-)
--
2.20.1
[PATCH v4 00/25] Add support for OpenCAPI Persistent Memory devices
by Alastair D'Silva
This series adds support for OpenCAPI Persistent Memory devices on bare metal (arch/powernv), exposing them as nvdimms so that we can make use of the existing infrastructure. There already exists a driver for the same devices abstracted through PowerVM (arch/pseries): arch/powerpc/platforms/pseries/papr_scm.c
These devices are connected via OpenCAPI and present as LPC (lowest coherence point) memory to the system; practically, that means that memory on these cards can be treated as conventional, cache-coherent memory.
Since the devices are connected via OpenCAPI, they are not enumerated via ACPI. Instead, OpenCAPI links present as pseudo-PCI bridges, with devices below them.
This series introduces a driver that exposes the memory on these cards as nvdimms, with each card getting its own bus. This is somewhat complicated by the fact that the cards do not have out-of-band persistent storage for metadata, so one SECTION_SIZE's (see SPARSEMEM) worth of storage is carved out of the top of the card storage to implement the ndctl_config_* calls.
The driver is not responsible for configuring the NPU (NVLink Processing Unit) BARs to map the LPC memory from the card into the system's physical address space, instead, it requests this to be done via OPAL calls (typically implemented by Skiboot).
The series is structured as follows:
- Required infrastructure changes & cleanup
- A minimal driver implementation
- Implementing additional features within the driver
Changelog:
V4:
- Rebase on next-20200320
- Bump copyright to 2020
- Ensure all uapi headers use C89 compatible comments (missed ocxlpmem.h)
- Move the driver back to drivers/nvdimm/ocxl, after confirmation
that this location is desirable
- Rename ocxl.c to ocxlpmem.c (+ support files)
- Rename all ocxl_pmem to ocxlpmem
- Address checkpatch --strict issues
- "powerpc/powernv: Add OPAL calls for LPC memory alloc/release"
- Pass base address as __be64
- "ocxl: Tally up the LPC memory on a link & allow it to be mapped"
- Address checkpatch spacing warnings
- Reword blurb
- Reword size description for ocxl_link_add_lpc_mem()
- Add an early exit in ocxl_link_lpc_release() to avoid triggering
bogus warnings if called after ocxl_link_lpc_map() fails
- "powerpc/powernv: Add OPAL calls for LPC memory alloc/release"
- Reword blurb
- "powerpc/powernv: Map & release OpenCAPI LPC memory"
- Reword blurb
- Move minor_idr init from file_init() to ocxlpmem_init() (fixes runtime error
in "nvdimm: Add driver for OpenCAPI Persistent Memory")
- Wrap long lines
- "nvdimm: Add driver for OpenCAPI Storage Class Memory"
- Remove '+ 1' workaround from serial number->cookie assignment
- Drop out of memory message for ocxlpmem in probe()
- Fix leaks of ocxlpmem & ocxlpmem->ocxl_fn in probe()
- remove struct ocxlpmem_function0, it didn't add value
- factor out err_unregistered label in probe
- Address more checkpatch warnings
- get/put the pci dev on probe/free
- Drop ocxlpmem_ prefix from static functions
- Propagate errors up from called functions in probe()
- Set MODULE_LICENSE to GPLv2
- Add myself as module author
- Call nvdimm_bus_unregister() in remove() to release references
- Don't call devm_memunmap on metadata_address, the release handler on
the device already deals with this
- "nvdimm/ocxl: Read the capability registers & wait for device ready"
- Fix mask for read_latency
- Fold in is_usable logic into timeout to remove error message race
- propagate bad rc from read_device_metadata
- "nvdimm/ocxl: Add register addresses & status values to the header"
- Add comments for register abbreviations where names have been
expanded
- Add missing status for blocked on background task
- Alias defines for firmware update status to show that the duplication
of values is intentional
- "nvdimm/ocxl: Register a character device for userspace to interact with"
- Add lock around minors IDR, delete the cdev before device_unregister
- Propagate errors up from called functions in probe()
- "nvdimm/ocxl: Add support for Admin commands"
- Fix typo in setup_command_data error message, and drop 'ocxl' from it
- Drop vestigial CHI read from admin_command_request
- Change command ID mismatch message to dev_err, and return an error
- Use jiffies to implement admin_command_complete_timeout()
- Flesh out blurb
- Create a wrapper to issue the command & wait for timeout
- "nvdimm/ocxl: Add support for near storage commands"
- dropped (will submit with the patches for nvdimm overwrite)
- "nvdimm/ocxl: Implement the Read Error Log command"
- Remove stray blank line
- change misplaced goto to an early exit in read_error_log
- Inline error_log_offset_0x08
- Read WWID data as LE rather than host endian
- Move the include of nvdimm/ocxlpmem.h to ocxl.c
- Add padding after fwrevision in struct ioctl_ocxl_pmem_error_log
- Register IOCTL magic
- Coerce pointers to __u64 in IOCTLs
- "nvdimm/ocxl: Add controller dump IOCTLs"
- Coerce pointers to __u64 in IOCTLs
- Document expected IOCTL usage in blurb
- Add missing rc check
- Only populate up to the number of bytes returned by the card,
and return this length to the caller
- Add missing header check
- "nvdimm/ocxl: Add an IOCTL to report controller statistics"
- Update to match the latest version of the spec
- Verify that parameter block IDs & lengths match what we expect
- Use defines for offsets
- "nvdimm/ocxl: Forward events to userspace"
- Don't enable NSCRA doorbell
- return -EBUSY if the event context is already used
- return -ENODEV if IRQs cannot be mapped
- Tag IRQ pointers with __iomem
- Drop ocxlpmem_ prefix from static functions
- Propagate error from eventfd_ctx_fdget
- Fix error check in copy_to_user
- Drop GLOBAL_MMIO_CHI_NSCRA (this should be in the overwrite patch)
- Drop unused irq_pgmap
- Don't redef BIT_ULL
- "nvdimm/ocxl: Add debug IOCTLs"
- Eliminate clearing loop (now done in admin_command_execute())
- Drop dummy IOCTLs if CONFIG_OCXL_PMEM_DEBUG is not set
- Group debug IOCTLs together & comment that they may not be available
- "nvdimm/ocxl: Expose SMART data via ndctl"
- Drop 'rc = 0; goto out;'
- Propagate errors from ndctl_smart()
- "nvdimm/ocxl: Expose the serial number in sysfs" & "nvdimm/ocxl: Expose the firmware version in sysfs"
- Squash these 2 patches together
- Expose data as a DIMM attribute rather than an ocxlpmem
attribute
- "nvdimm/ocxl: Add an IOCTL to request controller health & perf data"
- Reword blurb
- "nvdimm/ocxl: Implement the heartbeat command"
- Propagate rc in probe()
V3:
- Rebase against next/next-20200220
- Move driver to arch/powerpc/platforms/powernv, we now expect this
driver to go upstream via the powerpc tree
- "nvdimm/ocxl: Implement the Read Error Log command"
- Fix bad header path
- "nvdimm/ocxl: Read the capability registers & wait for device ready"
- Fix overlapping masks between readiness_timeout & memory_available_timeout
- "nvdimm: Add driver for OpenCAPI Storage Class Memory"
- Address minor review comments from Jonathan Cameron
- Remove attributes
- Default to module if building LIBNVDIMM
- Propagate errors up from called functions in probe()
- "nvdimm/ocxl: Expose SMART data via ndctl"
- Pack attributes in struct
- Support different size SMART buffers for compatibility with newer
ndctls that may want more SMART attribs than we provide
- Rework to use ND_CMD_CALL instead of ND_CMD_SMART
- drop "ocxl: Free detached contexts in ocxl_context_detach_all()"
- "powerpc: Map & release OpenCAPI LPC memory"
- Remove 'extern'
- Only available with CONFIG_MEMORY_HOTPLUG_SPARSE
- "ocxl: Tally up the LPC memory on a link & allow it to be mapped"
- Address minor review comments from Jonathan Cameron
- "ocxl: Add functions to map/unmap LPC memory"
- Split detected memory message into a separate patch
- Address minor review comments from Jonathan Cameron
- Add a comment explaining why unmap_lpc_mem is in deconfigure_afu
- "nvdimm/ocxl: Add support for Admin commands"
- use sizeof(u64) rather than 0x08 when iterating u64s
- "nvdimm/ocxl: Implement the heartbeat command"
- Fix typo in blurb
- Address kernel doc issues
- Ensure all uapi headers use C89 compatible comments
- Drop patches for firmware update & overwrite, these will be
submitted later once patches are available for ndctl
- Rename SCM to OpenCAPI Persistent Memory
V2:
- "powerpc: Map & release OpenCAPI LPC memory"
- Fix #if -> #ifdef
- use pci_dev_id to get the bdfn
- use __be64 to hold be data
- indent check_hotplug_memory_addressable correctly
- Remove export of check_hotplug_memory_addressable
- "ocxl: Conditionally bind SCM devices to the generic OCXL driver"
- Improve patch description and remove redundant default
- "nvdimm: Add driver for OpenCAPI Storage Class Memory"
- Mark a few funcs as static as identified by the 0day bot
- Add OCXL dependencies to OCXL_SCM
- Use memcpy_mcsafe in scm_ndctl_config_read
- Rename scm_foo_offset_0x00 to scm_foo_header_parse & add docs
- Name DIMM attribs "ocxl" rather than "scm"
- Split out into base + many feature patches
- "powerpc: Enable OpenCAPI Storage Class Memory driver on bare metal"
- Build DEV_DAX & friends as modules
- "ocxl: Conditionally bind SCM devices to the generic OCXL driver"
- Patch dropped (easy enough to maintain this out of tree for development)
- "ocxl: Tally up the LPC memory on a link & allow it to be mapped"
- Add a warning if an unmatched lpc_release is called
- "ocxl: Add functions to map/unmap LPC memory"
- Use EXPORT_SYMBOL_GPL
Alastair D'Silva (25):
powerpc/powernv: Add OPAL calls for LPC memory alloc/release
mm/memory_hotplug: Allow check_hotplug_memory_addressable to be called
from drivers
powerpc/powernv: Map & release OpenCAPI LPC memory
ocxl: Remove unnecessary externs
ocxl: Address kernel doc errors & warnings
ocxl: Tally up the LPC memory on a link & allow it to be mapped
ocxl: Add functions to map/unmap LPC memory
ocxl: Emit a log message showing how much LPC memory was detected
ocxl: Save the device serial number in ocxl_fn
nvdimm: Add driver for OpenCAPI Persistent Memory
powerpc: Enable the OpenCAPI Persistent Memory driver for
powernv_defconfig
nvdimm/ocxl: Add register addresses & status values to the header
nvdimm/ocxl: Read the capability registers & wait for device ready
nvdimm/ocxl: Add support for Admin commands
nvdimm/ocxl: Register a character device for userspace to interact
with
nvdimm/ocxl: Implement the Read Error Log command
nvdimm/ocxl: Add controller dump IOCTLs
nvdimm/ocxl: Add an IOCTL to report controller statistics
nvdimm/ocxl: Forward events to userspace
nvdimm/ocxl: Add an IOCTL to request controller health & perf data
nvdimm/ocxl: Implement the heartbeat command
nvdimm/ocxl: Add debug IOCTLs
nvdimm/ocxl: Expose SMART data via ndctl
nvdimm/ocxl: Expose the serial number & firmware version in sysfs
MAINTAINERS: Add myself & nvdimm/ocxl to ocxl
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 3 +
arch/powerpc/configs/powernv_defconfig | 5 +
arch/powerpc/include/asm/opal-api.h | 2 +
arch/powerpc/include/asm/opal.h | 2 +
arch/powerpc/include/asm/pnv-ocxl.h | 42 +-
arch/powerpc/platforms/powernv/ocxl.c | 43 +
arch/powerpc/platforms/powernv/opal-call.c | 2 +
drivers/misc/ocxl/config.c | 74 +-
drivers/misc/ocxl/core.c | 61 +
drivers/misc/ocxl/link.c | 60 +
drivers/misc/ocxl/ocxl_internal.h | 45 +-
drivers/nvdimm/Kconfig | 2 +
drivers/nvdimm/Makefile | 1 +
drivers/nvdimm/ocxl/Kconfig | 21 +
drivers/nvdimm/ocxl/Makefile | 7 +
drivers/nvdimm/ocxl/main.c | 1975 +++++++++++++++++
drivers/nvdimm/ocxl/ocxlpmem.h | 197 ++
drivers/nvdimm/ocxl/ocxlpmem_internal.c | 280 +++
include/linux/memory_hotplug.h | 5 +
include/misc/ocxl.h | 122 +-
include/uapi/linux/ndctl.h | 1 +
include/uapi/nvdimm/ocxlpmem.h | 127 ++
mm/memory_hotplug.c | 4 +-
24 files changed, 2983 insertions(+), 99 deletions(-)
create mode 100644 drivers/nvdimm/ocxl/Kconfig
create mode 100644 drivers/nvdimm/ocxl/Makefile
create mode 100644 drivers/nvdimm/ocxl/main.c
create mode 100644 drivers/nvdimm/ocxl/ocxlpmem.h
create mode 100644 drivers/nvdimm/ocxl/ocxlpmem_internal.c
create mode 100644 include/uapi/nvdimm/ocxlpmem.h
--
2.24.1
[PATCH v6 0/6] dax/pmem: Provide a dax operation to zero page range
by Vivek Goyal
Hi,
This is V6 of the patches. These patches are also available at.
Changes since V5:
- Dan Williams preferred ->zero_page_range() to only accept PAGE_SIZE
aligned requests and clear poison only on page-size-aligned zeroing, so
I changed it accordingly.
- Dropped all the modifications which were required to support arbitrary
range zeroing with-in a page.
- This patch series also fixes the issue where "truncate -s 512 foo.txt"
will fail if the first sector of the file is poisoned. Currently it
succeeds, even though the filesystem expects the whole filesystem block to
be free of poison at the end of the operation.
Christoph, I have dropped your Reviewed-by tag on 1-2 patches because
these patches changed substantially, especially the signature of the
dax zero_page_range() helper.
Thanks
Vivek
Vivek Goyal (6):
pmem: Add functions for reading/writing page to/from pmem
dax, pmem: Add a dax operation zero_page_range
s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
dm,dax: Add dax zero_page_range operation
dax: Use new dax zero page method for zeroing a page
dax,iomap: Add helper dax_iomap_zero() to zero a range
drivers/dax/super.c | 20 ++++++++
drivers/md/dm-linear.c | 18 +++++++
drivers/md/dm-log-writes.c | 17 ++++++
drivers/md/dm-stripe.c | 23 +++++++++
drivers/md/dm.c | 30 +++++++++++
drivers/nvdimm/pmem.c | 97 ++++++++++++++++++++++-------------
drivers/s390/block/dcssblk.c | 15 ++++++
fs/dax.c | 59 ++++++++++-----------
fs/iomap/buffered-io.c | 9 +---
include/linux/dax.h | 21 +++-----
include/linux/device-mapper.h | 3 ++
11 files changed, 221 insertions(+), 91 deletions(-)
--
2.20.1
[PATCH] memcpy_flushcache: use cache flushing for larger lengths
by Mikulas Patocka
I tested dm-writecache performance on a machine with Optane nvdimm and it
turned out that for larger writes, cached stores + cache flushing perform
better than non-temporal stores. This is the throughput of dm-writecache
measured with this command:
dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
block size    512        1024       2048       4096
movnti        496 MB/s   642 MB/s   725 MB/s   744 MB/s
clflushopt    373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
We can see that for smaller blocks, movnti performs better, but for larger
blocks, clflushopt has better performance.
This patch changes the function __memcpy_flushcache accordingly, so that
with size >= 768 it performs cached stores plus cache flushing. Note that
we must not use the new branch if the CPU doesn't have clflushopt; in
that case, the kernel would fall back to the inefficient "clflush"
instruction, which has very bad performance.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
---
arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
+++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-29 13:16:49.937011736 -0400
@@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
return;
}
+ if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768) {
+ while (!IS_ALIGNED(dest, 64)) {
+ asm("movq (%0), %%r8\n"
+ "movnti %%r8, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 8;
+ source += 8;
+ size -= 8;
+ }
+ do {
+ asm("movq (%0), %%r8\n"
+ "movq 8(%0), %%r9\n"
+ "movq 16(%0), %%r10\n"
+ "movq 24(%0), %%r11\n"
+ "movq %%r8, (%1)\n"
+ "movq %%r9, 8(%1)\n"
+ "movq %%r10, 16(%1)\n"
+ "movq %%r11, 24(%1)\n"
+ "movq 32(%0), %%r8\n"
+ "movq 40(%0), %%r9\n"
+ "movq 48(%0), %%r10\n"
+ "movq 56(%0), %%r11\n"
+ "movq %%r8, 32(%1)\n"
+ "movq %%r9, 40(%1)\n"
+ "movq %%r10, 48(%1)\n"
+ "movq %%r11, 56(%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8", "r9", "r10", "r11");
+ clflushopt((void *)dest);
+ dest += 64;
+ source += 64;
+ size -= 64;
+ } while (size >= 64);
+ }
+
/* 4x8 movnti loop */
while (size >= 32) {
asm("movq (%0), %%r8\n"
[PATCH 3/7] dax: Add missing annotation for wait_entry_unlocked()
by Jules Irenge
Sparse reports a warning at wait_entry_unlocked():
warning: context imbalance in wait_entry_unlocked()
- unexpected unlock
The root cause is the missing annotation at wait_entry_unlocked().
Add the missing __releases(xa) annotation.
Signed-off-by: Jules Irenge <jbi.octave(a)gmail.com>
---
fs/dax.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..adcd2a57fbad 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -244,6 +244,7 @@ static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
* After we call xas_unlock_irq(), we cannot touch xas->xa.
*/
static void wait_entry_unlocked(struct xa_state *xas, void *entry)
+ __releases(xa)
{
struct wait_exceptional_entry_queue ewait;
wait_queue_head_t *wq;
--
2.24.1
[ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check
by Vaibhav Jain
This patch adds a new command argument to the 'monitor' command, namely
'--check-interval', that triggers a call to notify_dimm_event() at
regular intervals, forcing a periodic check of dimm smart events.
This behavior is useful for dimms that do not support event notifications
when the health status of an nvdimm changes. This is especially
true for PAPR-SCM dimms, as the PHYP hypervisor doesn't provide
any notification to the guest kernel on a change in nvdimm health
status. In such cases, periodic polling is the only way to track
the health of an nvdimm.
The patch updates monitor_event(), adding a timeout value to the
epoll_wait() call. Also, to prevent a single dimm from generating enough
events to starve the health checks of the other nvdimms, a 'fullpoll_ts'
timestamp is added to keep track of when the last full health check of
all dimms happened. If, after epoll_wait() returns, 'fullpoll_ts'
indicates that the last full dimm health check happened more than
'check-interval' seconds ago, a full dimm health check is enforced.
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.ibm.com>
---
Documentation/ndctl/ndctl-monitor.txt | 4 ++++
ndctl/monitor.c | 31 ++++++++++++++++++++++++---
2 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/Documentation/ndctl/ndctl-monitor.txt b/Documentation/ndctl/ndctl-monitor.txt
index 2239f047266d..14cc59d57157 100644
--- a/Documentation/ndctl/ndctl-monitor.txt
+++ b/Documentation/ndctl/ndctl-monitor.txt
@@ -108,6 +108,10 @@ will not work if "--daemon" is specified.
The monitor will attempt to enable the alarm control bits for all
specified events.
+-i::
+--check-interval=::
+ Force a recheck of dimm health every <n> seconds.
+
-u::
--human::
Output monitor notification as human friendly json format instead
diff --git a/ndctl/monitor.c b/ndctl/monitor.c
index 1755b87a5eeb..b72c5852524e 100644
--- a/ndctl/monitor.c
+++ b/ndctl/monitor.c
@@ -4,6 +4,7 @@
#include <stdio.h>
#include <json-c/json.h>
#include <libgen.h>
+#include <time.h>
#include <dirent.h>
#include <util/json.h>
#include <util/filter.h>
@@ -33,6 +34,7 @@ static struct monitor {
bool daemon;
bool human;
bool verbose;
+ unsigned int poll_timeout;
unsigned int event_flags;
struct log_ctx ctx;
} monitor;
@@ -322,9 +324,14 @@ static int monitor_event(struct ndctl_ctx *ctx,
struct monitor_filter_arg *mfa)
{
struct epoll_event ev, *events;
- int nfds, epollfd, i, rc = 0;
+ int nfds, epollfd, i, rc = 0, polltimeout = -1;
struct monitor_dimm *mdimm;
char buf;
+ /* last time a full poll happened */
+ struct timespec fullpoll_ts, ts;
+
+ if (monitor.poll_timeout)
+ polltimeout = monitor.poll_timeout * 1000;
events = calloc(mfa->num_dimm, sizeof(struct epoll_event));
if (!events) {
@@ -354,14 +361,30 @@ static int monitor_event(struct ndctl_ctx *ctx,
}
}
+ clock_gettime(CLOCK_BOOTTIME, &fullpoll_ts);
while (1) {
did_fail = 0;
- nfds = epoll_wait(epollfd, events, mfa->num_dimm, -1);
- if (nfds <= 0 && errno != EINTR) {
+ nfds = epoll_wait(epollfd, events, mfa->num_dimm, polltimeout);
+ if (nfds < 0 && errno != EINTR) {
err(&monitor, "epoll_wait error: (%s)\n", strerror(errno));
rc = -errno;
goto out;
}
+
+ /* If needed force a full poll of dimm health */
+ clock_gettime(CLOCK_BOOTTIME, &ts);
+ if ((fullpoll_ts.tv_sec - ts.tv_sec) > monitor.poll_timeout) {
+ nfds = 0;
+ dbg(&monitor, "forcing a full poll\n");
+ }
+
+ /* If we timed out then fill events array with all dimms */
+ if (nfds == 0) {
+ list_for_each(&mfa->dimms, mdimm, list)
+ events[nfds++].data.ptr = mdimm;
+ fullpoll_ts = ts;
+ }
+
for (i = 0; i < nfds; i++) {
mdimm = events[i].data.ptr;
if (util_dimm_event_filter(mdimm, monitor.event_flags)) {
@@ -570,6 +593,8 @@ int cmd_monitor(int argc, const char **argv, struct ndctl_ctx *ctx)
"use human friendly output formats"),
OPT_BOOLEAN('v', "verbose", &monitor.verbose,
"emit extra debug messages to log"),
+ OPT_UINTEGER('i', "check-interval", &monitor.poll_timeout,
+ "force a dimm health recheck every <n> seconds"),
OPT_END(),
};
const char * const u[] = {
--
2.24.1