[PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
@Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
and related patches dropped from there, I can rebase whatever way you
prefer.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
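
For illustration only, a minimal userspace sketch of the flow described
above. It assumes the libc headers define SYS_memfd_secret (when they do
not, the helper simply reports ENOSYS), that the area is sized with
ftruncate(), and that it is mapped MAP_SHARED. secret_alloc() is a
hypothetical helper name, not something this series provides.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a "secret" mapping of len bytes, or NULL
 * with errno set (e.g. ENOSYS when the syscall is unavailable).
 */
static void *secret_alloc(size_t len)
{
	assert(len > 0);
#ifdef SYS_memfd_secret
	int fd = (int)syscall(SYS_memfd_secret, 0);	/* flags = 0 */
	void *p;
	int saved;

	if (fd < 0)
		return NULL;

	/* Assumption: the size of the area is set with ftruncate(). */
	if (ftruncate(fd, (off_t)len) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);		/* an established mapping outlives the fd */
	errno = saved;
	return p == MAP_FAILED ? NULL : p;
#else
	errno = ENOSYS;		/* headers do not know the syscall */
	return NULL;
#endif
}
```

A caller would use the returned pointer like any other mapping and
release it with munmap(p, len).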
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.
Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloa...
that does two things. First, it acts as a preloader for openssl that
redirects all OPENSSL_malloc calls to secret memory, so any secret keys
are protected automatically. Second, it exposes the API to users who
need it directly. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
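
To make the "pluggable allocator" idea concrete, here is a rough sketch
of a bump allocator that hands out chunks from one pre-mapped region.
It is not the preload library's implementation. Every name in it is
hypothetical, and the region is an ordinary anonymous mapping standing
in for a memfd_secret() mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define ARENA_SIZE  (64 * 1024)
#define ARENA_ALIGN 16

/* Hypothetical arena: one region, allocations are never freed one by one. */
struct arena {
	unsigned char *base;
	size_t used;
	size_t size;
};

/* Stand-in backing store; a real user would map a memfd_secret() fd here. */
static int arena_init(struct arena *a)
{
	void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(a != NULL);
	if (p == MAP_FAILED)
		return -1;
	a->base = p;
	a->used = 0;
	a->size = ARENA_SIZE;
	return 0;
}

/* Return n bytes aligned to ARENA_ALIGN, or NULL when the arena is full. */
static void *arena_alloc(struct arena *a, size_t n)
{
	size_t start = (a->used + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);

	if (n == 0 || start > a->size || n > a->size - start)
		return NULL;
	a->used = start + n;
	return a->base + start;
}
```

A toolkit's allocation hook could then call arena_alloc() for key
material, leaving the application code unchanged.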
Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
Removing pages from the direct map may fragment it on architectures that
use large pages to map physical memory, which affects system
performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.
In addition, there is also a long term goal to improve management of the
direct map.
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@lin...
v17:
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
v16:
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
the userspace
v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
* Add riscv/Kconfig update to disable set_memory operations for nommu
builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
(patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems
v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
* Finally s/mod_node_page_state/mod_lruvec_page_state/
v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
* Added Reviewed-by, thanks Catalin and David
* s/mod_node_page_state/mod_lruvec_page_state/ as Shakeel suggested
Older history:
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel...
Arnd Bergmann (1):
arm64: kfence: fix header inclusion
Mike Rapoport (9):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
riscv/Kconfig: make direct map manipulation options depend on MMU
set_memory: allow set_direct_map_*_noflush() for multiple pages
set_memory: allow querying whether set_direct_map_*() is actually enabled
mm: introduce memfd_secret system call to create "secret" memory areas
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/kfence.h | 2 +-
arch/arm64/include/asm/set_memory.h | 17 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 23 +-
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/set_memory.h | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/riscv/mm/pageattr.c | 8 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/set_memory.h | 4 +-
arch/x86/mm/pat/set_memory.c | 8 +-
fs/dax.c | 11 +-
include/linux/pgtable.h | 3 +
include/linux/secretmem.h | 30 +++
include/linux/set_memory.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 6 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/power/snapshot.c | 4 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/gup.c | 10 +
mm/internal.h | 3 +
mm/mlock.c | 3 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 261 +++++++++++++++++++
mm/vmalloc.c | 5 +-
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 ++
39 files changed, 726 insertions(+), 53 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
--
2.28.0
1 month, 4 weeks
[PATCH] cxl/mem: Fixes to IOCTL interface
by Ben Widawsky
When submitting a command for userspace, input and output payload bounce
buffers are allocated. For a given command, both input and output
buffers may exist and so when allocation of the input buffer fails, the
output buffer must be freed. As far as I can tell, userspace can't
easily exploit the leak to OOM a machine unless the machine was already
near OOM state.
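
The pattern behind the fix, sketched in userspace C under simplified
assumptions: malloc()/free() stand in for the kernel allocators, and
struct demo_mbox_cmd, alloc_or_fail() and prepare_cmd() are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_mbox_cmd {
	void *payload_in;
	void *payload_out;
};

/* Hypothetical allocator wrapper that can be forced to fail. */
static void *alloc_or_fail(size_t n, int fail)
{
	return fail ? NULL : malloc(n);
}

/*
 * Allocate the output buffer first, then the input buffer. If the second
 * allocation fails, the first must be released before returning.
 */
static int prepare_cmd(struct demo_mbox_cmd *c, size_t size_in,
		       size_t size_out, int fail_in)
{
	assert(c != NULL);
	c->payload_in = NULL;
	c->payload_out = NULL;

	if (size_out) {
		c->payload_out = alloc_or_fail(size_out, 0);
		if (!c->payload_out)
			return -ENOMEM;
	}

	if (size_in) {
		c->payload_in = alloc_or_fail(size_in, fail_in);
		if (!c->payload_in) {
			free(c->payload_out);	/* without this, the buffer leaks */
			c->payload_out = NULL;
			return -ENOMEM;
		}
	}
	return 0;
}
```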
This bug was introduced in v5 of the patch and did not exist in prior
revisions.
While here, rename the variable 'j', as raised in patch review by Konrad.
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Signed-off-by: Ben Widawsky <ben.widawsky(a)intel.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com> (v2)
Reviewed-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/cxl/mem.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index df895bcca63a..626fd7066f4f 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -514,8 +514,10 @@ static int handle_mailbox_cmd_from_user(struct cxl_mem *cxlm,
if (cmd->info.size_in) {
mbox_cmd.payload_in = vmemdup_user(u64_to_user_ptr(in_payload),
cmd->info.size_in);
- if (IS_ERR(mbox_cmd.payload_in))
+ if (IS_ERR(mbox_cmd.payload_in)) {
+ kvfree(mbox_cmd.payload_out);
return PTR_ERR(mbox_cmd.payload_in);
+ }
}
rc = cxl_mem_mbox_get(cxlm);
@@ -696,7 +698,7 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
struct device *dev = &cxlmd->dev;
struct cxl_mem_command *cmd;
u32 n_commands;
- int j = 0;
+ int cmds = 0;
dev_dbg(dev, "Query IOCTL\n");
@@ -714,10 +716,10 @@ static int cxl_query_cmd(struct cxl_memdev *cxlmd,
cxl_for_each_cmd(cmd) {
const struct cxl_command_info *info = &cmd->info;
- if (copy_to_user(&q->commands[j++], info, sizeof(*info)))
+ if (copy_to_user(&q->commands[cmds++], info, sizeof(*info)))
return -EFAULT;
- if (j == n_commands)
+ if (cmds == n_commands)
break;
}
--
2.30.1
1 month, 4 weeks
[PATCH] device-dax: Switch to using the new API kobj_to_dev()
by Yang Li
Fix the following coccicheck warnings:
./drivers/dax/bus.c:486:60-61: WARNING opportunity for kobj_to_dev()
./drivers/dax/bus.c:1215:60-61: WARNING opportunity for kobj_to_dev()
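
For readers unfamiliar with the helper, a simplified userspace sketch of
what kobj_to_dev() wraps. The structs below are cut-down stand-ins for
the kernel's struct kobject and struct device, and this container_of()
omits the kernel macro's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures. */
struct kobject {
	const char *name;
};

struct device {
	int id;
	struct kobject kobj;	/* embedded, not a pointer */
};

/* Simplified container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same conversion as the open-coded container_of() calls being replaced. */
static inline struct device *kobj_to_dev(struct kobject *kobj)
{
	assert(kobj != NULL);	/* userspace-only sanity check */
	return container_of(kobj, struct device, kobj);
}
```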
Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee(a)linux.alibaba.com>
---
drivers/dax/bus.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 737b207..0e9207c 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -483,7 +483,7 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a,
int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct dax_region *dax_region = dev_get_drvdata(dev);
if (is_static(dax_region))
@@ -1212,7 +1212,7 @@ static ssize_t numa_node_show(struct device *dev,
static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct dev_dax *dev_dax = to_dev_dax(dev);
struct dax_region *dax_region = dev_dax->region;
--
1.8.3.1
1 month, 4 weeks
[ndctl PATCH] ndctl: update .gitignore
by QI Fuli
Add Documentation/ndctl/attrs.adoc and *.lo to .gitignore.
Signed-off-by: QI Fuli <qi.fuli(a)fujitsu.com>
---
.gitignore | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/.gitignore b/.gitignore
index 3ef9ff7..53512b2 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,5 @@
*.o
+*.lo
*.xml
.deps/
.libs/
@@ -15,13 +16,13 @@ Makefile.in
*.1
Documentation/daxctl/asciidoc.conf
Documentation/ndctl/asciidoc.conf
+Documentation/ndctl/attrs.adoc
Documentation/daxctl/asciidoctor-extensions.rb
Documentation/ndctl/asciidoctor-extensions.rb
.dirstamp
daxctl/config.h
daxctl/daxctl
daxctl/lib/libdaxctl.la
-daxctl/lib/libdaxctl.lo
daxctl/lib/libdaxctl.pc
*.a
ndctl/config.h
@@ -29,8 +30,6 @@ ndctl/lib/libndctl.pc
ndctl/ndctl
rhel/
sles/ndctl.spec
-util/log.lo
-util/sysfs.lo
version.m4
*.swp
cscope.files
--
2.29.2
1 month, 4 weeks
[PATCH] ndtest: Switch to using the new API kobj_to_dev()
by Yang Li
Fix the following coccicheck warning:
./tools/testing/nvdimm/test/ndtest.c:785:60-61: WARNING opportunity for
kobj_to_dev()
Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee(a)linux.alibaba.com>
---
tools/testing/nvdimm/test/ndtest.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/nvdimm/test/ndtest.c b/tools/testing/nvdimm/test/ndtest.c
index 6862915..004a36f 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -782,7 +782,7 @@ static ssize_t format1_show(struct device *dev, struct device_attribute *attr,
static umode_t ndtest_nvdimm_attr_visible(struct kobject *kobj,
struct attribute *a, int n)
{
- struct device *dev = container_of(kobj, struct device, kobj);
+ struct device *dev = kobj_to_dev(kobj);
struct nvdimm *nvdimm = to_nvdimm(dev);
struct ndtest_dimm *dimm = nvdimm_provider_data(nvdimm);
--
1.8.3.1
1 month, 4 weeks
[PATCH v5 0/9] CXL 2.0 Support
by Ben Widawsky
# Changes since v4 [1]
* Use vmemdup_user instead of open-coded (Al Viro)
* Fix when kernel docs get introduced (Ben)
* Fix unhappy sphinx '%-*' (sfr)
* Remove redundant initialization (Colin, Dan C)
* Make cxl_mem_mbox_send_cmd enforce size (Dan, Jonathan)
* Except for variable sized output (Ben)
* Fix off by one in register block enumeration (Jonathan)
* Use FIELD_GET for capability ID (Jonathan)
* Fix potential overflows on output buffer (Jonathan)
* Go back to using size_out to verify memcpy_fromio size
* Add out_size to cxl_mem_mbox_send_cmd
* UAPI change (Dan)
* Make out.size represent the actual amount written as opposed to how much
hardware wrote. The kernel docs already reflected this behavior, so it's
fair to say the change is a bug fix rather than UAPI change.
Excluding the bug fix there have been no UAPI changes since v1.
---
In addition to the mailing list, please feel free to use #cxl on oftc IRC for
discussion.
---
# Summary
Introduce support for “type-3” memory devices defined in the Compute Express
Link (CXL) 2.0 specification [2]. Specifically, these are the memory devices
defined by section 8.2.8.5 of the CXL 2.0 spec. A reference implementation
emulating these devices has been submitted to the QEMU mailing list [3] and is
available on gitlab [4], but will move to a shared tree on kernel.org after
initial acceptance. “Type-3” is a CXL device that acts as a memory expander for
RAM or Persistent Memory. The device might be interleaved with other CXL devices
in a given physical address range.
In addition to the core functionality of discovering the spec defined registers
and resources, introduce a CXL device model that will be the foundation for
translating CXL capabilities into existing Linux infrastructure for Persistent
Memory and other memory devices. For now, this only includes support for the
management command mailbox and the surfacing of type-3 devices. These control
devices fill the role of “DIMMs” / nmemX memory-devices in LIBNVDIMM terms.
## Userspace Interaction
Interaction with the driver and type-3 devices via the CXL drivers is introduced
in this patch series and considered stable ABI. They include
* sysfs - Documentation/ABI/testing/sysfs-bus-cxl
* IOCTL - Documentation/driver-api/cxl/memory-devices.rst
* debugfs - Documentation/ABI/testing/debugfs-debug
Work is in progress to add support for CXL interactions to the ndctl project [5].
### Development plans
One of the unique challenges that CXL imposes on the Linux driver model is that
it requires the operating system to perform physical address space management
interleaved across devices and bridges. Whereas LIBNVDIMM handles a list of
established static persistent memory address ranges (for example from the ACPI
NFIT), CXL introduces hotplug and the concept of allocating address space to
instantiate persistent memory ranges. This is similar to PCI in the sense that
the platform establishes the MMIO range for PCI BARs to be allocated, but it is
significantly complicated by the fact that a given device can optionally be
interleaved with other devices and can participate in several interleave-sets at
once. LIBNVDIMM handled something like this with the aliasing between PMEM and
BLOCK-WINDOW mode, but CXL adds flexibility to alias DEVICE MEMORY through up to
10 decoders per device.
All of the above needs to be enabled with respect to PCI hotplug events on
Type-3 memory devices, which need hooks to determine if a given device is
contributing to a "System RAM" address range that is unable to be unplugged. In
other words CXL ties PCI hotplug to Memory Hotplug and PCI hotplug needs to be
able to negotiate with memory hotplug. In the medium term the implications of
CXL hotplug vs ACPI SRAT/SLIT/HMAT need to be reconciled. One capability that
seems to be needed is either the dynamic allocation of new memory nodes, or
default initializing extra pgdat instances beyond what is enumerated in ACPI
SRAT to accommodate hot-added CXL memory.
Patches welcome, questions welcome as the development effort on the post v5.12
capabilities proceeds.
## Running in QEMU
The incantation to get CXL support in QEMU [4] is considered unstable at this
time. Future readers of this cover letter should verify if any changes are
needed. For the novice QEMU user, the following can be copy/pasted into a
working QEMU commandline. It is enough to make the simplest topology possible.
The topology would consist of a single memory window, single type3 device,
single root port, and single host bridge.
+-------------+
| CXL PXB |
| |
| +-------+ |<----------+
| |CXL RP | | |
+--+-------+--+ v
| +----------+
| | "window" |
| +----------+
v ^
+-------------+ |
| CXL Type 3 | |
| Device |<----------+
+-------------+
// Memory backend for "window"
-object memory-backend-file,id=cxl-mem1,share,mem-path=cxl-type3,size=512M
// Memory backend for LSA
-object memory-backend-file,id=cxl-mem1-lsa,share,mem-path=cxl-mem1-lsa,size=1K
// Host Bridge
-device pxb-cxl id=cxl.0,bus=pcie.0,bus_nr=52,uid=0 len-window-base=1,window-base[0]=0x4c0000000 memdev[0]=cxl-mem1
// Single root port
-device cxl rp,id=rp0,bus=cxl.0,addr=0.0,chassis=0,slot=0,memdev=cxl-mem1
// Single type3 device
-device cxl-type3,bus=rp0,memdev=cxl-mem1,id=cxl-pmem0,size=256M -device cxl-type3,bus=rp1,memdev=cxl-mem1,id=cxl-pmem1,size=256M,lsa=cxl-mem1-lsa
---
[1]: https://lore.kernel.org/linux-cxl/20210216014538.268106-1-ben.widawsky@in...
[2]: https://www.computeexpresslink.org/
[3]: https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@i...
[4]: https://gitlab.com/bwidawsk/qemu/-/tree/cxl-2.0v4
[5]: https://github.com/pmem/ndctl/tree/cxl-2.0v2
Cc: linux-acpi(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-pci(a)vger.kernel.org
Cc: Bjorn Helgaas <helgaas(a)kernel.org>
Cc: Chris Browy <cbrowy(a)avery-design.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Jon Masters <jcm(a)jonmasters.org>
Cc: Jonathan Cameron <Jonathan.Cameron(a)Huawei.com>
Cc: Rafael Wysocki <rafael.j.wysocki(a)intel.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: "John Groves (jgroves)" <jgroves(a)micron.com>
Cc: "Kelley, Sean V" <sean.v.kelley(a)intel.com>
---
Ben Widawsky (7):
cxl/mem: Find device capabilities
cxl/mem: Add basic IOCTL interface
cxl/mem: Add a "RAW" send command
cxl/mem: Enable commands via CEL
cxl/mem: Add set of informational commands
MAINTAINERS: Add maintainers of the CXL driver
cxl/mem: Add payload dumping for debug
Dan Williams (2):
cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
cxl/mem: Register CXL memX devices
.clang-format | 1 +
Documentation/ABI/testing/sysfs-bus-cxl | 26 +
Documentation/driver-api/cxl/index.rst | 12 +
.../driver-api/cxl/memory-devices.rst | 46 +
Documentation/driver-api/index.rst | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 11 +
drivers/Kconfig | 1 +
drivers/Makefile | 1 +
drivers/cxl/Kconfig | 66 +
drivers/cxl/Makefile | 7 +
drivers/cxl/bus.c | 29 +
drivers/cxl/cxl.h | 95 +
drivers/cxl/mem.c | 1553 +++++++++++++++++
drivers/cxl/pci.h | 31 +
include/linux/pci_ids.h | 1 +
include/uapi/linux/cxl_mem.h | 172 ++
17 files changed, 2054 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-cxl
create mode 100644 Documentation/driver-api/cxl/index.rst
create mode 100644 Documentation/driver-api/cxl/memory-devices.rst
create mode 100644 drivers/cxl/Kconfig
create mode 100644 drivers/cxl/Makefile
create mode 100644 drivers/cxl/bus.c
create mode 100644 drivers/cxl/cxl.h
create mode 100644 drivers/cxl/mem.c
create mode 100644 drivers/cxl/pci.h
create mode 100644 include/uapi/linux/cxl_mem.h
--
2.30.1
2 months
[PATCH v4 0/9] CXL 2.0 Support
by Ben Widawsky
# Changes since v3 [1]
* Fix use of GET_SUPPORTED_LOGS (Ben)
* Reported by Dan
* Rework userspace commands (Al, Dan)
* Don't get_user twice (Al)
* Don't pass __user @u to handle_mailbox_cmd_from_user() (Dan)
* Use void * in cxl_mem_mbox_send_cmd() (Dan)
* Fix for 32b builds (Stephen, Randy, more)
* Include io-64-nonatomic-lo-hi.h in mem.c
* Use GENMASK_ULL where appropriate
---
In addition to the mailing list, please feel free to use #cxl on oftc IRC for
discussion.
---
# Summary
Introduce support for “type-3” memory devices defined in the Compute Express
Link (CXL) 2.0 specification [2]. Specifically, these are the memory devices
defined by section 8.2.8.5 of the CXL 2.0 spec. A reference implementation
emulating these devices has been submitted to the QEMU mailing list [3] and is
available on gitlab [4], but will move to a shared tree on kernel.org after
initial acceptance. “Type-3” is a CXL device that acts as a memory expander for
RAM or Persistent Memory. The device might be interleaved with other CXL devices
in a given physical address range.
In addition to the core functionality of discovering the spec defined registers
and resources, introduce a CXL device model that will be the foundation for
translating CXL capabilities into existing Linux infrastructure for Persistent
Memory and other memory devices. For now, this only includes support for the
management command mailbox and the surfacing of type-3 devices. These control
devices fill the role of “DIMMs” / nmemX memory-devices in LIBNVDIMM terms.
## Userspace Interaction
Interaction with the driver and type-3 devices via the CXL drivers is introduced
in this patch series and considered stable ABI. They include
* sysfs - Documentation/ABI/testing/sysfs-bus-cxl
* IOCTL - Documentation/driver-api/cxl/memory-devices.rst
* debugfs - Documentation/ABI/testing/debugfs-debug
Work is in progress to add support for CXL interactions to the ndctl project [5].
### Development plans
One of the unique challenges that CXL imposes on the Linux driver model is that
it requires the operating system to perform physical address space management
interleaved across devices and bridges. Whereas LIBNVDIMM handles a list of
established static persistent memory address ranges (for example from the ACPI
NFIT), CXL introduces hotplug and the concept of allocating address space to
instantiate persistent memory ranges. This is similar to PCI in the sense that
the platform establishes the MMIO range for PCI BARs to be allocated, but it is
significantly complicated by the fact that a given device can optionally be
interleaved with other devices and can participate in several interleave-sets at
once. LIBNVDIMM handled something like this with the aliasing between PMEM and
BLOCK-WINDOW mode, but CXL adds flexibility to alias DEVICE MEMORY through up to
10 decoders per device.
All of the above needs to be enabled with respect to PCI hotplug events on
Type-3 memory devices, which need hooks to determine if a given device is
contributing to a "System RAM" address range that is unable to be unplugged. In
other words CXL ties PCI hotplug to Memory Hotplug and PCI hotplug needs to be
able to negotiate with memory hotplug. In the medium term the implications of
CXL hotplug vs ACPI SRAT/SLIT/HMAT need to be reconciled. One capability that
seems to be needed is either the dynamic allocation of new memory nodes, or
default initializing extra pgdat instances beyond what is enumerated in ACPI
SRAT to accommodate hot-added CXL memory.
Patches welcome, questions welcome as the development effort on the post v5.12
capabilities proceeds.
## Running in QEMU
The incantation to get CXL support in QEMU [4] is considered unstable at this
time. Future readers of this cover letter should verify if any changes are
needed. For the novice QEMU user, the following can be copy/pasted into a
working QEMU commandline. It is enough to make the simplest topology possible.
The topology would consist of a single memory window, single type3 device,
single root port, and single host bridge.
+-------------+
| CXL PXB |
| |
| +-------+ |<----------+
| |CXL RP | | |
+--+-------+--+ v
| +----------+
| | "window" |
| +----------+
v ^
+-------------+ |
| CXL Type 3 | |
| Device |<----------+
+-------------+
// Memory backend for "window"
-object memory-backend-file,id=cxl-mem1,share,mem-path=cxl-type3,size=512M
// Memory backend for LSA
-object memory-backend-file,id=cxl-mem1-lsa,share,mem-path=cxl-mem1-lsa,size=1K
// Host Bridge
-device pxb-cxl id=cxl.0,bus=pcie.0,bus_nr=52,uid=0 len-window-base=1,window-base[0]=0x4c0000000 memdev[0]=cxl-mem1
// Single root port
-device cxl rp,id=rp0,bus=cxl.0,addr=0.0,chassis=0,slot=0,memdev=cxl-mem1
// Single type3 device
-device cxl-type3,bus=rp0,memdev=cxl-mem1,id=cxl-pmem0,size=256M -device cxl-type3,bus=rp1,memdev=cxl-mem1,id=cxl-pmem1,size=256M,lsa=cxl-mem1-lsa
---
[1]: https://lore.kernel.org/linux-cxl/20210212222541.2123505-1-ben.widawsky@i...
[2]: https://www.computeexpresslink.org/
[3]: https://lore.kernel.org/qemu-devel/20210202005948.241655-1-ben.widawsky@i...
[4]: https://gitlab.com/bwidawsk/qemu/-/tree/cxl-2.0v4
[5]: https://github.com/pmem/ndctl/tree/cxl-2.0v2
Cc: linux-acpi(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-pci(a)vger.kernel.org
Cc: Bjorn Helgaas <helgaas(a)kernel.org>
Cc: Chris Browy <cbrowy(a)avery-design.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Jon Masters <jcm(a)jonmasters.org>
Cc: Jonathan Cameron <Jonathan.Cameron(a)Huawei.com>
Cc: Rafael Wysocki <rafael.j.wysocki(a)intel.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: "John Groves (jgroves)" <jgroves(a)micron.com>
Cc: "Kelley, Sean V" <sean.v.kelley(a)intel.com>
---
Ben Widawsky (7):
cxl/mem: Find device capabilities
cxl/mem: Add basic IOCTL interface
cxl/mem: Add a "RAW" send command
cxl/mem: Enable commands via CEL
cxl/mem: Add set of informational commands
MAINTAINERS: Add maintainers of the CXL driver
cxl/mem: Add payload dumping for debug
Dan Williams (2):
cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
cxl/mem: Register CXL memX devices
.clang-format | 1 +
Documentation/ABI/testing/sysfs-bus-cxl | 26 +
Documentation/driver-api/cxl/index.rst | 12 +
.../driver-api/cxl/memory-devices.rst | 46 +
Documentation/driver-api/index.rst | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 11 +
drivers/Kconfig | 1 +
drivers/Makefile | 1 +
drivers/cxl/Kconfig | 66 +
drivers/cxl/Makefile | 7 +
drivers/cxl/bus.c | 29 +
drivers/cxl/cxl.h | 93 +
drivers/cxl/mem.c | 1540 +++++++++++++++++
drivers/cxl/pci.h | 31 +
include/linux/pci_ids.h | 1 +
include/uapi/linux/cxl_mem.h | 170 ++
17 files changed, 2037 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-cxl
create mode 100644 Documentation/driver-api/cxl/index.rst
create mode 100644 Documentation/driver-api/cxl/memory-devices.rst
create mode 100644 drivers/cxl/Kconfig
create mode 100644 drivers/cxl/Makefile
create mode 100644 drivers/cxl/bus.c
create mode 100644 drivers/cxl/cxl.h
create mode 100644 drivers/cxl/mem.c
create mode 100644 drivers/cxl/pci.h
create mode 100644 include/uapi/linux/cxl_mem.h
--
2.30.1
2 months
[PATCH v2 0/5] dax-device: Some cleanups
by Uwe Kleine-König
Hello,
I didn't get any feedback for the (implicit) v1 of this series that
started with Message-Id: 20210127230124.109522-1-uwe(a)kleine-koenig.org,
but I identified a few improvements myself:
- Use "dax-device" consistently as a prefix
- Instead of requiring a .remove callback, make it explicitly
optional. (Drop checking for .remove from former patch 1, introduce
new patch "Properly handle drivers without remove callback")
- The new patch about remove being optional allows simplifying one of
the two dax drivers which is implemented in patch 4
- Patch 5 got a bit smaller because we now have one driver less with a
remove callback.
- Added Andrew to To: as he merged dax drivers in the past.
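
A rough userspace sketch of the behaviour these points describe, going
by the patch titles only: registration rejects a driver without a probe
callback, the remove callback is optional, and it returns void. All
demo_* names are hypothetical and this is not the code from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_dev {
	int removed;
};

struct demo_driver {
	int (*probe)(struct demo_dev *dev);	/* mandatory */
	void (*remove)(struct demo_dev *dev);	/* optional, returns void */
};

/* Refuse drivers that could never bind to a device. */
static int demo_driver_register(const struct demo_driver *drv)
{
	assert(drv != NULL);
	if (!drv->probe)
		return -EINVAL;
	return 0;
}

/* Bus-level remove: only call the driver's callback when it exists. */
static void demo_bus_remove(const struct demo_driver *drv,
			    struct demo_dev *dev)
{
	if (drv->remove)
		drv->remove(dev);
}

static int demo_probe(struct demo_dev *dev)
{
	dev->removed = 0;
	return 0;
}

static void demo_remove(struct demo_dev *dev)
{
	dev->removed = 1;
}
```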
Andrew: Assuming you consider these patches useful, would you please
take care of merging them?
Best regards
Uwe
Uwe Kleine-König (5):
dax-device: Prevent registering drivers without probe callback
dax-device: Properly handle drivers without remove callback
dax-device: Fix error path in dax_driver_register
dax-device: Drop an empty .remove callback
dax-device: Make remove callback return void
drivers/dax/bus.c | 22 ++++++++++++++++++++--
drivers/dax/bus.h | 2 +-
drivers/dax/device.c | 8 +-------
drivers/dax/kmem.c | 7 ++-----
4 files changed, 24 insertions(+), 15 deletions(-)
base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
--
2.29.2
2 months
[PATCH 1/2] libnvdimm: simplify nvdimm_remove()
by Uwe Kleine-König
nvdimm_remove is only ever called after nvdimm_probe() returned
successfully. In this case driver data is always set to a non-NULL value
so the check for driver data being NULL can go away as it's always false.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
---
drivers/nvdimm/dimm.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 7d4ddc4d9322..94be3ae1d29f 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -117,9 +117,6 @@ static int nvdimm_remove(struct device *dev)
{
struct nvdimm_drvdata *ndd = dev_get_drvdata(dev);
- if (!ndd)
- return 0;
-
nvdimm_bus_lock(dev);
dev_set_drvdata(dev, NULL);
nvdimm_bus_unlock(dev);
base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
--
2.29.2
2 months
[PATCH] dax: fix default return code of range_parse()
by Shiyang Ruan
The return value of range_parse() indicates the size when it is
positive. The error code should be negative.
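
To illustrate why the sign matters, a small userspace sketch of the
same return convention. parse_range() is a hypothetical parser for a
hex "start-end" string and is not the kernel's range_parse().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

struct demo_range {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Return the size of the parsed range (positive) or a negative errno.
 * Returning a positive EINVAL would be indistinguishable from a valid
 * range whose size happens to equal that number.
 */
static ssize_t parse_range(const char *s, struct demo_range *r)
{
	unsigned long long a, b;

	assert(s != NULL && r != NULL);
	if (sscanf(s, "%llx-%llx", &a, &b) != 2 || b < a)
		return -EINVAL;

	r->start = a;
	r->end = b;
	return (ssize_t)(b - a + 1);
}
```

Callers can then rely on a single check such as "if (rc < 0)" to
separate failures from sizes.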
Signed-off-by: Shiyang Ruan <ruansy.fnst(a)cn.fujitsu.com>
---
drivers/dax/bus.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 737b207c9e30..3003558c1a8b 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1038,7 +1038,7 @@ static ssize_t range_parse(const char *opt, size_t len, struct range *range)
{
unsigned long long addr = 0;
char *start, *end, *str;
- ssize_t rc = EINVAL;
+ ssize_t rc = -EINVAL;
str = kstrdup(opt, GFP_KERNEL);
if (!str)
--
2.30.0
2 months