[PATCH v3 00/23] device-dax: Support sub-dividing soft-reserved ranges
by Dan Williams
Changes since v2 [1]:
- Rebase on next/master to resolve conflicts with pending mem-hotplug
and memremap_pages() changes in -mm
- Drop attempt at a generic phys_to_target_node() implementation and
just follow the default fallback approach taken with
memory_add_physaddr_to_nid() (Mike)
- Fix test_hmm and other compilation fixups (Ralph)
- Integrate Joao's extensions to the device-dax sub-division interface
(per-device align, user-directed extent allocation). (Joao)
[1]: http://lore.kernel.org/r/159457116473.754248.7879464730875147365.stgit@dw...
---
Merge notes:
Andrew, this series is rebased on today's next/master to resolve
conflicts with some pending patches in -mm. I'd like to take it through
your tree given the intersections with memremap_pages() and memory
hotplug. If at all possible I'd like to see it in v5.10, but I realize
time is short. Outside of the Intel-identified use cases for this, Joao
has identified a use case for Oracle as well.
I would have sent this earlier save for the fact I am mostly offline
tending to a newborn these days. Vishal has stepped up to take on care
and feeding of this patchset if additional review / integration fixups
are needed.
The one test feedback this wants is from Justin (justin.he(a)arm.com), and
whether this lights up dax_kmem and now dax_hmem for him on arm64.
Otherwise, Joao has written unit tests for this in his enabling of the
daxctl userspace utility [2].
---
Cover:
The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool, i.e. the
current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.
[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@...
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@...
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
---
Dan Williams (19):
x86/numa: Cleanup configuration dependent command-line options
x86/numa: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
mm/memory_hotplug: Introduce default phys_to_target_node() implementation
ACPI: HMAT: Attach a device for each soft-reserved range
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce 'seed' devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
Joao Martins (4):
device-dax: Make align a per-device property
device-dax: Add an 'align' attribute
dax/hmem: Introduce dax_hmem.region_idle parameter
device-dax: Add a range mapping allocation attribute
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
arch/x86/include/asm/numa.h | 8
arch/x86/kernel/e820.c | 16
arch/x86/mm/numa.c | 11
arch/x86/mm/numa_emulation.c | 3
arch/x86/xen/enlighten_pv.c | 2
drivers/acpi/numa/hmat.c | 76 --
drivers/acpi/numa/srat.c | 9
drivers/base/core.c | 2
drivers/dax/Kconfig | 4
drivers/dax/Makefile | 3
drivers/dax/bus.c | 1055 ++++++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 +
drivers/dax/dax-private.h | 40 +
drivers/dax/device.c | 132 ++--
drivers/dax/hmem.c | 56 --
drivers/dax/hmem/Makefile | 6
drivers/dax/hmem/device.c | 100 +++
drivers/dax/hmem/hmem.c | 65 ++
drivers/dax/kmem.c | 199 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 -
drivers/firmware/efi/x86_fake_mem.c | 12
drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 -
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/acpi/acpi_numa.h | 14
include/linux/dax.h | 8
include/linux/memory_hotplug.h | 5
include/linux/memremap.h | 11
include/linux/range.h | 6
kernel/resource.c | 11
lib/test_hmm.c | 15
mm/memory_hotplug.c | 10
mm/memremap.c | 299 +++++----
tools/testing/nvdimm/dax-dev.c | 22 -
tools/testing/nvdimm/test/iomap.c | 2
42 files changed, 1810 insertions(+), 588 deletions(-)
delete mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
create mode 100644 drivers/dax/hmem/hmem.c
base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26
[PATCH 1/2] libnvdimm/security: 'security' attr never show 'overwrite' state
by Jane Chu
Since
commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute"),
when issuing
# ndctl sanitize-dimm nmem0 --overwrite
and then immediately checking the 'security' attribute,
# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
unlocked
In fact, the attribute stays 'unlocked' throughout the entire overwrite
operation and never changes. That's because 'nvdimm->sec.flags' is a
bitmap that has both the 'overwrite' and 'unlocked' bits set, but
security_show() checks the mutually exclusive bits first and the
'overwrite' bit last. The 'overwrite' bit should be checked first.
The commit also has a typo: in one place, the 'nvdimm->sec.ext_flags'
assignment is replaced with a 'nvdimm->sec.flags' assignment for
the NVDIMM_MASTER type.
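To illustrate why the check order matters, here is a minimal userspace sketch. It is not the kernel code: the enum values, sec_test_bit() and the two state_*_order() helpers are hypothetical stand-ins for the NVDIMM_SECURITY_* bits, test_bit() and security_show().

```c
#include <stddef.h>

/* Illustrative bit positions only; not the kernel's actual enum values. */
enum sec_bit {
	SEC_DISABLED,
	SEC_UNLOCKED,
	SEC_LOCKED,
	SEC_OVERWRITE,
};

/* Userspace stand-in for the kernel's test_bit(). */
int sec_test_bit(enum sec_bit bit, unsigned long flags)
{
	return (flags >> bit) & 1UL;
}

/* Pre-patch order: 'overwrite' is checked last, so it is shadowed. */
const char *state_old_order(unsigned long flags)
{
	if (sec_test_bit(SEC_DISABLED, flags))
		return "disabled";
	if (sec_test_bit(SEC_UNLOCKED, flags))
		return "unlocked";
	if (sec_test_bit(SEC_LOCKED, flags))
		return "locked";
	if (sec_test_bit(SEC_OVERWRITE, flags))
		return "overwrite";
	return NULL;
}

/* Post-patch order: 'overwrite' is checked before the exclusive states. */
const char *state_new_order(unsigned long flags)
{
	if (sec_test_bit(SEC_OVERWRITE, flags))
		return "overwrite";
	if (sec_test_bit(SEC_DISABLED, flags))
		return "disabled";
	if (sec_test_bit(SEC_UNLOCKED, flags))
		return "unlocked";
	if (sec_test_bit(SEC_LOCKED, flags))
		return "locked";
	return NULL;
}
```

With both the 'unlocked' and 'overwrite' bits set, the old order reports "unlocked" while the new order reports "overwrite".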
Cc: Dan Williams <dan.j.williams(a)intel.com>
Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
---
drivers/nvdimm/dimm_devs.c | 4 ++--
drivers/nvdimm/security.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b7b77e8..5d72026 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
{
struct nvdimm *nvdimm = to_nvdimm(dev);
+ if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
+ return sprintf(buf, "overwrite\n");
if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
return sprintf(buf, "disabled\n");
if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
return sprintf(buf, "unlocked\n");
if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
return sprintf(buf, "locked\n");
- if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
- return sprintf(buf, "overwrite\n");
return -ENOTTY;
}
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4cef69b..8f3971c 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
put_device(&nvdimm->dev);
nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
- nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+ nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
}
void nvdimm_security_overwrite_query(struct work_struct *work)
--
1.8.3.1
[PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
by Vaibhav Jain
Changes since v3[1]:
* Fixed a rebase issue pointed out by Aneesh in first patch in the series.
[1] https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaibhav@linu...
---
This small patchset implements kernel-side support for reporting the
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL-side changes, the output
should look like:
$ sudo ndctl list -DH
[
  {
    "dev":"nmem0",
    "health":{
      "health_state":"ok",
      "life_used_percentage":0,
      "shutdown_state":"clean"
    }
  }
]
PHYP supports the H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including the 'fuel_gauge' percentage for
an NVDIMM. The 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as a percentage, and 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
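As a small worked example of that conversion, the sketch below computes the value in plain C. The function name and the range check are assumptions made for illustration; the patches themselves may validate the hcall result differently.

```c
/*
 * Convert a 'fuel_gauge' percentage (usable life remaining) into
 * 'life_used_percentage'. Returns -1 for values outside 0..100; that
 * error convention is an assumption of this sketch, not the patchset's.
 */
int life_used_percentage(int fuel_gauge)
{
	if (fuel_gauge < 0 || fuel_gauge > 100)
		return -1;
	return 100 - fuel_gauge;
}
```

A fuel gauge reading of 100 therefore maps to a life_used_percentage of 0, matching the sample ndctl output above.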
Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.
Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.
Vaibhav Jain (2):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +
arch/powerpc/platforms/pseries/papr_scm.c | 199 ++++++++++++++++++
3 files changed, 235 insertions(+)
--
2.26.2
[PATCH] ACPI: NFIT: Fix ARS zero-sized allocation
by Dan Williams
A pending commit in -next, "devres: handle zero size in devm_kmalloc()",
triggers a boot regression due to the ARS implementation expecting NULL
from a zero-sized allocation. Avoid the zero-sized allocation by
skipping ARS; otherwise the kernel crashes with the following signature
when dereferencing ZERO_SIZE_PTR.
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
RIP: 0010:__acpi_nfit_scrub+0x28a/0x350 [nfit]
[..]
Call Trace:
? acpi_nfit_query_poison+0x6a/0x180 [nfit]
acpi_nfit_scrub+0x36/0xb0 [nfit]
process_one_work+0x23c/0x580
worker_thread+0x50/0x3b0
Otherwise the implementation correctly aborts when NULL is returned from
devm_kzalloc() in ars_status_alloc().
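For readers unfamiliar with ZERO_SIZE_PTR, the following userspace sketch shows why a plain NULL check is not enough once a zero-sized allocation returns that sentinel. The two macros mirror the definitions in the kernel's include/linux/slab.h; sketch_alloc() and the two check helpers are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the kernel definitions: a non-NULL sentinel for size-0 requests. */
#define ZERO_SIZE_PTR ((void *)16)
#define ZERO_OR_NULL_PTR(x) ((uintptr_t)(x) <= (uintptr_t)ZERO_SIZE_PTR)

/* Hypothetical allocator that returns the sentinel for zero-sized requests. */
void *sketch_alloc(size_t size)
{
	if (size == 0)
		return ZERO_SIZE_PTR;
	return calloc(1, size);
}

/* The check the old code relied on: passes for ZERO_SIZE_PTR. */
int null_check_passes(const void *p)
{
	return p != NULL;
}

/* A check that also rejects the zero-size sentinel. */
int safe_to_deref(const void *p)
{
	return !ZERO_OR_NULL_PTR(p);
}
```

A pointer from sketch_alloc(0) passes the NULL check but is not safe to dereference. The faulting address 0x18 in the trace above would be consistent with reading a field at a small offset from that sentinel. The patch avoids the situation by never making the zero-sized allocation in the first place.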
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/acpi/nfit/core.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index fb775b967c52..26dd208a0d63 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -3334,7 +3334,7 @@ static void acpi_nfit_init_ars(struct acpi_nfit_desc *acpi_desc,
static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
{
struct nfit_spa *nfit_spa;
- int rc;
+ int rc, do_sched_ars = 0;
set_bit(ARS_VALID, &acpi_desc->scrub_flags);
list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
@@ -3346,7 +3346,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
}
}
- list_for_each_entry(nfit_spa, &acpi_desc->spas, list)
+ list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
switch (nfit_spa_type(nfit_spa->spa)) {
case NFIT_SPA_VOLATILE:
case NFIT_SPA_PM:
@@ -3354,6 +3354,13 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
rc = ars_register(acpi_desc, nfit_spa);
if (rc)
return rc;
+
+ /*
+ * Kick off background ARS if at least one
+ * region successfully registered ARS
+ */
+ if (!test_bit(ARS_FAILED, &nfit_spa->ars_state))
+ do_sched_ars++;
break;
case NFIT_SPA_BDW:
/* nothing to register */
@@ -3372,8 +3379,10 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
/* don't register unknown regions */
break;
}
+ }
- sched_ars(acpi_desc);
+ if (do_sched_ars)
+ sched_ars(acpi_desc);
return 0;
}