[PATCH 1/2] libnvdimm/security: 'security' attr never show 'overwrite' state
by Jane Chu
Since
commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute"),
when issuing
# ndctl sanitize-dimm nmem0 --overwrite
and then immediately checking the 'security' attribute, the result is
# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
unlocked
In fact the attribute stays 'unlocked' throughout the entire overwrite
operation and never changes. That's because 'nvdimm->sec.flags' is a bitmap
that has both the 'overwrite' and 'unlocked' bits set, but security_show()
checks the mutually exclusive bits first and the 'overwrite' bit last.
The order should be reversed.
The commit also has a typo: in one place, the 'nvdimm->sec.ext_state'
assignment was replaced with a 'nvdimm->sec.flags' assignment (rather than
'nvdimm->sec.ext_flags') for the NVDIMM_MASTER type.
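The ordering problem can be reproduced outside the kernel. The sketch below is a simplified userspace model, not the driver code: the enum values and the helper names show_before() and show_after() are made up for illustration, and only the order of the checks mirrors security_show().

```c
#include <stddef.h>

/* Hypothetical bit numbers; the kernel's NVDIMM_SECURITY_* values may differ. */
enum sec_bit {
	SEC_DISABLED,
	SEC_UNLOCKED,
	SEC_LOCKED,
	SEC_OVERWRITE,
};

static int has_bit(unsigned long flags, enum sec_bit bit)
{
	return (flags >> bit) & 1UL;
}

/* Order used before the fix: 'overwrite' is checked last. */
static const char *show_before(unsigned long flags)
{
	if (has_bit(flags, SEC_DISABLED))
		return "disabled";
	if (has_bit(flags, SEC_UNLOCKED))
		return "unlocked";
	if (has_bit(flags, SEC_LOCKED))
		return "locked";
	if (has_bit(flags, SEC_OVERWRITE))
		return "overwrite";
	return NULL;
}

/* Order used after the fix: 'overwrite' wins over the other states. */
static const char *show_after(unsigned long flags)
{
	if (has_bit(flags, SEC_OVERWRITE))
		return "overwrite";
	return show_before(flags);
}
```

With both the 'unlocked' and 'overwrite' bits set, show_before() reports "unlocked" while show_after() reports "overwrite", which is the behaviour change the patch is after.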
Cc: Dan Williams <dan.j.williams(a)intel.com>
Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
---
drivers/nvdimm/dimm_devs.c | 4 ++--
drivers/nvdimm/security.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b7b77e8..5d72026 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
{
struct nvdimm *nvdimm = to_nvdimm(dev);
+ if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
+ return sprintf(buf, "overwrite\n");
if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
return sprintf(buf, "disabled\n");
if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
return sprintf(buf, "unlocked\n");
if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
return sprintf(buf, "locked\n");
- if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
- return sprintf(buf, "overwrite\n");
return -ENOTTY;
}
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4cef69b..8f3971c 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
put_device(&nvdimm->dev);
nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
- nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+ nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
}
void nvdimm_security_overwrite_query(struct work_struct *work)
--
1.8.3.1
[PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
by Vaibhav Jain
Changes since v3[1]:
* Fixed a rebase issue pointed out by Aneesh in first patch in the series.
[1] https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaibhav@linu...
---
This small patchset implements kernel side support for reporting the
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL side changes, the output
should look like:
$ sudo ndctl list -DH
[
{
"dev":"nmem0",
"health":{
"health_state":"ok",
"life_used_percentage":0,
"shutdown_state":"clean"
}
}
]
PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
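The conversion is a one-line calculation. The helper below is only an illustrative sketch; the function name is hypothetical and it assumes 'fuel_gauge' has already been read as a percentage between 0 and 100.

```c
/*
 * Hypothetical helper: convert the remaining-life percentage reported by
 * the hypervisor into the used-life percentage shown by ndctl.
 * Assumes 0 <= fuel_gauge <= 100; out-of-range input is not handled.
 */
static int life_used_percentage(int fuel_gauge)
{
	return 100 - fuel_gauge;
}
```

A 'fuel_gauge' of 100 therefore corresponds to the "life_used_percentage":0 value in the sample output above.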
Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.
Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.
Vaibhav Jain (2):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +
arch/powerpc/platforms/pseries/papr_scm.c | 199 ++++++++++++++++++
3 files changed, 235 insertions(+)
--
2.26.2
[PATCH v2 0/7] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
This is an implementation of "secret" mappings backed by a file descriptor.
v2 changes:
* Follow Michael's suggestion and name the new system call 'memfd_secret'
* Add kernel-parameters documentation about the boot option
* Fix i386-tinyconfig regression reported by the kbuild bot.
CONFIG_SECRETMEM now depends on !EMBEDDED to disable it on small systems
from one side and still make it available unconditionally on
architectures that support SET_DIRECT_MAP.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will have desired protection bits set in the user page
table. For instance, current implementation allows uncached mappings.
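From userspace the sequence would look roughly like the sketch below. It is an assumption-laden illustration rather than part of this series: it follows the interface as it was later documented in memfd_secret(2) (flags of 0, ftruncate() to set the size, then a MAP_SHARED mmap()), which may differ from this v2 posting, and it relies on the installed headers defining SYS_memfd_secret. The helper name secret_map() is made up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Map 'len' bytes of secret memory. Returns NULL if the installed headers
 * or the running kernel do not provide memfd_secret, or if any step fails.
 */
static void *secret_map(size_t len)
{
#ifdef SYS_memfd_secret
	void *p;
	int fd = (int)syscall(SYS_memfd_secret, 0U);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	return p == MAP_FAILED ? NULL : p;
#else
	(void)len;
	return NULL;
#endif
}
```

A caller would store key material in the returned buffer and munmap() it when done; a NULL return simply means secret memory is unavailable on that system.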
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.
Additionally, the secret mappings may be used as a means to protect guest
memory in a virtual machine host.
To demonstrate secret memory usage we've created a userspace library
[1] that does two things. First, it acts as a preloader for openssl that
redirects all the OPENSSL_malloc calls to secret memory, so any secret
keys get automatically protected this way. Second, it exposes the API to
the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
I hesitated over whether to keep using new flags to memfd_create() or to
add a new system call, and decided on a new system call after I started
looking into the man page updates. There would have been two completely
independent descriptions, which I think would have been very confusing.
Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
As the fragmentation of the direct map was one of the major concerns raised
during the previous postings, I've added an amortizing cache of PMD-size
pages to each file descriptor and an ability to reserve large chunks of the
physical memory at boot time and then use this memory as an allocation pool
for the secret memory areas.
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org/
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
Mike Rapoport (7):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
mm: introduce memfd_secret system call to create "secret" memory areas
arch, mm: wire up memfd_secret system call were relevant
mm: secretmem: use PMD-size pages to amortize direct map fragmentation
mm: secretmem: add ability to reserve memory at boot
mm: secretmem: add ability to reserve memory at boot
.../admin-guide/kernel-parameters.txt | 4 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/riscv/include/asm/unistd.h | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/dax.c | 10 +-
include/linux/pgtable.h | 3 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
include/uapi/linux/secretmem.h | 9 +
kernel/sys_ni.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/internal.h | 3 +
mm/mmap.c | 5 +-
mm/secretmem.c | 453 ++++++++++++++++++
18 files changed, 500 insertions(+), 9 deletions(-)
create mode 100644 include/uapi/linux/secretmem.h
create mode 100644 mm/secretmem.c
--
2.26.2
[PATCH v3 0/6] Fix and enable pmem as RAM device on arm64
by Jia He
This fixes a few issues I hit when trying to enable pmem as a RAM device on arm64.
To use memory_add_physaddr_to_nid() as a fallback nid, it is better to
implement a general (__weak) version in mm/memory_hotplug. After that, arm64/
sh/s390 can simply use the general version, and PowerPC/ia64/x86 will use
their arch specific versions.
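The weak-symbol fallback can be illustrated in plain userspace C. This is a sketch under assumptions, not the code from the series: it uses the GCC/Clang weak attribute in place of the kernel's __weak macro, and it assumes the dummy simply reports node 0.

```c
#include <stdint.h>

/*
 * Generic fallback, modelled on a __weak kernel function. An architecture
 * that knows its NUMA layout would provide a non-weak definition with the
 * same name, and the linker would pick that one instead of this one.
 */
__attribute__((weak)) int memory_add_physaddr_to_nid(uint64_t start)
{
	(void)start;
	return 0;	/* assumption: fall back to node 0 */
}
```

With no stronger definition linked in, every physical address maps to node 0, which is what lets callers such as the device-dax kmem driver use the function as a last-resort nid.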
Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The
memblocks from the dax pmem device can be either hot-added or hot-removed
on arm64 guest. Also passed the compilation test on x86.
Changes:
v3: - introduce general version memory_add_physaddr_to_nid, refine the arch
specific one
- fix an uninitialization bug in v2 device-dax patch
v2: https://lkml.org/lkml/2020/7/7/71
- Drop unnecessary patch to harden try_offline_node
- Use new solution(by David) to fix dev->target_node=-1 during probing
- Refine the mem_hotplug_begin/done patch
v1: https://lkml.org/lkml/2020/7/5/381
Jia He (6):
mm/memory_hotplug: introduce default dummy
memory_add_physaddr_to_nid()
arm64/mm: use default dummy memory_add_physaddr_to_nid()
sh/mm: use default dummy memory_add_physaddr_to_nid()
mm: don't export memory_add_physaddr_to_nid in arch specific directory
device-dax: use fallback nid when numa_node is invalid
mm/memory_hotplug: fix unpaired mem_hotplug_begin/done
arch/arm64/mm/numa.c | 10 ----------
arch/ia64/mm/numa.c | 2 --
arch/sh/mm/init.c | 9 ---------
arch/x86/mm/numa.c | 1 -
drivers/dax/kmem.c | 21 +++++++++++++--------
mm/memory_hotplug.c | 15 ++++++++++++---
6 files changed, 25 insertions(+), 33 deletions(-)
--
2.17.1
[PATCH v2 00/22] device-dax: Support sub-dividing soft-reserved
ranges
by Dan Williams
Changes since v1 [1]:
- Combine this series with "Manual definition of Soft Reserved memory
devices" [2] as the pre-requisites are required to test the device-dax
facility, the device-dax changes are part of the justification for the
numa-info reworks.
- Provide a generic version of numa data retrieval based on memblock for
arm64 rather than adding a new / empty phys_to_target_node() stub
alongside memory_add_physaddr_to_nid(). (Will)
- Fix several corner case allocation bugs and pass the unit test written
by Joao.
- Lift the restriction that a 'seed' device must be activated before a
new seed can be created. This minor sanity check was to prevent userspace
spamming devices, but it gets in the way of some allocation
scenarios like allocating a memory-range that is guaranteed to never be
evicted due to memory-side-cache conflicts. (Iqbal)
- Add debug prints for space allocation decisions (Joao)
- Rebased on v5.8-rc2 which included resolving conflicts in the kmem
driver and memremap_pages().
[1]: http://lore.kernel.org/r/158500767138.2088294.17131646259803932461.stgit@...
[2]: http://lore.kernel.org/r/158489354353.1457606.8327903161927980740.stgit@d...
---
The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.
Given the intersections of arm64, x86, and core memremap_pages() changes
I'd like to explore taking this through the libnvdimm tree, but that is
step 2. Any concerns with the proposed infrastructure changes
(memblock-numainfo and multi-range-memremap-pages)?
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@...
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@...
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
---
Dan Williams (22):
x86/numa: Cleanup configuration dependent command-line options
x86/numa: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
x86: Move NUMA_KEEP_MEMINFO and related definition to x86-internals
numa: Introduce a generic memory_add_physaddr_to_nid()
memblock: Introduce a generic phys_addr_to_target_node()
arm64: Convert to generic memblock for numa-info
ACPI: HMAT: Attach a device for each soft-reserved range
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce 'seed' devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
arch/arm64/Kconfig | 1
arch/arm64/mm/numa.c | 10
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
arch/x86/Kconfig | 7
arch/x86/include/asm/numa.h | 8
arch/x86/kernel/e820.c | 16 +
arch/x86/mm/numa.c | 12
arch/x86/mm/numa_emulation.c | 3
arch/x86/mm/numa_internal.h | 7
arch/x86/xen/enlighten_pv.c | 2
drivers/acpi/numa/hmat.c | 76 ---
drivers/acpi/numa/srat.c | 9
drivers/base/core.c | 2
drivers/dax/Kconfig | 6
drivers/dax/Makefile | 3
drivers/dax/bus.c | 902 ++++++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 +
drivers/dax/dax-private.h | 39 +
drivers/dax/device.c | 97 ++-
drivers/dax/hmem/Makefile | 6
drivers/dax/hmem/device.c | 100 ++++
drivers/dax/hmem/hmem.c | 20 -
drivers/dax/kmem.c | 199 ++++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 +
drivers/firmware/efi/x86_fake_mem.c | 12
drivers/gpu/drm/nouveau/nouveau_dmem.c | 4
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 +
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/acpi/acpi_numa.h | 14
include/linux/dax.h | 8
include/linux/memblock.h | 4
include/linux/memremap.h | 11
include/linux/mm.h | 13
include/linux/numa.h | 9
include/linux/range.h | 6
kernel/resource.c | 11
mm/Kconfig | 7
mm/memblock.c | 22 +
mm/memremap.c | 300 ++++++-----
mm/page_alloc.c | 82 +++
tools/testing/nvdimm/dax-dev.c | 22 +
tools/testing/nvdimm/test/iomap.c | 2
48 files changed, 1705 insertions(+), 528 deletions(-)
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
rename drivers/dax/{hmem.c => hmem/hmem.c} (74%)
base-commit: 48778464bb7d346b47157d21ffde2af6b2d39110
[PATCH v3 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
by Vaibhav Jain
Changes since v2[1]:
* Updated drc_pmem_query_stats() to reduce the number of input args
to the function based suggestions from Aneesh.
[1] https://lore.kernel.org/linux-nvdimm/20200726122030.31529-1-vaibhav@linux...
---
This small patchset implements kernel side support for reporting the
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL side changes, the output
should look like:
$ sudo ndctl list -DH
[
{
"dev":"nmem0",
"health":{
"health_state":"ok",
"life_used_percentage":0,
"shutdown_state":"clean"
}
}
]
PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.
Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.
Vaibhav Jain (2):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +
arch/powerpc/platforms/pseries/papr_scm.c | 199 ++++++++++++++++++
3 files changed, 235 insertions(+)
--
2.26.2
[PATCH] dax: Fix wrong error-number passed into xas_set_err()
by Hao Li
The error number passed into xas_set_err() should be negative. Otherwise,
xas_error() will return 0, and grab_mapping_entry() will return the
found entry instead of a SIGBUS error when the entry is not a value.
The subsequent code path would then be wrong.
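Why a positive value gets lost can be seen from how the XArray encodes errors. The following is a simplified userspace model, not the kernel implementation: the model_* names are invented for this sketch, entries are plain unsigned longs instead of pointers, and the decode step assumes the arithmetic right shift of negative values that GCC and Clang provide.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_ERRNO 4095UL

/* Internal entries carry their value shifted left by two, tagged with 0b10. */
static unsigned long model_mk_internal(unsigned long v)
{
	return (v << 2) | 2;
}

/* Only the top 4095 encodable values count as errors (negative errnos). */
static bool model_is_err(unsigned long entry)
{
	return (entry & 3) == 2 && entry >= model_mk_internal(-MAX_ERRNO);
}

/* Model of xas_set_err(): remember the encoded error in the state. */
static unsigned long model_set_err(long err)
{
	return model_mk_internal((unsigned long)err);
}

/* Model of xas_error(): decode the errno, or 0 if no error is recorded. */
static int model_error(unsigned long entry)
{
	if (model_is_err(entry))
		return (int)((long)entry >> 2);
	return 0;
}
```

In this model model_error(model_set_err(-EIO)) yields -EIO, whereas model_error(model_set_err(EIO)) yields 0 because the encoded positive value falls outside the error range.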
Signed-off-by: Hao Li <lihao2018.fnst(a)cn.fujitsu.com>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..acac675fe7a6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -488,7 +488,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
if (dax_is_conflict(entry))
goto fallback;
if (!xa_is_value(entry)) {
- xas_set_err(xas, EIO);
+ xas_set_err(xas, -EIO);
goto out_unlock;
}
--
2.28.0
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
alignment
by Mike Rapoport
Hi Justin,
On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> Hi David
> > >
> > > Without this series, if qemu creates a 4G bytes nvdimm device, we can only
> > > use 2G bytes for dax pmem(kmem) in the worst case.
> > > e.g.
> > > 240000000-33fdfffff : Persistent Memory
> > > We can only use the memblock between [240000000, 2ffffffff] due to the hard
> > > limitation. It wastes too much memory space.
> > >
> > > Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but there
> > > are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> > > SPARSEMEM_VMEMMAP, page bits in struct page ...
> > >
> > > Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem alignment
> > > with memory_block_size_bytes().
> > >
> > > Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax pmem
> > > can be used as ram with smaller gap. Also the kmem hotplug add/remove are both
> > > tested on arm64/x86 guest.
> > >
> >
> > Hi,
> >
> > I am not convinced this use case is worth such hacks (that’s what it is)
> > for now. On real machines pmem is big - your example (losing 50% is
> > extreme).
> >
> > I would much rather want to see the section size on arm64 reduced. I
> > remember there were patches and that at least with a base page size of 4k
> > it can be reduced drastically (64k base pages are more problematic due to
> > the ridiculous THP size of 512M). But could be a section size of 512 is
> > possible on all configs right now.
>
> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
> There are many constraints for reducing SECTION_SIZE_BITS
> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
> much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
> into page->flags.
> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif
> - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> be reduced to 27.
>
> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the Kconfig
> might be very complicated,e.g. we still need to consider the case for
> ARM64_16K_PAGES.
It is not necessary to pollute Kconfig with that.
arch/arm64/include/asm/sparsemem.h can have something like
#ifdef CONFIG_ARM64_64K_PAGES
#define SPARSE_SECTION_SIZE 29
#elif defined(CONFIG_ARM64_16K_PAGES)
#define SPARSE_SECTION_SIZE 28
#elif defined(CONFIG_ARM64_4K_PAGES)
#define SPARSE_SECTION_SIZE 27
#else
#error
#endif
There is still a large gap with ARM64_64K_PAGES, though.
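The 29-bit floor for 64K pages follows from the two build-time checks quoted above (3.1 and 3.2). The helper below is only a worked-arithmetic sketch with a hypothetical name; it combines those two checks and ignores every other constraint, so it gives a lower bound rather than a recommended value.

```c
/*
 * Lower bound on SECTION_SIZE_BITS implied by the two build-time checks:
 *   MAX_ORDER > HPAGE_PMD_ORDER                      (hugepage_init)
 *   MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS  (mmzone.h)
 * where HPAGE_PMD_ORDER = pmd_shift - page_shift.
 */
static int min_section_size_bits(int page_shift, int pmd_shift)
{
	int hpage_pmd_order = pmd_shift - page_shift;
	int min_max_order = hpage_pmd_order + 1;

	return min_max_order - 1 + page_shift;
}
```

The result always equals the PMD shift. For 64K pages (PAGE_SHIFT 16, 512M PMD huge pages, so a PMD shift of 29) it returns 29, which is why that configuration cannot go lower without changing the THP size.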
As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?
> >
> > In the long term we might want to rework the memory block device model
> > (eventually supporting old/new as discussed with Michal some time ago
> > using a kernel parameter), dropping the fixed sizes
>
> Has this been posted to Linux mm maillist? Sorry, searched and didn't find it.
>
>
> --
> Cheers,
> Justin (Jia He)
>
>
>
> > - allowing sizes / addresses aligned with subsection size
> > - drastically reducing the number of devices for boot memory to only a
> > hand full (e.g., one per resource / DIMM we can actually unplug again.
> >
> > Long story short, I don’t like this hack.
> >
> >
> > > This patch series (mainly patch6/6) is based on the fixing patch, ~v5.8-rc5 [2].
> > >
> > > [1] https://lkml.org/lkml/2019/6/19/67
> > > [2] https://lkml.org/lkml/2020/7/8/1546
> > > Jia He (6):
> > > mm/memory_hotplug: remove redundant memory block size alignment check
> > > resource: export find_next_iomem_res() helper
> > > mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
> > > mm/page_alloc: adjust the start,end in dax pmem kmem case
> > > device-dax: relax the memblock size alignment for kmem_start
> > > arm64: fall back to vmemmap_populate_basepages if not aligned with
> > > PMD_SIZE
> > >
> > > arch/arm64/mm/mmu.c | 4 ++++
> > > drivers/base/memory.c | 24 ++++++++++++++++--------
> > > drivers/dax/kmem.c | 22 +++++++++++++---------
> > > include/linux/ioport.h | 3 +++
> > > kernel/resource.c | 3 ++-
> > > mm/memory_hotplug.c | 39 ++++++++++++++++++++++++++++++++++++++-
> > > mm/page_alloc.c | 14 ++++++++++++++
> > > 7 files changed, 90 insertions(+), 19 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >
>
--
Sincerely yours,
Mike.
[PATCH v3 00/11] ACPI/NVDIMM: Runtime Firmware Activation
by Dan Williams
Changes since v2 [1]:
- Drop the "mem-quiet" pm-debug interface in favor of an explicit
hibernate_quiet_exec() helper that executes firmware activation, or
any other subsystem provided routine, in a system-quiet context.
(Rafael)
- Rework the sysfs interface to add an explicit trigger to run
activation under hibernate_quiet_exec(). Rename
ndbusX/firmware_activate to ndbusX/firmware/activate, and add a
ndbusX/firmware/capability. Some ndctl reworks are needed to catch up
with this change.
- The new ndbusX/firmware/capability attribute indicates the default
activation method / execution context between "live" and "suspend".
[1]: http://lore.kernel.org/r/159408711335.2385045.2567600405906448375.stgit@d...
---
Quoting the documentation:
Some persistent memory devices run a firmware locally on the device /
"DIMM" to perform tasks like media management, capacity provisioning,
and health monitoring. The process of updating that firmware typically
involves a reboot because it has implications for in-flight memory
transactions. However, reboots are disruptive and at least the Intel
persistent memory platform implementation, described by the Intel ACPI
DSM specification [1], has added support for activating firmware at
runtime.
[1]: https://docs.pmem.io/persistent-memory/
The approach taken is to abstract the Intel platform specific mechanism
behind a libnvdimm-generic sysfs interface. The interface could support
runtime-firmware-activation on another architecture without need to
change userspace tooling.
The ACPI NFIT implementation involves a set of device-specific-methods
(DSMs) to 'arm' individual devices for activation and bus-level
'trigger' method to execute the activation. Informational / enumeration
methods are also provided at the bus and device level.
One complicating aspect of the memory device firmware activation is that
the memory controller may need to be quiesced, no memory cycles, during
the activation. While the platform has mechanisms to support holding off
in-flight DMA during the activation, the device response to that delay
is potentially undefined. The platform may reject a runtime firmware
update if, for example, a PCI-E device does not support its completion
timeout value being increased to meet the activation time. Outside of
device timeouts the quiesce period may also violate application
timeouts.
Given the above device and application timeout considerations the
implementation uses a new hibernate_quiet_exec() facility to carry-out
firmware activation. This imposes the same conditions that allow for a
stable memory image snapshot to be taken for a hibernate-to-disk
sequence. However, if desired, runtime activation without the hibernate
freeze can be forced as an override.
The ndctl utility grows the following extensions / commands to drive
this mechanism:
1/ The existing update-firmware command will 'arm' devices where the
firmware image is staged by default.
ndctl update-firmware all -f firmware_image.bin
2/ The existing ability to enumerate firmware-update capabilities now
includes firmware activate capabilities at the 'bus' and 'dimm/device'
level:
ndctl list -BDF -b nfit_test.0
[
{
"provider":"nfit_test.0",
"dev":"ndbus2",
"scrub_state":"idle",
"firmware":{
"activate_method":"suspend",
"activate_state":"idle"
},
"dimms":[
{
"dev":"nmem1",
"id":"cdab-0a-07e0-ffffffff",
"handle":0,
"phys_id":0,
"security":"disabled",
"firmware":{
"current_version":0,
"can_update":true
}
},
...
3/ The new activate-firmware command triggers firmware activation per
the platform enumerated context, "suspend" vs "live", or can be forced
to "live" if there is explicit knowledge that allowing applications
and devices to race the quiesce timeout will have no adverse effects.
ndctl activate-firmware nfit_test.0 [--force]
These patches are passing an updated version of the ndctl
"firmware-update.sh" unit test (to be posted).
---
Dan Williams (11):
libnvdimm: Validate command family indices
ACPI: NFIT: Move bus_dsm_mask out of generic nvdimm_bus_descriptor
ACPI: NFIT: Define runtime firmware activation commands
tools/testing/nvdimm: Cleanup dimm index passing
tools/testing/nvdimm: Add command debug messages
tools/testing/nvdimm: Prepare nfit_ctl_test() for ND_CMD_CALL emulation
tools/testing/nvdimm: Emulate firmware activation commands
driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
libnvdimm: Convert to DEVICE_ATTR_ADMIN_RO()
PM, libnvdimm: Add runtime firmware activation support
ACPI: NFIT: Add runtime firmware activate support
Documentation/ABI/testing/sysfs-bus-nfit | 19 +
Documentation/ABI/testing/sysfs-bus-nvdimm | 2
.../driver-api/nvdimm/firmware-activate.rst | 86 ++++
drivers/acpi/nfit/core.c | 142 +++++--
drivers/acpi/nfit/intel.c | 386 ++++++++++++++++++++
drivers/acpi/nfit/intel.h | 61 +++
drivers/acpi/nfit/nfit.h | 38 ++
drivers/nvdimm/bus.c | 16 +
drivers/nvdimm/core.c | 149 ++++++++
drivers/nvdimm/dimm_devs.c | 119 ++++++
drivers/nvdimm/namespace_devs.c | 2
drivers/nvdimm/nd-core.h | 1
drivers/nvdimm/pfn_devs.c | 2
drivers/nvdimm/region_devs.c | 2
include/linux/device.h | 4
include/linux/libnvdimm.h | 52 +++
include/linux/suspend.h | 6
include/linux/sysfs.h | 7
include/uapi/linux/ndctl.h | 5
kernel/power/hibernate.c | 97 +++++
tools/testing/nvdimm/test/nfit.c | 367 +++++++++++++++----
21 files changed, 1449 insertions(+), 114 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-nvdimm
create mode 100644 Documentation/driver-api/nvdimm/firmware-activate.rst
base-commit: 48778464bb7d346b47157d21ffde2af6b2d39110
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
alignment
by David Hildenbrand
On 29.07.20 10:27, Justin He wrote:
> Hi David
>
>> -----Original Message-----
>> From: David Hildenbrand <david(a)redhat.com>
>> Sent: Wednesday, July 29, 2020 2:37 PM
>> To: Justin He <Justin.He(a)arm.com>
>> Cc: Dan Williams <dan.j.williams(a)intel.com>; Vishal Verma
>> <vishal.l.verma(a)intel.com>; Mike Rapoport <rppt(a)linux.ibm.com>; David
>> Hildenbrand <david(a)redhat.com>; Catalin Marinas <Catalin.Marinas(a)arm.com>;
>> Will Deacon <will(a)kernel.org>; Greg Kroah-Hartman
>> <gregkh(a)linuxfoundation.org>; Rafael J. Wysocki <rafael(a)kernel.org>; Dave
>> Jiang <dave.jiang(a)intel.com>; Andrew Morton <akpm(a)linux-foundation.org>;
>> Steve Capper <Steve.Capper(a)arm.com>; Mark Rutland <Mark.Rutland(a)arm.com>;
>> Logan Gunthorpe <logang(a)deltatee.com>; Anshuman Khandual
>> <Anshuman.Khandual(a)arm.com>; Hsin-Yi Wang <hsinyi(a)chromium.org>; Jason
>> Gunthorpe <jgg(a)ziepe.ca>; Dave Hansen <dave.hansen(a)linux.intel.com>; Kees
>> Cook <keescook(a)chromium.org>; linux-arm-kernel(a)lists.infradead.org; linux-
>> kernel(a)vger.kernel.org; linux-nvdimm(a)lists.01.org; linux-mm(a)kvack.org; Wei
>> Yang <richardw.yang(a)linux.intel.com>; Pankaj Gupta
>> <pankaj.gupta.linux(a)gmail.com>; Ira Weiny <ira.weiny(a)intel.com>; Kaly Xin
>> <Kaly.Xin(a)arm.com>
>> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
>> alignment
>>
>>
>>
>>> On 29.07.2020 at 05:35, Jia He <justin.he(a)arm.com> wrote:
>>>
>>> When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
>>> addr in dev_dax_kmem_probe() should be aligned w/
>> SECTION_SIZE_BITS(30),i.e.
>>> 1G memblock size. Even Dan Williams' sub-section patch series [1] had
>> been
>>> upstream merged, it was not helpful due to hard limitation of kmem_start:
>>> $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f
>> -a 2M
>>> $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>>> $cat /proc/iomem
>>> ...
>>> 23c000000-23fffffff : System RAM
>>> 23dd40000-23fecffff : reserved
>>> 23fed0000-23fffffff : reserved
>>> 240000000-33fdfffff : Persistent Memory
>>> 240000000-2403fffff : namespace0.0
>>> 280000000-2bfffffff : dax0.0 <- aligned with 1G boundary
>>> 280000000-2bfffffff : System RAM
>>> Hence there is a big gap between 0x2403fffff and 0x280000000 due to the 1G
>>> alignment.
>>>
>>> Without this series, if qemu creates a 4G nvdimm device, we can use only
>>> 2G for dax pmem (kmem) in the worst case.
>>> e.g.
>>> 240000000-33fdfffff : Persistent Memory
>>> We can only use the memblock between [240000000, 2ffffffff] due to the hard
>>> limitation. It wastes too much memory space.
>>>
>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there
>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>
>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the kmem alignment
>>> with memory_block_size_bytes().
>>>
>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device. dax pmem
>>> can be used as RAM with a smaller gap. kmem hotplug add/remove was also
>>> tested on both arm64 and x86 guests.
>>>
>>
>> Hi,
>>
>> I am not convinced this use case is worth such hacks (that's what it is)
>> for now. On real machines pmem is big - your example (losing 50%) is
>> extreme.
>>
>> I would much rather see the section size on arm64 reduced. I
>> remember there were patches and that at least with a base page size of 4k
>> it can be reduced drastically (64k base pages are more problematic due to
>> the ridiculous THP size of 512M). But it could be that a section size of
>> 512M is possible on all configs right now.
>
> Yes, I once investigated thoroughly how to reduce the section size on arm64.
> There are many constraints on reducing SECTION_SIZE_BITS:
> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't
> be reduced too much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is not counted
> in page->flags.
Yep.
> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif
Yep, with 4k base pages it's 4 MB. However, with 64k base pages it's
512 MB ( :( ).
> - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot
> be reduced to 27.
I think there were plans to eventually switch to 2MB THP with 64k base
pages as well (which can be emulated using some sort of consecutive PTE
entries under arm64, don't ask me what this feature is called),
theoretically also allowing smaller section sizes (when also reducing
MAX_ORDER properly). I would highly appreciate that switch. Having max
allocation/THP in the size of gigantic pages sounds very weird to me
(and creates issues e.g., to support hot(un)plug of small memory blocks
for virtio-mem). But I guess this is not under our control :)
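The two build-time checks quoted as 3.1 and 3.2 can be turned into a small calculation. This is only a sketch of the arithmetic: the function names are hypothetical, and pmd_shift stands for log2 of the THP (PMD) size, i.e. 29 for the 512M THP with 64k base pages and 21 for a 2MB THP.

```c
#include <assert.h>

/*
 * Smallest MAX_ORDER accepted by the hugepage_init() check (3.2):
 *   HPAGE_PMD_ORDER < MAX_ORDER, with HPAGE_PMD_ORDER = pmd_shift - page_shift.
 */
int min_max_order(int page_shift, int pmd_shift)
{
    assert(pmd_shift >= page_shift);
    return (pmd_shift - page_shift) + 1;
}

/*
 * Smallest SECTION_SIZE_BITS accepted by the mmzone.h check (3.1):
 *   MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS.
 */
int min_section_size_bits(int page_shift, int max_order)
{
    assert(max_order >= 1);
    return max_order - 1 + page_shift;
}
```

For 64k base pages (page_shift 16, pmd_shift 29), min_max_order() gives 14 and min_section_size_bits(16, 14) gives 29, i.e. 512 MB, so 27 is out of reach. For 4k base pages, a MAX_ORDER of 11 (the value implied by the 4 MB figure above) gives min_section_size_bits(12, 11) = 22, so 27 is allowed by these two checks. Chaining the two functions shows the lower bound is simply pmd_shift, which is why a 2MB THP with 64k base pages would, as far as these two checks go, permit much smaller sections.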
>
> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
> might become very complicated, e.g. we still need to consider the case of
> ARM64_16K_PAGES.
Haven't looked into 16k base pages yet. But I remember it's in general
more similar to 4k than to 64k (speaking about sane THP sizes and
similar ...).
>
>>
>> In the long term we might want to rework the memory block device model
>> (eventually supporting old/new as discussed with Michal some time ago
>> using a kernel parameter), dropping the fixed sizes
>
> Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.
Yeah, but I might not be able to dig it out anymore ...
Anyhow, the idea would be to have some magic switch that converts
between the old and new world, to not break userspace that relies on that.
With old, everything would continue to work as it is. With *new* we
would have a reduced number of memory blocks for boot memory and
decouple it from a strict, static memory block size.
There would be another option in corner cases right now. If you
*know* that the metadata memory has no memmap/identity mapping and you have
1G alignment for your pmem device (including the metadata part):
1. add_memory_driver_managed() the whole memory, including the metadata part
2. use generic_online_page() to not expose metadata pages to the buddy
3. Mark metadata pages in a special way, such that you can e.g., allow to
offline memory again, including the metadata pages (e.g., PG_offline +
memory notifier like virtio-mem does)
Step 3 would only be relevant to support offlining of memory again.
If the metadata part is, however, already ZONE_DEVICE with a memmap,
then that's not an option. (I have no idea how that metadata part is
used, sorry)
--
Thanks,
David / dhildenb