[PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
by Joao Martins
Hey,
This series attempts to minimize 'struct page' overhead by
pursuing a similar approach to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.
[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@byteda...
The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the area which only describes tail pages.
So a vmemmap page describes 64 struct pages, and the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
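As a sanity check on those numbers: one 4K vmemmap page holds 64 struct
pages (64 bytes each). Here's the arithmetic as a small standalone C
program (the constants are illustrative assumptions, not code from this
series):

#include <stdio.h>

int main(void)
{
        const unsigned long page_size = 4096;
        const unsigned long structs_per_vmemmap = page_size / 64;
        const unsigned long aligns[] = { 2UL << 20, 1UL << 30 }; /* 2M, 1G */

        for (int i = 0; i < 2; i++) {
                unsigned long nr_struct_pages = aligns[i] / page_size;
                unsigned long vmemmap_pages =
                        nr_struct_pages / structs_per_vmemmap;

                /* keep the head vmemmap page plus one reusable tail page */
                printf("%4luM align: %lu vmemmap pages, %lu reused\n",
                       aligns[i] >> 20, vmemmap_pages, vmemmap_pages - 2);
        }
        return 0;
}

This prints 8 pages / 6 reused for 2M and 4096 pages / 4094 reused for 1G,
matching the savings quoted above.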
This series goes one step further for 1GB pages and *also* reuses PMD pages
which only contain tail pages, which keeps parity with the current hugepage
based memmap. This further lets us more than halve the overhead with 1GB pages
(40M -> 16M per Tb).
In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound pagemap:
* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)
Along the way I've extended it past 'struct page' overhead, *trying* to address
a few performance issues we knew about for pmem, specifically
{pin,get}_user_pages_fast() with device-dax vmas, which are really
slow even for the fast variants. THP is great on the -fast variants, but all
except hugetlbfs perform rather poorly on non-fast gup. I deferred the
__get_user_pages() improvements to a follow-up series I have stashed, as it's
orthogonal to device-dax and THP suffers from the same syndrome.
So to summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function of prep_compound_page() into head
and tail and use those two helpers where appropriate to take advantage of caches
being warm after __init_single_page(). Since the RFC this also speeds up
init time further, from 190ms down to 80ms.
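To illustrate why splitting head/tail init helps, here's a toy userspace
model of the ordering (names and layout here are made up for illustration;
see patches 2-3 for the real helpers):

#include <stddef.h>

struct toy_page {
        unsigned long flags;
        struct toy_page *compound_head;
};

static void init_single_page(struct toy_page *p)
{
        p->flags = 0;
        p->compound_head = NULL;
}

static void prep_compound_tail(struct toy_page *head, struct toy_page *p)
{
        p->compound_head = head;
}

static void memmap_init_compound(struct toy_page *map, size_t nr)
{
        for (size_t i = 0; i < nr; i++) {
                init_single_page(&map[i]);
                /* tail setup immediately, while map[i] is still cache-hot,
                 * instead of a second pass over all struct pages */
                if (i)
                        prep_compound_tail(&map[0], &map[i]);
        }
}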
Patches 5-10: Much like Muchun's series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely the @align referred to by the remaining
memremap/dax code) and enable memremap to initialize the ZONE_DEVICE pages
as compound pages of a given @align order. The main difference, though, is that
contrary to the hugetlbfs series, there's no vmemmap for the area yet, because we
are populating it as opposed to remapping it. IOW, there is no freeing of pages of
already-initialized vmemmap like in the hugetlbfs case, which simplifies the
logic (besides not being arch-specific). After these, region bootstrap of the
pmem memmap is visibly faster, given that we initialize fewer struct pages
depending on the page size, with DRAM-backed struct pages. altmap sees no
difference in bootstrap.
NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100ms/<1ms on 128G NVDIMMs
with 2M and 1G respectively.
Patch 11: Optimize grabbing page refcount changes given that we
are working with compound pages, i.e. we do 1 increment to the head
page for a given set of N subpages as opposed to N individual writes
(a rough sketch of this batching follows the results below).
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM-stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:
before after
(16G get_user_pages_fast 2M page size) ~59 ms -> ~6.1 ms
(16G pin_user_pages_fast 2M page size) ~87 ms -> ~6.2 ms
(16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
(16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms
altmap performance gets especially interesting when pinning a pmem dimm:
before after
(128G get_user_pages_fast 2M page size) ~492 ms -> ~49 ms
(128G pin_user_pages_fast 2M page size) ~493 ms -> ~50 ms
(128G get_user_pages_fast altmap 2M page size) ~3.91 s -> ~70 ms
(128G pin_user_pages_fast altmap 2M page size) ~3.97 s -> ~74 ms
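The refcount batching in patch 11 boils down to one atomic RMW on the
compound head instead of N; a toy userspace model of the idea (illustrative
only, not the kernel code):

#include <stdatomic.h>

struct toy_ref { atomic_long refcount; };

/* before: one atomic increment per subpage */
static void grab_each_subpage(struct toy_ref *pages, long n)
{
        for (long i = 0; i < n; i++)
                atomic_fetch_add(&pages[i].refcount, 1);
}

/* after: a single atomic add of n on the compound head page */
static void grab_compound_head(struct toy_ref *head, long n)
{
        atomic_fetch_add(&head->refcount, n);
}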
The unpinning improvement patches are in mmotm/linux-next, so they were
removed from this series.
I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@o...),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow-up to this.
Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
Comments and suggestions very much appreciated!
Changelog,
RFC -> v1:
(New patches 1-3, 5-8 but the diffstat isn't that different)
* Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
* Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
* Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
* Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
* Remove the gup_device_compound_huge special path and have the same code
work both ways, while special-casing when the devmap page is compound (Jason, John)
* Avoid usage of vmemmap_populate_basepages() and introduce a first class
loop that doesn't care about passing an altmap for memmap reuse. (Dan)
* Completely rework vmemmap_populate_compound() to avoid the sparse_add_section
hack of passing a block across sparse_add_section calls. It's a lot easier to
follow and more explicit in what it does.
* Replace the vmemmap refactoring with adding a @pgmap argument and moving
parts of vmemmap_populate_base_pages(). (Patches 5 and 6 are new as a result)
* Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
* Improve memmap_init_zone_device() to initialize compound pages when
struct pages are cache warm. That led to an even further speedup from the
RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
as a result (Dan)
* Remove PGMAP_COMPOUND and use @align as the property to detect whether
or not to reuse vmemmap areas (Dan)
Thanks,
Joao
Joao Martins (11):
memory-failure: fetch compound_head after pgmap_pfn_valid()
mm/page_alloc: split prep_compound_page into head and tail subparts
mm/page_alloc: refactor memmap_init_zone_device() page init
mm/memremap: add ZONE_DEVICE support for compound pages
mm/sparse-vmemmap: add a pgmap argument to section activation
mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
mm/sparse-vmemmap: populate compound pagemaps
mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
mm/page_alloc: reuse tail struct pages for compound pagemaps
device-dax: compound pagemap support
mm/gup: grab head page refcount once for group of subpages
drivers/dax/device.c | 58 +++++++--
include/linux/memory_hotplug.h | 5 +-
include/linux/memremap.h | 13 ++
include/linux/mm.h | 8 +-
mm/gup.c | 52 +++++---
mm/memory-failure.c | 2 +
mm/memory_hotplug.c | 3 +-
mm/memremap.c | 9 +-
mm/page_alloc.c | 126 +++++++++++++------
mm/sparse-vmemmap.c | 221 +++++++++++++++++++++++++++++----
mm/sparse.c | 24 ++--
11 files changed, 406 insertions(+), 115 deletions(-)
--
2.17.1
[PATCH v19 0/8] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
@Andrew, this is based on v5.13-rc1, I can rebase whatever way you prefer.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
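A minimal usage sketch of the interface just described (assuming the
x86-64 syscall number 447 this series wires up and a kernel with secretmem
enabled; illustrative, not part of the selftests):

#include <sys/syscall.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447
#endif

int main(void)
{
        int fd = syscall(__NR_memfd_secret, 0);

        if (fd < 0 || ftruncate(fd, 4096) < 0)
                return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* pages faulted in here are removed from the kernel direct map */
        strcpy(p, "secret key material");

        munmap(p, 4096);
        close(fd);
        return 0;
}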
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.
It's designed to provide the following protections:
* Enhanced protection (in conjunction with all the other in-kernel
attack prevention systems) against ROP attacks. Secretmem makes "simple"
ROP insufficient to perform exfiltration, which increases the required
complexity of the attack. Along with other protections like the kernel
stack size limit and address space layout randomization which make finding
gadgets really hard, the absence of any in-kernel primitive for accessing
secret memory means the one gadget ROP attack can't work. Since the only
way to access secret memory is to reconstruct the missing mapping entry,
the attacker has to recover the physical page and insert a PTE pointing to
it in the kernel and then retrieve the contents. That takes at least three
gadgets which is a level of difficulty beyond most standard attacks.
* Prevent cross-process secret userspace memory exposures. Once the secret
memory is allocated, the user can't accidentally pass it into the kernel to
be transmitted somewhere. The secretmem pages cannot be accessed via the
direct map and they are disallowed in GUP.
* Harden against exploited kernel flaws. In order to access secretmem, a
kernel-side attack would need to either walk the page tables and create new
ones, or spawn a new privileged userspace process to perform secrets
exfiltration using ptrace.
In the future the secret mappings may be used as a means to protect guest memory
in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloa...
that does two things: the first is to act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory, meaning any secret
keys get automatically protected this way; the second is to expose the API
to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
Removing pages from the direct map may cause its fragmentation on
architectures that use large pages to map physical memory, which affects
system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.
In addition, there is also a long term goal to improve management of the
direct map.
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@lin...
v19:
* block /dev/mem mmap access, per David
* disallow mmap/mprotect with PROT_EXEC, per Kees
* simplify return in page_is_secretmem(), per Matthew
* use unsigned int for syscall flags, per Yury
v18: https://lore.kernel.org/lkml/20210303162209.8609-1-rppt@kernel.org
* rebase on v5.12-rc1
* merge kfence fix into the original patch
* massage commit message of the patch introducing the memfd_secret syscall
v17: https://lore.kernel.org/lkml/20210208084920.2884-1-rppt@kernel.org
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
v16: https://lore.kernel.org/lkml/20210121122723.3446-1-rppt@kernel.org
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
the userspace
v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
* Add riscv/Kconfig update to disable set_memory operations for nommu
builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
(patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems
Older history:
v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel...
Mike Rapoport (8):
mmap: make mlock_future_check() global
riscv/Kconfig: make direct map manipulation options depend on MMU
set_memory: allow set_direct_map_*_noflush() for multiple pages
set_memory: allow querying whether set_direct_map_*() is actually enabled
mm: introduce memfd_secret system call to create "secret" memory areas
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/kfence.h | 2 +-
arch/arm64/include/asm/set_memory.h | 17 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 23 +-
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/set_memory.h | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/riscv/mm/pageattr.c | 8 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/set_memory.h | 4 +-
arch/x86/mm/pat/set_memory.c | 8 +-
drivers/char/mem.c | 4 +
include/linux/secretmem.h | 54 ++++
include/linux/set_memory.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/power/snapshot.c | 4 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/gup.c | 12 +
mm/internal.h | 3 +
mm/mlock.c | 3 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 254 +++++++++++++++++++
mm/vmalloc.c | 5 +-
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 17 ++
38 files changed, 744 insertions(+), 46 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
--
2.28.0
[PATCH v3] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible
by Vaibhav Jain
Currently drc_pmem_query_stats() generates a dev_err in case the
"Enable Performance Information Collection" feature is disabled from the
HMC or performance stats are not available for an nvdimm. The error is
of the form below:
papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
performance stats, Err:-10
This error message confuses users as it implies a possible problem
with the nvdimm even though it's due to a disabled/unavailable
feature. We fix this by explicitly handling the H_AUTHORITY and
H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.
In case of an H_AUTHORITY error, an info message is logged instead of an
error, saying "Permission denied while accessing performance
stats", and -EPERM is returned.
In case of an H_UNSUPPORTED error, we return -EOPNOTSUPP from
drc_pmem_query_stats(), indicating that the performance-stats query
operation is not supported on this nvdimm.
Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.ibm.com>
---
Changelog
v3:
* Return EOPNOTSUPP error in case of H_UNSUPPORTED [ Ira ]
* Return EPERM in case of H_AUTHORITY [ Ira ]
* Updated patch description
v2:
* Updated the message logged in case of H_AUTHORITY error [ Ira ]
* Switched from dev_warn to dev_info in case of H_AUTHORITY error.
* Instead of -EPERM return -EACCES for H_AUTHORITY error.
* Added explicit handling of H_UNSUPPORTED error.
---
arch/powerpc/platforms/pseries/papr_scm.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..e2b69cc3beaf 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
dev_err(&p->pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+ } else if (rc == H_AUTHORITY) {
+ dev_info(&p->pdev->dev,
+ "Permission denied while accessing performance stats");
+ return -EPERM;
+ } else if (rc == H_UNSUPPORTED) {
+ dev_dbg(&p->pdev->dev, "Performance stats unsupported\n");
+ return -EOPNOTSUPP;
} else if (rc != H_SUCCESS) {
dev_err(&p->pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
--
2.31.1
[PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax
by Shiyang Ruan
This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.
Changes from V5:
- Fix the lock order of xfs_inode in xfs_mmaplock_two_inodes_and_break_dax_layout()
- move dax_remap_file_range_prep() to fs/dax.c
- change type of length to uint64_t in dax_iomap_cow_copy()
- fix mistake in dax_iomap_zero()
Changes from V4:
- Fix the mistake of breaking dax layout for two inodes
- Add CONFIG_FS_DAX judgement for fsdax code in remap_range.c
- Fix other small problems and mistakes
One of the key mechanisms that needs to be implemented in fsdax is CoW: copy
the data from the srcmap before we actually write data to the destination
iomap, and only copy the range in which data won't be changed.
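A toy userspace model of that rule (illustrative; the real helper in this
series is dax_iomap_cow_copy()): for a write to [pos, pos+len) within a
block, only the untouched head and tail are copied from the source:

#include <string.h>
#include <stddef.h>

static void cow_copy_around(char *dst, const char *src, size_t blksz,
                            size_t pos, size_t len)
{
        size_t end = pos + len;

        memcpy(dst, src, pos);                   /* head: [0, pos)     */
        if (end < blksz)                         /* tail: [end, blksz) */
                memcpy(dst + end, src + end, blksz - end);
}

The write itself then lands directly in dst, so the modified range is never
copied twice.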
Another mechanism is range comparison. In the page cache case, readpage()
is used to load on-disk data into the page cache in order to be able to
compare data. In the fsdax case, readpage() does not work, so we need
another way to compare data, one with direct access support.
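Since both ranges are directly addressable under DAX, the compare can
reduce to a plain memcmp over the mapped ranges rather than a
readpage()-then-compare dance; a toy model (illustrative only):

#include <stdbool.h>
#include <string.h>
#include <stddef.h>

static bool dax_ranges_equal(const void *a, const void *b, size_t len)
{
        /* both ranges are directly addressable, no page cache needed */
        return memcmp(a, b, len) == 0;
}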
With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
Some of the patches are picked up from Goldwyn's patchset. I made some
changes to adapt to this patchset.
(Rebased on v5.13-rc2 and patchset[1])
[1]: https://lkml.org/lkml/2021/4/22/575
Shiyang Ruan (7):
fsdax: Introduce dax_iomap_cow_copy()
fsdax: Replace mmap entry in case of CoW
fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
iomap: Introduce iomap_apply2() for operations on two files
fsdax: Dedup file range to use a compare function
fs/xfs: Handle CoW for fsdax write() path
fs/xfs: Add dax dedupe support
fs/dax.c | 216 ++++++++++++++++++++++++++++++++++++-----
fs/iomap/apply.c | 52 ++++++++++
fs/iomap/buffered-io.c | 2 +-
fs/remap_range.c | 36 +++++--
fs/xfs/xfs_bmap_util.c | 3 +-
fs/xfs/xfs_file.c | 11 +--
fs/xfs/xfs_inode.c | 57 +++++++++++
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 38 +++++++-
fs/xfs/xfs_iomap.h | 24 +++++
fs/xfs/xfs_iops.c | 7 +-
fs/xfs/xfs_reflink.c | 15 +--
include/linux/dax.h | 11 ++-
include/linux/fs.h | 12 ++-
include/linux/iomap.h | 7 +-
15 files changed, 431 insertions(+), 61 deletions(-)
--
2.31.1
[PATCH v20 0/7] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
@Andrew, this is based on v5.13-rc1, I can rebase whatever way you prefer.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
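Complementing the usage sketch under the v19 posting above, the v19 change
"disallow mmap/mprotect with PROT_EXEC" means an executable secret mapping
is expected to be rejected; a small illustrative check (again assuming the
x86-64 syscall number 447):

#include <sys/syscall.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447
#endif

int main(void)
{
        int fd = syscall(__NR_memfd_secret, 0);

        if (fd < 0 || ftruncate(fd, 4096) < 0)
                return 1;

        /* PROT_EXEC is disallowed for secretmem mappings as of v19 */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);

        return p == MAP_FAILED ? 0 : 1;
}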
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.
It's designed to provide the following protections:
* Enhanced protection (in conjunction with all the other in-kernel
attack prevention systems) against ROP attacks. Secretmem makes "simple"
ROP insufficient to perform exfiltration, which increases the required
complexity of the attack. Along with other protections like the kernel
stack size limit and address space layout randomization which make finding
gadgets really hard, the absence of any in-kernel primitive for accessing
secret memory means the one gadget ROP attack can't work. Since the only
way to access secret memory is to reconstruct the missing mapping entry,
the attacker has to recover the physical page and insert a PTE pointing to
it in the kernel and then retrieve the contents. That takes at least three
gadgets which is a level of difficulty beyond most standard attacks.
* Prevent cross-process secret userspace memory exposures. Once the secret
memory is allocated, the user can't accidentally pass it into the kernel to
be transmitted somewhere. The secretmem pages cannot be accessed via the
direct map and they are disallowed in GUP.
* Harden against exploited kernel flaws. In order to access secretmem, a
kernel-side attack would need to either walk the page tables and create new
ones, or spawn a new privileged userspace process to perform secrets
exfiltration using ptrace.
In the future the secret mappings may be used as a means to protect guest memory
in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloa...
that does two things: the first is to act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory, meaning any secret
keys get automatically protected this way; the second is to expose the API
to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
Removing pages from the direct map may cause its fragmentation on
architectures that use large pages to map physical memory, which affects
system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.
In addition, there is also a long term goal to improve management of the
direct map.
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@lin...
v20:
* Drop the patch that enabled multi-page updates to the direct map, per David
* Drop the changes to /dev/mem, they have no effect anyway when CONFIG_STRICT_DEVMEM=y
* Add Acked-by and Reviewed-by tags
v19: https://lore.kernel.org/lkml/20210513184734.29317-1-rppt@kernel.org
* block /dev/mem mmap access, per David
* disallow mmap/mprotect with PROT_EXEC, per Kees
* simplify return in page_is_secretmem(), per Matthew
* use unsigned int for syscall flags, per Yury
v18: https://lore.kernel.org/lkml/20210303162209.8609-1-rppt@kernel.org
* rebase on v5.12-rc1
* merge kfence fix into the original patch
* massage commit message of the patch introducing the memfd_secret syscall
v17: https://lore.kernel.org/lkml/20210208084920.2884-1-rppt@kernel.org
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
v16: https://lore.kernel.org/lkml/20210121122723.3446-1-rppt@kernel.org
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
the userspace
Older history:
v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel...
Mike Rapoport (7):
mmap: make mlock_future_check() global
riscv/Kconfig: make direct map manipulation options depend on MMU
set_memory: allow querying whether set_direct_map_*() is actually
enabled
mm: introduce memfd_secret system call to create "secret" memory areas
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/kfence.h | 2 +-
arch/arm64/include/asm/set_memory.h | 17 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 13 +-
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/secretmem.h | 54 ++++
include/linux/set_memory.h | 12 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 5 +
mm/Makefile | 1 +
mm/gup.c | 12 +
mm/internal.h | 3 +
mm/mlock.c | 3 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 254 +++++++++++++++++++
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 17 ++
31 files changed, 716 insertions(+), 24 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
--
2.28.0
[ndctl PATCH] ndctl: Update nvdimm mailing list address
by Vishal Verma
The 'nvdimm' mailing list has moved from lists.01.org to
lists.linux.dev. Update CONTRIBUTING.md and configure.ac to reflect
this.
Cc: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
configure.ac | 2 +-
CONTRIBUTING.md | 7 ++++---
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/configure.ac b/configure.ac
index 5ec8d2f..dc39dbe 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@ AC_PREREQ(2.60)
m4_include([version.m4])
AC_INIT([ndctl],
GIT_VERSION,
- [linux-nvdimm(a)lists.01.org],
+ [nvdimm(a)lists.linux.dev],
[ndctl],
[https://github.com/pmem/ndctl])
AC_CONFIG_SRCDIR([ndctl/lib/libndctl.c])
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 4c29d31..4f4865d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,13 +6,14 @@ The following is a set of guidelines that we adhere to, and request that
contributors follow.
1. The libnvdimm (kernel subsystem) and ndctl developers primarily use
- the [linux-nvdimm](https://lists.01.org/postorius/lists/linux-nvdimm.lists.01....
+ the [nvdimm](https://subspace.kernel.org/lists.linux.dev.html)
mailing list for everything. It is recommended to send patches to
- **```linux-nvdimm(a)lists.01.org```**
+ **```nvdimm(a)lists.linux.dev```**
+ An archive is available on [lore](https://lore.kernel.org/nvdimm/)
1. Github [issues](https://github.com/pmem/ndctl/issues) are an acceptable
way to report a problem, but if you just have a question,
- [email](mailto:linux-nvdimm@lists.01.org) the above list.
+ [email](mailto:nvdimm@lists.linux.dev) the above list.
1. We follow the Linux Kernel [Coding Style Guide][cs] as applicable.
base-commit: a2a6fda4d7e93044fca4c67870d2ff7e193d3cf1
prerequisite-patch-id: 8fc5baaf64b312b2459acea255740f79a23b76cd
--
2.31.1
[ndctl PATCH] libndctl/papr: Fix probe for papr-scm compatible nvdimms
by Vaibhav Jain
With the recent changes introduced to unify the PAPR and NFIT
families, the probe for papr-scm nvdimms is broken since they don't
expose the 'handle' or 'phys_id' sysfs attributes. These attributes are
only exposed by NFIT and 'nvdimm_test' nvdimms. Since being unable to read
these sysfs attributes is a non-recoverable error, this prevents
probing of 'PAPR-SCM' nvdimms and ndctl reports the following error:
$ sudo NDCTL_LOG=debug ndctl list -DH
libndctl: ndctl_new: ctx 0x10015342c70 created
libndctl: add_dimm: nmem1: probe failed: Operation not permitted
libndctl: __sysfs_device_parse: nmem1: add_dev() failed
libndctl: add_dimm: nmem0: probe failed: Operation not permitted
libndctl: __sysfs_device_parse: nmem0: add_dev() failed
Fixing this bug is complicated by the fact that these attributes are needed
by the 'nvdimm_test' nvdimms, which also use the
NVDIMM_FAMILY_PAPR family. Adding a two-way comparison for these two
attributes in populate_dimm_attributes() to distinguish between
'nvdimm_test' and papr-scm nvdimms would be clunky and make future
updates to populate_dimm_attributes() error prone.
So, this patch proposes to fix the issue by re-introducing
add_papr_dimm() to probe both papr-scm and 'nvdimm_test' nvdimms. The
'compatible' sysfs attribute associated with the PAPR device is used
to distinguish between the two nvdimm types, and in case an
'nvdimm_test' device is detected, its probe is forwarded to
populate_dimm_attributes().
Fixes: daef3a386a9c ("libndctl: Unify adding dimms for papr and nfit families")
Signed-off-by: Vaibhav Jain <vaibhav(a)linux.ibm.com>
---
ndctl/lib/libndctl.c | 57 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 55 insertions(+), 2 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index bf0968cce93f..0417720ccd7e 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -1757,6 +1757,58 @@ static int populate_dimm_attributes(struct ndctl_dimm *dimm,
return rc;
}
+static int add_papr_dimm(struct ndctl_dimm *dimm, const char *dimm_base)
+{
+ int rc = -ENODEV;
+ char buf[SYSFS_ATTR_SIZE];
+ struct ndctl_ctx *ctx = dimm->bus->ctx;
+ char *path = calloc(1, strlen(dimm_base) + 100);
+ const char * const devname = ndctl_dimm_get_devname(dimm);
+
+ dbg(ctx, "%s: Probing of_pmem dimm at %s\n", devname, dimm_base);
+
+ if (!path)
+ return -ENOMEM;
+
+ /* Check the compatibility of the probed nvdimm */
+ sprintf(path, "%s/../of_node/compatible", dimm_base);
+ if (sysfs_read_attr(ctx, path, buf) < 0) {
+ dbg(ctx, "%s: Unable to read compatible field\n", devname);
+ rc = -ENODEV;
+ goto out;
+ }
+
+ dbg(ctx, "%s:Compatible of_pmem = '%s'\n", devname, buf);
+
+ /* Probe for papr-scm memory */
+ if (strcmp(buf, "ibm,pmemory") == 0) {
+ /* Read the dimm flags file */
+ sprintf(path, "%s/papr/flags", dimm_base);
+ if (sysfs_read_attr(ctx, path, buf) < 0) {
+ rc = -errno;
+ err(ctx, "%s: Unable to read dimm-flags\n", devname);
+ goto out;
+ }
+
+ dbg(ctx, "%s: Adding papr-scm dimm flags:\"%s\"\n", devname, buf);
+ dimm->cmd_family = NVDIMM_FAMILY_PAPR;
+
+ /* Parse dimm flags */
+ parse_papr_flags(dimm, buf);
+
+ /* Allocate monitor mode fd */
+ dimm->health_eventfd = open(path, O_RDONLY|O_CLOEXEC);
+ rc = 0;
+
+ } else if (strcmp(buf, "nvdimm_test") == 0) {
+ /* probe via common populate_dimm_attributes() */
+ rc = populate_dimm_attributes(dimm, dimm_base, "papr");
+ }
+out:
+ free(path);
+ return rc;
+}
+
static void *add_dimm(void *parent, int id, const char *dimm_base)
{
int formats, i, rc = -ENODEV;
@@ -1848,8 +1900,9 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
/* Check if the given dimm supports nfit */
if (ndctl_bus_has_nfit(bus)) {
rc = populate_dimm_attributes(dimm, dimm_base, "nfit");
- } else if (ndctl_bus_has_of_node(bus))
- rc = populate_dimm_attributes(dimm, dimm_base, "papr");
+ } else if (ndctl_bus_has_of_node(bus)) {
+ rc = add_papr_dimm(dimm, dimm_base);
+ }
if (rc == -ENODEV) {
/* Unprobed dimm with no family */
--
2.31.1
[ndctl v2 1/4] libndctl: Rename dimm property nfit_dsm_mask for generic use
by Santosh Sivaraj
From: Shivaprasad G Bhat <sbhat(a)linux.vnet.ibm.com>
The dimm-specific DSM mask can be used by different platforms.
Rename it to dsm_mask to avoid confusion.
Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.vnet.ibm.com>
---
ndctl/lib/libndctl.c | 4 ++--
ndctl/lib/private.h | 6 +++---
2 files changed, 5 insertions(+), 5 deletions(-)
Resending both the SMART test and error injection patches as one series. Will be
easier for review and testing.
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index bf0968c..a148438 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -1728,7 +1728,7 @@ static int populate_dimm_attributes(struct ndctl_dimm *dimm,
sprintf(path, "%s/%s/dsm_mask", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
- dimm->nfit_dsm_mask = strtoul(buf, NULL, 0);
+ dimm->dsm_mask = strtoul(buf, NULL, 0);
sprintf(path, "%s/%s/format", dimm_base, bus_prefix);
if (sysfs_read_attr(ctx, path, buf) == 0)
@@ -1821,7 +1821,7 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
dimm->manufacturing_date = -1;
dimm->manufacturing_location = -1;
dimm->cmd_family = -1;
- dimm->nfit_dsm_mask = ULONG_MAX;
+ dimm->dsm_mask = ULONG_MAX;
for (i = 0; i < formats; i++)
dimm->format[i] = -1;
diff --git a/ndctl/lib/private.h b/ndctl/lib/private.h
index 8f4510e..53fae0f 100644
--- a/ndctl/lib/private.h
+++ b/ndctl/lib/private.h
@@ -68,7 +68,7 @@ struct ndctl_dimm {
unsigned char manufacturing_location;
unsigned long cmd_family;
unsigned long cmd_mask;
- unsigned long nfit_dsm_mask;
+ unsigned long dsm_mask;
long long dirty_shutdown;
enum ndctl_fwa_state fwa_state;
enum ndctl_fwa_result fwa_result;
@@ -105,9 +105,9 @@ enum dsm_support {
static inline enum dsm_support test_dimm_dsm(struct ndctl_dimm *dimm, int fn)
{
- if (dimm->nfit_dsm_mask == ULONG_MAX) {
+ if (dimm->dsm_mask == ULONG_MAX) {
return DIMM_DSM_UNKNOWN;
- } else if (dimm->nfit_dsm_mask & (1 << fn))
+ } else if (dimm->dsm_mask & (1 << fn))
return DIMM_DSM_SUPPORTED;
return DIMM_DSM_UNSUPPORTED;
}
--
2.31.1
[v2 1/2] tests/nvdimm/ndtest: Enable smart tests
by Santosh Sivaraj
From: Shivaprasad G Bhat <sbhat(a)linux.vnet.ibm.com>
The patch adds the necessary health-related DSM command implementations for
the ndctl inject-smart and monitor tests to pass.
Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.vnet.ibm.com>
---
tools/testing/nvdimm/test/ndtest.c | 258 +++++++++++++++++++++++++++++
tools/testing/nvdimm/test/ndtest.h | 129 +++++++++++++++
2 files changed, 387 insertions(+)
diff --git a/tools/testing/nvdimm/test/ndtest.c b/tools/testing/nvdimm/test/ndtest.c
index 6862915f1fb0..bb47b145466d 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -30,6 +30,8 @@ enum {
((1ul << ND_CMD_GET_CONFIG_SIZE) | \
(1ul << ND_CMD_GET_CONFIG_DATA) | \
(1ul << ND_CMD_SET_CONFIG_DATA) | \
+ (1ul << ND_CMD_SMART_THRESHOLD) | \
+ (1uL << ND_CMD_SMART) | \
(1ul << ND_CMD_CALL))
#define NFIT_DIMM_HANDLE(node, socket, imc, chan, dimm) \
@@ -41,6 +43,21 @@ static struct ndtest_priv *instances[NUM_INSTANCES];
static struct class *ndtest_dimm_class;
static struct gen_pool *ndtest_pool;
+static const struct nd_papr_pdsm_health health_defaults = {
+ .dimm_unarmed = 0,
+ .dimm_bad_shutdown = 0,
+ .dimm_health = PAPR_PDSM_DIMM_UNHEALTHY,
+ .extension_flags = PDSM_DIMM_HEALTH_MEDIA_TEMPERATURE_VALID | PDSM_DIMM_HEALTH_ALARM_VALID |
+ PDSM_DIMM_HEALTH_CTRL_TEMPERATURE_VALID | PDSM_DIMM_HEALTH_SPARES_VALID |
+ PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+ .dimm_fuel_gauge = 95,
+ .media_temperature = 23 * 16,
+ .ctrl_temperature = 25 * 16,
+ .spares = 75,
+ .alarm_flags = ND_PAPR_HEALTH_SPARE_TRIP |
+ ND_PAPR_HEALTH_TEMP_TRIP,
+};
+
static struct ndtest_dimm dimm_group1[] = {
{
.size = DIMM_SIZE,
@@ -48,6 +65,16 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1e5c75d2-b618-11ea-9aa3-507b9ddc0f72",
.physical_id = 0,
.num_formats = 2,
+ .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+ .extension_flags = health_defaults.extension_flags,
+ .dimm_fuel_gauge = health_defaults.dimm_fuel_gauge,
+ .media_temperature = health_defaults.media_temperature,
+ .ctrl_temperature = health_defaults.ctrl_temperature,
+ .spares = health_defaults.spares,
+ .alarm_flags = health_defaults.alarm_flags,
+ .media_temperature_threshold = 40 * 16,
+ .ctrl_temperature_threshold = 30 * 16,
+ .spares_threshold = 5,
},
{
.size = DIMM_SIZE,
@@ -55,6 +82,16 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1c4d43ac-b618-11ea-be80-507b9ddc0f72",
.physical_id = 1,
.num_formats = 2,
+ .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+ .extension_flags = health_defaults.extension_flags,
+ .dimm_fuel_gauge = health_defaults.dimm_fuel_gauge,
+ .media_temperature = health_defaults.media_temperature,
+ .ctrl_temperature = health_defaults.ctrl_temperature,
+ .spares = health_defaults.spares,
+ .alarm_flags = health_defaults.alarm_flags,
+ .media_temperature_threshold = 40 * 16,
+ .ctrl_temperature_threshold = 30 * 16,
+ .spares_threshold = 5,
},
{
.size = DIMM_SIZE,
@@ -62,6 +99,16 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "a9f17ffc-b618-11ea-b36d-507b9ddc0f72",
.physical_id = 2,
.num_formats = 2,
+ .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+ .extension_flags = health_defaults.extension_flags,
+ .dimm_fuel_gauge = health_defaults.dimm_fuel_gauge,
+ .media_temperature = health_defaults.media_temperature,
+ .ctrl_temperature = health_defaults.ctrl_temperature,
+ .spares = health_defaults.spares,
+ .alarm_flags = health_defaults.alarm_flags,
+ .media_temperature_threshold = 40 * 16,
+ .ctrl_temperature_threshold = 30 * 16,
+ .spares_threshold = 5,
},
{
.size = DIMM_SIZE,
@@ -69,6 +116,16 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "b6b83b22-b618-11ea-8aae-507b9ddc0f72",
.physical_id = 3,
.num_formats = 2,
+ .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+ .extension_flags = health_defaults.extension_flags,
+ .dimm_fuel_gauge = health_defaults.dimm_fuel_gauge,
+ .media_temperature = health_defaults.media_temperature,
+ .ctrl_temperature = health_defaults.ctrl_temperature,
+ .spares = health_defaults.spares,
+ .alarm_flags = health_defaults.alarm_flags,
+ .media_temperature_threshold = 40 * 16,
+ .ctrl_temperature_threshold = 30 * 16,
+ .spares_threshold = 5,
},
{
.size = DIMM_SIZE,
@@ -296,6 +353,172 @@ static int ndtest_get_config_size(struct ndtest_dimm *dimm, unsigned int buf_len
return 0;
}
+static int ndtest_pdsm_health(struct ndtest_dimm *dimm,
+ union nd_pdsm_payload *payload,
+ unsigned int buf_len)
+{
+ struct nd_papr_pdsm_health *health = &payload->health;
+
+ if (buf_len < sizeof(*health))
+ return -EINVAL;
+
+ health->extension_flags = 0;
+ health->dimm_unarmed = !!(dimm->flags & PAPR_PMEM_UNARMED_MASK);
+ health->dimm_bad_shutdown = !!(dimm->flags & PAPR_PMEM_BAD_SHUTDOWN_MASK);
+ health->dimm_bad_restore = !!(dimm->flags & PAPR_PMEM_BAD_RESTORE_MASK);
+ health->dimm_health = PAPR_PDSM_DIMM_HEALTHY;
+
+ if (dimm->flags & PAPR_PMEM_HEALTH_FATAL)
+ health->dimm_health = PAPR_PDSM_DIMM_FATAL;
+ else if (dimm->flags & PAPR_PMEM_HEALTH_CRITICAL)
+ health->dimm_health = PAPR_PDSM_DIMM_CRITICAL;
+ else if (dimm->flags & PAPR_PMEM_HEALTH_UNHEALTHY ||
+ dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL)
+ health->dimm_health = PAPR_PDSM_DIMM_UNHEALTHY;
+
+ health->extension_flags = 0;
+ if (dimm->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID) {
+ health->dimm_fuel_gauge = dimm->dimm_fuel_gauge;
+ health->extension_flags |= PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+ }
+ if (dimm->extension_flags & PDSM_DIMM_HEALTH_MEDIA_TEMPERATURE_VALID) {
+ health->media_temperature = dimm->media_temperature;
+ health->extension_flags |= PDSM_DIMM_HEALTH_MEDIA_TEMPERATURE_VALID;
+ }
+ if (dimm->extension_flags & PDSM_DIMM_HEALTH_CTRL_TEMPERATURE_VALID) {
+ health->ctrl_temperature = dimm->ctrl_temperature;
+ health->extension_flags |= PDSM_DIMM_HEALTH_CTRL_TEMPERATURE_VALID;
+ }
+ if (dimm->extension_flags & PDSM_DIMM_HEALTH_SPARES_VALID) {
+ health->spares = dimm->spares;
+ health->extension_flags |= PDSM_DIMM_HEALTH_SPARES_VALID;
+ }
+ if (dimm->extension_flags & PDSM_DIMM_HEALTH_ALARM_VALID) {
+ health->alarm_flags = dimm->alarm_flags;
+ health->extension_flags |= PDSM_DIMM_HEALTH_ALARM_VALID;
+ }
+
+ return 0;
+}
+
+static void smart_notify(struct ndtest_dimm *dimm)
+{
+ struct device *bus = dimm->dev->parent;
+
+ if (((dimm->alarm_flags & ND_PAPR_HEALTH_SPARE_TRIP) &&
+ dimm->spares <= dimm->spares_threshold) ||
+ ((dimm->alarm_flags & ND_PAPR_HEALTH_TEMP_TRIP) &&
+ dimm->media_temperature >= dimm->media_temperature_threshold) ||
+ ((dimm->alarm_flags & ND_PAPR_HEALTH_CTEMP_TRIP) &&
+ dimm->ctrl_temperature >= dimm->ctrl_temperature_threshold) ||
+ !(dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL) ||
+ (dimm->flags & PAPR_PMEM_BAD_SHUTDOWN_MASK)) {
+ device_lock(bus);
+ /* send smart notification */
+ if (dimm->notify_handle)
+ sysfs_notify_dirent(dimm->notify_handle);
+ device_unlock(bus);
+ }
+}
+
+static int ndtest_pdsm_health_inject(struct ndtest_dimm *dimm,
+ union nd_pdsm_payload *payload,
+ unsigned int buf_len)
+{
+ struct nd_papr_pdsm_health_inject *inj = &payload->inject;
+
+ if (buf_len < sizeof(*inj))
+ return -EINVAL;
+
+ if (inj->flags & ND_PAPR_HEALTH_INJECT_MTEMP) {
+ if (inj->mtemp_enable)
+ dimm->media_temperature = inj->media_temperature;
+ else
+ dimm->media_temperature = health_defaults.media_temperature;
+ }
+ if (inj->flags & ND_PAPR_HEALTH_INJECT_SPARE) {
+ if (inj->spares_enable)
+ dimm->spares = inj->spares;
+ else
+ dimm->spares = health_defaults.spares;
+ }
+ if (inj->flags & ND_PAPR_HEALTH_INJECT_FATAL) {
+ if (inj->fatal_enable)
+ dimm->flags |= PAPR_PMEM_HEALTH_FATAL;
+ else
+ dimm->flags &= ~PAPR_PMEM_HEALTH_FATAL;
+ }
+ if (inj->flags & ND_PAPR_HEALTH_INJECT_SHUTDOWN) {
+ if (inj->unsafe_shutdown_enable)
+ dimm->flags |= PAPR_PMEM_SHUTDOWN_DIRTY;
+ else
+ dimm->flags &= ~PAPR_PMEM_SHUTDOWN_DIRTY;
+ }
+ smart_notify(dimm);
+ inj->status = 0;
+
+ return 0;
+}
+
+static int ndtest_pdsm_health_threshold(struct ndtest_dimm *dimm,
+ union nd_pdsm_payload *payload,
+ unsigned int buf_len)
+{
+ struct nd_papr_pdsm_health_threshold *threshold = &payload->threshold;
+
+ if (buf_len < sizeof(*threshold))
+ return -EINVAL;
+
+ threshold->media_temperature = dimm->media_temperature_threshold;
+ threshold->ctrl_temperature = dimm->ctrl_temperature_threshold;
+ threshold->spares = dimm->spares_threshold;
+ threshold->alarm_control = dimm->alarm_flags;
+
+ return 0;
+}
+
+static int ndtest_pdsm_health_set_threshold(struct ndtest_dimm *dimm,
+ union nd_pdsm_payload *payload,
+ unsigned int buf_len)
+{
+ struct nd_papr_pdsm_health_threshold *threshold = &payload->threshold;
+
+ if (buf_len < sizeof(*threshold))
+ return -EINVAL;
+
+ dimm->media_temperature_threshold = threshold->media_temperature;
+ dimm->ctrl_temperature_threshold = threshold->ctrl_temperature;
+ dimm->spares_threshold = threshold->spares;
+ dimm->alarm_flags = threshold->alarm_control;
+
+ smart_notify(dimm);
+
+ return 0;
+}
+
+static int ndtest_dimm_cmd_call(struct ndtest_dimm *dimm, unsigned int buf_len,
+ void *buf)
+{
+ struct nd_cmd_pkg *call_pkg = buf;
+ unsigned int len = call_pkg->nd_size_in + call_pkg->nd_size_out;
+ struct nd_pkg_pdsm *pdsm = (struct nd_pkg_pdsm *) call_pkg->nd_payload;
+ union nd_pdsm_payload *payload = &(pdsm->payload);
+ unsigned int func = call_pkg->nd_command;
+
+ switch (func) {
+ case PAPR_PDSM_HEALTH:
+ return ndtest_pdsm_health(dimm, payload, len);
+ case PAPR_PDSM_HEALTH_INJECT:
+ return ndtest_pdsm_health_inject(dimm, payload, len);
+ case PAPR_PDSM_HEALTH_THRESHOLD:
+ return ndtest_pdsm_health_threshold(dimm, payload, len);
+ case PAPR_PDSM_HEALTH_THRESHOLD_SET:
+ return ndtest_pdsm_health_set_threshold(dimm, payload, len);
+ }
+
+ return 0;
+}
+
static int ndtest_ctl(struct nvdimm_bus_descriptor *nd_desc,
struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc)
@@ -325,6 +548,9 @@ static int ndtest_ctl(struct nvdimm_bus_descriptor *nd_desc,
case ND_CMD_SET_CONFIG_DATA:
*cmd_rc = ndtest_config_set(dimm, buf_len, buf);
break;
+ case ND_CMD_CALL:
+ *cmd_rc = ndtest_dimm_cmd_call(dimm, buf_len, buf);
+ break;
default:
return -EINVAL;
}
@@ -826,6 +1052,20 @@ static ssize_t flags_show(struct device *dev,
}
static DEVICE_ATTR_RO(flags);
+#define PAPR_PMEM_DIMM_CMD_MASK \
+ ((1U << PAPR_PDSM_HEALTH) \
+ | (1U << PAPR_PDSM_HEALTH_INJECT) \
+ | (1U << PAPR_PDSM_HEALTH_THRESHOLD) \
+ | (1U << PAPR_PDSM_HEALTH_THRESHOLD_SET))
+
+
+static ssize_t dsm_mask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%#x\n", PAPR_PMEM_DIMM_CMD_MASK);
+}
+static DEVICE_ATTR_RO(dsm_mask);
+
static struct attribute *ndtest_nvdimm_attributes[] = {
&dev_attr_nvdimm_show_handle.attr,
&dev_attr_vendor.attr,
@@ -837,6 +1077,7 @@ static struct attribute *ndtest_nvdimm_attributes[] = {
&dev_attr_format.attr,
&dev_attr_format1.attr,
&dev_attr_flags.attr,
+ &dev_attr_dsm_mask.attr,
NULL,
};
@@ -856,6 +1097,7 @@ static int ndtest_dimm_register(struct ndtest_priv *priv,
{
struct device *dev = &priv->pdev.dev;
unsigned long dimm_flags = dimm->flags;
+ struct kernfs_node *papr_kernfs;
if (dimm->num_formats > 1) {
set_bit(NDD_ALIASING, &dimm_flags);
@@ -882,6 +1124,20 @@ static int ndtest_dimm_register(struct ndtest_priv *priv,
return -ENOMEM;
}
+ nd_synchronize();
+
+ papr_kernfs = sysfs_get_dirent(nvdimm_kobj(dimm->nvdimm)->sd, "papr");
+ if (!papr_kernfs) {
+ pr_err("Could not initialize the notifier handle\n");
+ return 0;
+ }
+
+ dimm->notify_handle = sysfs_get_dirent(papr_kernfs, "flags");
+ sysfs_put(papr_kernfs);
+ if (!dimm->notify_handle) {
+ pr_err("Could not initialize the notifier handle\n");
+ return 0;
+ }
return 0;
}
@@ -953,6 +1209,8 @@ static int ndtest_bus_register(struct ndtest_priv *p)
p->bus_desc.provider_name = NULL;
p->bus_desc.attr_groups = ndtest_attribute_groups;
+ set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
+
p->bus = nvdimm_bus_register(&p->pdev.dev, &p->bus_desc);
if (!p->bus) {
dev_err(&p->pdev.dev, "Error creating nvdimm bus %pOF\n", p->dn);
diff --git a/tools/testing/nvdimm/test/ndtest.h b/tools/testing/nvdimm/test/ndtest.h
index 2c54c9cbb90c..d29638b6a332 100644
--- a/tools/testing/nvdimm/test/ndtest.h
+++ b/tools/testing/nvdimm/test/ndtest.h
@@ -16,6 +16,8 @@
#define PAPR_PMEM_HEALTH_FATAL (1ULL << (63 - 5))
/* SCM contents cannot persist due to current platform health status */
#define PAPR_PMEM_HEALTH_UNHEALTHY (1ULL << (63 - 6))
+/* SCM device is unable to persist memory contents in certain conditions */
+#define PAPR_PMEM_HEALTH_NON_CRITICAL (1ULL << (63 - 7))
/* Bits status indicators for health bitmap indicating unarmed dimm */
#define PAPR_PMEM_UNARMED_MASK (PAPR_PMEM_UNARMED | \
@@ -38,6 +40,49 @@
struct ndtest_config;
+/* DIMM Health extension flag bits */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID (1 << 0)
+#define PDSM_DIMM_HEALTH_MEDIA_TEMPERATURE_VALID (1 << 1)
+#define PDSM_DIMM_HEALTH_CTRL_TEMPERATURE_VALID (1 << 2)
+#define PDSM_DIMM_HEALTH_SHUTDOWN_COUNT_VALID (1 << 3)
+#define PDSM_DIMM_HEALTH_SPARES_VALID (1 << 4)
+#define PDSM_DIMM_HEALTH_ALARM_VALID (1 << 5)
+
+#define PAPR_PDSM_DIMM_HEALTHY 0
+
+#define ND_PAPR_HEALTH_SPARE_TRIP (1 << 0)
+#define ND_PAPR_HEALTH_TEMP_TRIP (1 << 1)
+#define ND_PAPR_HEALTH_CTEMP_TRIP (1 << 2)
+
+/* DIMM Health inject flag bits */
+#define ND_PAPR_HEALTH_INJECT_MTEMP (1 << 0)
+#define ND_PAPR_HEALTH_INJECT_SPARE (1 << 1)
+#define ND_PAPR_HEALTH_INJECT_FATAL (1 << 2)
+#define ND_PAPR_HEALTH_INJECT_SHUTDOWN (1 << 3)
+
+/* Various nvdimm health indicators */
+#define PAPR_PDSM_DIMM_HEALTHY 0
+#define PAPR_PDSM_DIMM_UNHEALTHY 1
+#define PAPR_PDSM_DIMM_CRITICAL 2
+#define PAPR_PDSM_DIMM_FATAL 3
+
+enum papr_pdsm {
+ PAPR_PDSM_MIN = 0x0,
+ PAPR_PDSM_HEALTH,
+ PAPR_PDSM_INJECT_SET = 11,
+ PAPR_PDSM_INJECT_CLEAR = 12,
+ PAPR_PDSM_INJECT_GET = 13,
+ PAPR_PDSM_HEALTH_INJECT = 14,
+ PAPR_PDSM_HEALTH_THRESHOLD = 15,
+ PAPR_PDSM_HEALTH_THRESHOLD_SET = 16,
+ PAPR_PDSM_MAX,
+};
+
+enum dimm_type {
+ NDTEST_REGION_TYPE_PMEM = 0x0,
+ NDTEST_REGION_TYPE_BLK = 0x1,
+};
+
struct ndtest_priv {
struct platform_device pdev;
struct device_node *dn;
@@ -80,6 +125,21 @@ struct ndtest_dimm {
int id;
int fail_cmd_code;
u8 no_alias;
+
+ struct kernfs_node *notify_handle;
+
+ /* SMART Health information */
+ unsigned long long extension_flags;
+ __u16 dimm_fuel_gauge;
+ __u16 media_temperature;
+ __u16 ctrl_temperature;
+ __u8 spares;
+ __u8 alarm_flags;
+
+ /* SMART Health thresholds */
+ __u16 media_temperature_threshold;
+ __u16 ctrl_temperature_threshold;
+ __u8 spares_threshold;
};
struct ndtest_mapping {
@@ -106,4 +166,73 @@ struct ndtest_config {
u8 num_regions;
};
+#define ND_PDSM_PAYLOAD_MAX_SIZE 184
+
+struct nd_papr_pdsm_health {
+ union {
+ struct {
+ __u32 extension_flags;
+ __u8 dimm_unarmed;
+ __u8 dimm_bad_shutdown;
+ __u8 dimm_bad_restore;
+ __u8 dimm_scrubbed;
+ __u8 dimm_locked;
+ __u8 dimm_encrypted;
+ __u16 dimm_health;
+
+ /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+ __u16 dimm_fuel_gauge;
+ __u16 media_temperature;
+ __u16 ctrl_temperature;
+ __u8 spares;
+ __u16 alarm_flags;
+ };
+ __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+ };
+};
+
+struct nd_papr_pdsm_health_threshold {
+ union {
+ struct {
+ __u16 alarm_control;
+ __u8 spares;
+ __u16 media_temperature;
+ __u16 ctrl_temperature;
+ __u32 status;
+ };
+ __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+ };
+};
+
+struct nd_papr_pdsm_health_inject {
+ union {
+ struct {
+ __u64 flags;
+ __u8 mtemp_enable;
+ __u16 media_temperature;
+ __u8 ctemp_enable;
+ __u16 ctrl_temperature;
+ __u8 spares_enable;
+ __u8 spares;
+ __u8 fatal_enable;
+ __u8 unsafe_shutdown_enable;
+ __u32 status;
+ };
+ __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+ };
+};
+
+union nd_pdsm_payload {
+ struct nd_papr_pdsm_health health;
+ struct nd_papr_pdsm_health_inject inject;
+ struct nd_papr_pdsm_health_threshold threshold;
+ __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+} __packed;
+
+struct nd_pkg_pdsm {
+ __s32 cmd_status; /* Out: Sub-cmd status returned back */
+ __u16 reserved[2]; /* Ignored and to be set as '0' */
+ union nd_pdsm_payload payload;
+} __packed;
+
#endif /* NDTEST_H */
--
2.31.1