On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
From: Mike Rapoport <rppt(a)linux.ibm.com>
Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.
The secretmem feature is off by default and the user must explicitly enable
it at the boot time.
Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the owning mm.
Is this really true? I guess you meant to say that the memory will
visible only via page tables to anybody who can mmap the respective file
descriptor. There is nothing like an owning mm as the fd is inherently a
shareable resource and the ownership becomes a very vague and hard to
The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
I do not understand how it helps to remove the memory from the process
as the interface explicitly allows to add a memory that is removed from
all other processes via direct map.
there may be situations where sharing is useful and file descriptor
approach allows to seal the operations.
It would be great to expand on this some more.
As secret memory implementation is not an extension of tmpfs or
usage of a dedicated system call rather than hooking new functionality into
memfd_create(2) emphasises that memfd_secret(2) has different semantics and
allows better upwards compatibility.
What is this supposed to mean? What are differences?
The secret memory remains accessible in the process context using
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.
Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.
Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which affects
the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report  showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.
OK, this looks like a reasonable compromise for the initial
implementation. Documentation of the command line parameter should be
very explicit about this though.
The secretmem mappings are locked in memory so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
mlock() secretmem range would fail and mlockall() will ignore secretmem
What about munlock?
Pages in the secretmem regions are unevictable and unmovable to
accidental exposure of the sensitive data via swap or during page
A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.
The following example demonstrates creation of a secret mapping (error
handling is omitted):
fd = memfd_secret(0);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
Please also list usecases which you are aware of as well.
I am also missing some more information about the implementation. E.g.
does this memory live on an unevictable LRU and therefore participates
into stats. What about memcg accounting. What is the cross fork (CoW)/exec
behavior. How is the memory reflected in OOM situation? Is a shared
Anyway, thanks for improving the changelog. This is definitely much more
I have only glanced through the implementation and it looks sane. I will
have a closer look later but this should be pretty simple with the