On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt(a)linux.ibm.com>
>
> Introduce "memfd_secret" system call with the ability to create memory
> areas visible only in the context of the owning process and not mapped not
> only to other processes but in the kernel page tables as well.
>
> The secretmem feature is off by default and the user must explicitly enable
> it at the boot time.
>
> Once secretmem is enabled, the user will be able to create a file
> descriptor using the memfd_secret() system call. The memory areas created
> by mmap() calls from this file descriptor will be unmapped from the kernel
> direct map and they will be only mapped in the page table of the owning mm.
Is this really true? I guess you meant to say that the memory will
visible only via page tables to anybody who can mmap the respective file
descriptor. There is nothing like an owning mm as the fd is inherently a
shareable resource and the ownership becomes a very vague and hard to
define term.
Hmm, it seems I've been dragging this paragraph from the very first
mmap(MAP_EXCLUSIVE) rfc and nobody (including myself) noticed the
inconsistency.
> The file descriptor based memory has several advantages over
the
> "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> paves the way for VMMs to remove the secret memory range from the process;
I do not understand how it helps to remove the memory from the process
as the interface explicitly allows to add a memory that is removed from
all other processes via direct map.
The current implementation does not help to remove the memory from the
process, but using fd-backed memory seems a better interface to remove
guest memory from host mappings than mmap. As Andy nicely put it:
"Getting fd-backed memory into a guest will take some possibly major work in
the kernel, but getting vma-backed memory into a guest without mapping it
in the host user address space seems much, much worse."
> As secret memory implementation is not an extension of tmpfs or
hugetlbfs,
> usage of a dedicated system call rather than hooking new functionality into
> memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> allows better upwards compatibility.
What is this supposed to mean? What are differences?
Well, the phrasing could be better indeed. That supposed to mean that
they differ in the semantics behind the file descriptor: memfd_create
implements sealing for shmem and hugetlbfs while memfd_secret implements
memory hidden from the kernel.
> The secretmem mappings are locked in memory so they cannot
exceed
> RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> mlock() secretmem range would fail and mlockall() will ignore secretmem
> mappings.
What about munlock?
Isn't this implied? ;-)
I'll add a sentence about it.
--
Sincerely yours,
Mike.