On Mon, Oct 19, 2020 at 3:30 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) <nmeeramohide(a)micron.com> wrote:
> On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide(a)micron.com> wrote:
> > > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig
> > > <hch(a)infradead.org> wrote:
> > > > I don't think this belongs into the kernel. It is a classic case for
> > > > infrastructure that should be built in userspace. If anything is
> > > > missing to implement it in userspace with equivalent performance we
> > > > need to improve our interfaces, although io_uring should cover pretty
> > > > much everything you need.
> > > Hi Christoph,
> > > We previously considered moving the mpool object store code to user-space.
> > > However, by implementing mpool as a device driver, we get several benefits
> > > in terms of scalability, performance, and functionality. In doing so, we relied
> > > only on standard interfaces and did not make any changes to the kernel.
> > > (1) mpool's "mcache map" facility allows us to memory-map (and unmap)
> > > a collection of logically related objects with a single system call. The objects in
> > > such a collection are created at different times, are physically disparate, and may
> > > even reside on different media class volumes.
> > > For our HSE storage engine application, there are commonly 10's to 100's of
> > > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> > > Compared to memory-mapping objects individually, the mcache map facility
> > > scales well because it requires only a single system call and a single
> > > vm_area_struct to memory-map a complete collection of objects.
> > Why can't that be a batch of mmap calls on io_uring?
> Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
> system call overhead of memory-mapping individual objects, versus our mcache map
> mechanism. However, there is still the scalability issue of having a vm_area_struct
> for each object (versus one for each mcache map).
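
For illustration only, here is a rough sketch of the per-object pattern being
described (this is not the HSE/mpool code; it assumes each object is reachable
through its own file descriptor). Even if the calls were batched, each mmap()
would still create its own vm_area_struct:

/* Illustrative sketch: map N objects one at a time.  Every call below is
 * a separate system call and results in a separate vm_area_struct.
 */
#include <stddef.h>
#include <sys/mman.h>

int map_each_object(int fds[], size_t sizes[], void *maps[], int n)
{
	int i;

	for (i = 0; i < n; i++) {
		maps[i] = mmap(NULL, sizes[i], PROT_READ, MAP_SHARED,
			       fds[i], 0);
		if (maps[i] == MAP_FAILED)
			return -1;
	}
	return 0;	/* n syscalls, n VMAs */
}
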
> We ran YCSB workload C in two different configurations:
> Config 1: memory-mapping each individual object
> Config 2: memory-mapping a collection of related objects using mcache map
> - Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab:
> 24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.
> - Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2;
> we're not sure if it's due to the reduced complexity of searching VMAs during page faults.
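
As an aside, VMA counts like the ones above are easy to observe without any
mpool-specific tooling: each line in /proc/<pid>/maps corresponds to one
vm_area_struct, and the slab usage shows up on the vm_area_struct line of
/proc/slabinfo. A rough sketch (not the actual test harness):

/* Illustrative sketch: count the current process's VMAs by counting the
 * lines in /proc/self/maps (one line per vm_area_struct).
 */
#include <stdio.h>

long count_vmas(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	long vmas = 0;
	int c;

	if (!f)
		return -1;
	while ((c = fgetc(f)) != EOF)
		if (c == '\n')
			vmas++;
	fclose(f);
	return vmas;
}
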
So this gets to the meta question that is giving me pause on this:
What does Linux get from merging mpool?
What you have above is a decent scalability bug report. That type of
pressure to meet new workload needs is how Linux interfaces evolve.
However, rather than evolve those interfaces, mpool is a revolutionary
replacement that leaves the bugs intact for everyone who does not
switch over to mpool.
Consider io_uring as an example where the kernel resisted trends
towards userspace I/O engines and instead evolved a solution that
maintained kernel control while also achieving similar performance.
The exercise is useful to identify places where Linux has
deficiencies, but wholesale replacing an entire I/O submission model
is a direction that leaves the old APIs to rot.