On Fri, Oct 13, 2017 at 10:31 AM, Jason Gunthorpe
<jgunthorpe(a)obsidianresearch.com> wrote:
> On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
> > On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
> > <jgunthorpe(a)obsidianresearch.com> wrote:
> > > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
> > >
> > >> scheme specific to RDMA which seems like a waste to me when we can
> > >> generically signal an event on the fd for any event that affects any
> > >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> > >> so as far as I can see delaying the notification until MR-init is too
> > >> late, too granular, and too RDMA specific.
> > >
> > > But for RDMA a FD is not what we care about - we want the MR handle so
> > > the app knows which MR needs fixing.
> >
> > I'd rather put the onus on userspace to remember where it used a
> > MAP_DIRECT mapping and be aware that all the mappings of that file are
> > subject to a lease break. Sure, we could build up a pile of kernel
> > infrastructure to notify on a per-MR basis, but I think that would
> > only be worth it if leases were range based. As it is, the entire file
> > is covered by a lease instance and all MRs that might reference that
> > file get one notification. That said, we can always arrange for a
> > per-driver callback at lease-break time so that it can do something
> > above and beyond the default notification.
>
> I don't think that really represents how lots of apps actually use
> RDMA.
>
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
>
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
>
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
>
> So, who should be responsible for MR coherency? Today we say the MPI
> is responsible. But we can't really expect the MPI
> to hook SIGIO and somehow try to reverse engineer what MRs are
> impacted from a FD that may not even still be open.

Ok, that's good insight that I didn't have. Userspace needs more help
than just an fd notification.
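
To make that concrete, here is a rough sketch of the bookkeeping a library would otherwise have to carry in order to turn a per-file notification into a list of MRs. Everything in it (struct mr_registry, mr_registry_add(), mr_registry_affected(), the fixed-size table) is hypothetical and not part of libibverbs or any kernel uAPI. It keys on (st_dev, st_ino) rather than the fd because, as you point out, the fd may already be closed:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical bookkeeping, NOT an existing libibverbs or kernel API:
 * remember which file backs each MR so that a per-file lease-break
 * notification can be translated into the set of affected MRs. */
#define MAX_MRS 64

struct mr_record {
    dev_t dev;               /* st_dev of the backing file */
    ino_t ino;               /* st_ino of the backing file */
    unsigned int mr_handle;  /* opaque MR identifier (assumed, e.g. an lkey) */
    int in_use;              /* non-zero when this slot holds a record */
};

struct mr_registry {
    struct mr_record slot[MAX_MRS];
};

/* Record that mr_handle covers memory mapped from file (dev, ino).
 * Returns 0 on success, -1 if the table is full. */
static int mr_registry_add(struct mr_registry *r, dev_t dev, ino_t ino,
                           unsigned int mr_handle)
{
    assert(r != NULL);
    for (size_t i = 0; i < MAX_MRS; i++) {
        if (!r->slot[i].in_use) {
            r->slot[i] = (struct mr_record){ dev, ino, mr_handle, 1 };
            return 0;
        }
    }
    return -1;
}

/* Copy the handles of all MRs backed by (dev, ino) into out[], writing at
 * most max entries. Returns the number of handles written. */
static size_t mr_registry_affected(const struct mr_registry *r, dev_t dev,
                                   ino_t ino, unsigned int *out, size_t max)
{
    size_t n = 0;

    assert(r != NULL);
    for (size_t i = 0; i < MAX_MRS && n < max; i++) {
        const struct mr_record *rec = &r->slot[i];

        if (rec->in_use && rec->dev == dev && rec->ino == ino)
            out[n++] = rec->mr_handle;
    }
    return n;
}
```

Even a table this small only works if whoever calls mmap() and whoever registers the MR share it, which is exactly the link that is opaque in the MPI case.
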

> I think, if you want to build a uAPI for notification of MR lease
> break, then you need to show how it fits into the above software model:
> - How it can be hidden in a RDMA specific library

So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
the solution generic across DAX and non-DAX. What's your feeling for
how well applications are prepared to deal with that status return?
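
On the consumer side, the strawman would look roughly like the sketch below. The sketch_* types are stand-ins so that it compiles without libibverbs (real code would use struct ibv_wc and enum ibv_wc_status from <infiniband/verbs.h>). The "re-register and retry" action is only an assumption about what a library might want to do, not something the verbs API provides today:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for enum ibv_wc_status / struct ibv_wc so the sketch builds
 * without libibverbs. Only the fields used here are modeled. */
enum sketch_wc_status {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_LOC_PROT_ERR,   /* would signal "file coherency lost" */
    SKETCH_WC_OTHER_ERR,
};

struct sketch_wc {
    uint64_t wr_id;                /* work request the completion is for */
    enum sketch_wc_status status;
};

/* Hypothetical policy a library could apply per completion. */
enum wc_action { WC_DONE, WC_REREGISTER_AND_RETRY, WC_FAIL };

static enum wc_action classify_completion(const struct sketch_wc *wc)
{
    assert(wc != NULL);
    switch (wc->status) {
    case SKETCH_WC_SUCCESS:
        return WC_DONE;
    case SKETCH_WC_LOC_PROT_ERR:
        /* Assumed meaning under the strawman: the MR lost its backing. */
        return WC_REREGISTER_AND_RETRY;
    default:
        return WC_FAIL;
    }
}

/* Walk a batch of completions (as a CQ poll would fill in) and collect the
 * wr_ids that need their MR re-registered. retry_ids must have room for n
 * entries. Returns the number of wr_ids collected. */
static size_t collect_retries(const struct sketch_wc *wc, size_t n,
                              uint64_t *retry_ids)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (classify_completion(&wc[i]) == WC_REREGISTER_AND_RETRY)
            retry_ids[count++] = wc[i].wr_id;
    }
    return count;
}
```

Whether existing applications could actually take the retry branch, rather than treating the status as fatal, is the part I'm unsure about.
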

> - How lease break can be done hitlessly, so the library user never
>   needs to know it is happening or see failed/missed transfers

iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.

> - Whatever fast path checking is needed does not kill performance

What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.