Hey folks, (chiming in very late here...)
>> I think, if you want to build a uAPI for notification of MR
>> break, then you need to show how it fits into the above software model:
>> - How it can be hidden in a RDMA specific library
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?
Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues since it is not meant for this.
But why should the kernel ever need to mangle the CQ? If a lease break
deregisters the MR, the device is expected to generate remote
protection errors on its own.
And in that case, I think we need a query mechanism rather than an event
mechanism so when the application starts seeing protection errors
it can query the relevant MR (I think most if not all devices have that
information in their internal completion queue entries).
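
To make the query idea concrete, here is a rough sketch of what I have in
mind. Nothing below is existing verbs API: the demo_* types, the mr_key
field in the completion, and demo_query_faulting_mr() are all hypothetical
stand-ins, and it assumes the device can report the offending key in the
error completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only, NOT libibverbs types. */
enum demo_wc_status { DEMO_WC_SUCCESS, DEMO_WC_PROT_ERR };

struct demo_mr {
	uint32_t lkey;
	bool     revoked;	/* set when a lease break deregistered the MR */
};

struct demo_wc {
	enum demo_wc_status status;
	uint32_t            mr_key;	/* assumed: device reports the offending key */
};

/*
 * Hypothetical query: given an error completion, return the MR that was
 * revoked underneath the application, or NULL if the completion succeeded
 * or the protection error is unrelated to a lease break.
 */
static struct demo_mr *demo_query_faulting_mr(struct demo_mr *mrs, size_t n,
					       const struct demo_wc *wc)
{
	assert(mrs != NULL && wc != NULL);

	if (wc->status != DEMO_WC_PROT_ERR)
		return NULL;	/* success path: no query, no extra cost */

	for (size_t i = 0; i < n; i++)
		if (mrs[i].lkey == wc->mr_key && mrs[i].revoked)
			return &mrs[i];

	return NULL;
}
```

The point is that the lookup only runs once an error completion has
already been seen, so nothing is added to the normal completion path.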
So, you need a side channel of some kind, either in certain drivers or
>> - How lease break can be done hitlessly, so the library user never
>> needs to know it is happening or see failed/missed transfers
I agree that the application should not be aware of lease breakages, but
seeing failed transfers is perfectly acceptable given that an access
violation is happening (my assumption is that failed transfers are error
completions reported in the user completion queue). What we need to have
is a framework to help user-space to recover sanely, which is to query
what MR had the access violation, restore it, and re-establish the queue
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.
Yes, if the iommu can be fenced properly it sounds doable.
>> - Whatever fast path checking is needed does not kill performance
> What do you consider a fast path? I was assuming that memory
> registration is a slow path, and iommu operations are asynchronous so
> should not impact performance of ongoing operations beyond typical
> iommu overhead.
ibv_poll_cq() and ibv_post_send() would be a fast path.
Where this struggled before is that once you create a side channel, you
also have to check that side channel, and checking it at high performance
is quite hard. Even quiescing things to be able to tear down the MR
has performance implications on post send...
This is exactly why I think we should not have it, but instead give
building blocks to recover sanely from error completions...