On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,
So Oliver (CC) was having issues getting any of that to work for us.
The problem is that according to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.
This cannot possibly work for us. First we cannot map PCIe memory as
cacheable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway since you'd have to manage cache coherency in SW).
Note: I think the above means it won't work behind a switch on x86
either, will it ?
Then our MMIO space is so far away from our memory space that there
is not enough vmemmap virtual space to be able to do that.
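To put rough numbers on the vmemmap point, here is a minimal userspace sketch in C. It assumes a vmemmap-style layout where the struct page array is indexed directly by page frame number, so reaching the struct page of a high physical address needs virtual space proportional to that address. The helper names, the page size, the struct page size and the window size are all illustrative assumptions, not values taken from any particular platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Bytes of vmemmap virtual space needed so that the struct page for
 * highest_phys_addr is addressable, assuming the array is indexed by PFN.
 * All parameters are hypothetical inputs chosen by the caller.
 */
static uint64_t toy_vmemmap_span_bytes(uint64_t highest_phys_addr,
				       uint64_t page_size,
				       uint64_t struct_page_size)
{
	assert(page_size > 0);

	uint64_t pfn = highest_phys_addr / page_size;

	/* PFNs 0..pfn inclusive each need one struct page slot. */
	return (pfn + 1) * struct_page_size;
}

/* True if the span above fits in a vmemmap window of window_bytes. */
static bool toy_fits_in_vmemmap(uint64_t highest_phys_addr,
				uint64_t page_size,
				uint64_t struct_page_size,
				uint64_t window_bytes)
{
	return toy_vmemmap_span_bytes(highest_phys_addr, page_size,
				      struct_page_size) <= window_bytes;
}
```

With 64 KiB pages and a 64-byte struct page (both assumptions), an MMIO window near 1 << 60 would need roughly 1 << 50 bytes of vmemmap virtual space, while RAM near 1 << 40 needs about 1 << 30.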
So this can only work across architectures by using something like HMM
to create special device struct pages.
> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
> Additionally, we've made the following changes based on feedback:
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> as a bunch of cleanup and spelling fixes he pointed out in the last
> * To address Alex's ACS concerns, we change to a simpler method of
> just disabling ACS behind switches for any kernel that has
> * We also reject using devices that employ 'dma_virt_ops' which should
> fairly simply handle Jason's concerns that this work might break with
> the HFI, QIB and rxe drivers that use the virtual ops to implement
> their own special DMA operations.
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> provided valuable feedback to get these patches to where they are today.
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVMe target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, the trade-off is
> currently a reduction in overall throughput (largely due to hardware
> issues that would certainly improve in the future).
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch. This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We chose to go this route based on feedback we
> received at the last LSF). Future work may enable these transfers behind
> a fabric of PCI switches or perhaps using a white list of known good
> root complexes.
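The "all behind the same PCI switch" restriction can be pictured with a small userspace model in C. This is only a sketch under stated assumptions: struct toy_pci_dev and both helpers are hypothetical and not part of the series, and the model assumes every endpoint sits directly below a switch downstream port, so walking two bridges up lands on the switch upstream port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PCI node: only the link to the bridge directly above it. */
struct toy_pci_dev {
	const char *name;
	struct toy_pci_dev *parent;	/* NULL above a root port */
};

/*
 * Endpoint -> switch downstream port -> switch upstream port.
 * Two hops up identifies the switch in this model. Returns NULL when
 * the device hangs directly off a root port (no switch in between).
 */
static struct toy_pci_dev *toy_upstream_port(const struct toy_pci_dev *dev)
{
	assert(dev != NULL);

	if (!dev->parent)
		return NULL;
	return dev->parent->parent;
}

/*
 * True only if the memory provider and every client share the same
 * switch upstream port. Anything else is rejected, even if the
 * hardware might cope with it.
 */
static bool toy_all_behind_same_switch(const struct toy_pci_dev *provider,
				       struct toy_pci_dev *const *clients,
				       size_t nr_clients)
{
	struct toy_pci_dev *sw = toy_upstream_port(provider);
	size_t i;

	if (!sw)
		return false;

	for (i = 0; i < nr_clients; i++) {
		if (toy_upstream_port(clients[i]) != sw)
			return false;
	}
	return true;
}
```

Rejecting by default mirrors the conservative choice described above: a topology that might work is refused rather than leaving the user to reason about it.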
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
> Another set of functions allows a client driver to create a list of
> client devices that will be used in a given P2P transaction and then
> use that list to find any P2P memory that is supported by all the
> client devices. This list is then also used to selectively disable the
> ACS bits for the downstream ports behind these devices.
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged since a non-homogeneous request would
> complicate the DMA mapping requirements.
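The flag rules in the paragraph above can be sketched as a tiny userspace model in C. The TOY_* flag names, the structs and the helpers are hypothetical stand-ins for illustration and are not the identifiers used in the patches. The sketch assumes three rules: a P2P request is only accepted by a queue that advertises P2P support, ordinary requests are accepted everywhere, and a P2P request never merges with another request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, not the kernel's real names or values. */
#define TOY_REQ_P2P	0x1u	/* request targets P2P memory */
#define TOY_REQ_NOMERGE	0x2u	/* request must not be merged */
#define TOY_QUEUE_P2P	0x1u	/* queue can handle P2P memory */

struct toy_request {
	unsigned int flags;
};

struct toy_queue {
	unsigned int flags;
};

/* Marking a request as P2P also forbids merging it. */
static void toy_mark_p2p(struct toy_request *rq)
{
	assert(rq != 0);
	rq->flags |= TOY_REQ_P2P | TOY_REQ_NOMERGE;
}

/* P2P requests need a P2P-capable queue; other requests always pass. */
static bool toy_queue_accepts(const struct toy_queue *q,
			      const struct toy_request *rq)
{
	if (rq->flags & TOY_REQ_P2P)
		return (q->flags & TOY_QUEUE_P2P) != 0;
	return true;
}

/* A merge is allowed only if neither side carries the no-merge bit. */
static bool toy_can_merge(const struct toy_request *a,
			  const struct toy_request *b)
{
	return ((a->flags | b->flags) & TOY_REQ_NOMERGE) == 0;
}
```

Keeping P2P requests unmerged means every segment of a request needs the same kind of DMA mapping, which is the simplification the paragraph describes.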
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the correct pci_bus_addr_t for the DMA transactions.
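The kind of transformation involved can be sketched in userspace C as follows. This is not the body of pci_p2pmem_map_sg(); the types, the helper names and the sign convention of the offset are assumptions made for illustration. The sketch assumes a BAR is visible at one address from the CPU and at another on the PCI bus, and that each scatterlist entry inside the BAR is translated by the difference between the two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy description of one BAR: where the CPU and the bus each see it. */
struct toy_bar {
	uint64_t cpu_start;	/* CPU physical address of the BAR */
	uint64_t bus_start;	/* PCI bus address of the same BAR */
	uint64_t len;
};

/* Toy scatterlist entry. */
struct toy_sg_entry {
	uint64_t phys;		/* CPU physical address of the segment */
	uint64_t len;
	uint64_t dma_addr;	/* filled in by toy_map_sg() */
};

/*
 * Offset to add to a CPU physical address to get a bus address.
 * Computed modulo 2^64, so it is also correct when bus_start is
 * below cpu_start.
 */
static uint64_t toy_bus_offset(const struct toy_bar *bar)
{
	return bar->bus_start - bar->cpu_start;
}

/*
 * Translate every entry to a bus address. Returns the number of
 * entries mapped, or 0 if any entry falls outside the BAR.
 */
static size_t toy_map_sg(const struct toy_bar *bar,
			 struct toy_sg_entry *sg, size_t nents)
{
	uint64_t off = toy_bus_offset(bar);
	size_t i;

	assert(bar != NULL && sg != NULL);

	for (i = 0; i < nents; i++) {
		if (sg[i].phys < bar->cpu_start || sg[i].len > bar->len ||
		    sg[i].phys - bar->cpu_start > bar->len - sg[i].len)
			return 0;
		sg[i].dma_addr = sg[i].phys + off;
	}
	return nents;
}
```

The point of the sketch is that no IOMMU or streaming DMA mapping is involved for such memory: the peer device is simply handed the address at which the BAR appears on the bus.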
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not.
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
> Logan Gunthorpe (10):
> PCI/P2PDMA: Support peer to peer memory
> PCI/P2PDMA: Add sysfs group to display p2pmem stats
> PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> block: Introduce PCI P2P flags for request and request queue
> IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> nvme-pci: Add support for P2P memory in requests
> nvme-pci: Add a quirk for a pseudo CMB
> nvmet: Optionally use PCI P2P memory
> Documentation/ABI/testing/sysfs-bus-pci | 25 ++
> block/blk-core.c | 3 +
> drivers/infiniband/core/rw.c | 21 +-
> drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
> drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
> drivers/nvme/host/core.c | 4 +
> drivers/nvme/host/nvme.h | 8 +
> drivers/nvme/host/pci.c | 118 ++++--
> drivers/nvme/target/configfs.c | 29 ++
> drivers/nvme/target/core.c | 95 ++++-
> drivers/nvme/target/io-cmd.c | 3 +
> drivers/nvme/target/nvmet.h | 10 +
> drivers/nvme/target/rdma.c | 43 +-
> drivers/pci/Kconfig | 20 +
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 713 ++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 4 +
> include/linux/blk_types.h | 18 +-
> include/linux/blkdev.h | 3 +
> include/linux/memremap.h | 19 +
> include/linux/pci-p2pdma.h | 105 +++++
> include/linux/pci.h | 4 +
> include/rdma/rw.h | 7 +-
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
> 24 files changed, 1204 insertions(+), 67 deletions(-)
> create mode 100644 drivers/pci/p2pdma.c
> create mode 100644 include/linux/pci-p2pdma.h