Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Stephen Bates
> No, locality matters. If you have a bunch of NICs and bunch of drives
> and the allocator chooses to put all P2P memory on a single drive your
> performance will suck horribly even if all the traffic is offloaded.
Sagi brought this up earlier in his comments about the _find_ function, and we are planning to do something about it in the next version. This might be randomization or a "user-pick", plus a rule to prefer the p2p_dev on an EP when that EP is part of the transaction.
Stephen
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Logan Gunthorpe
On 01/03/18 04:20 PM, Jason Gunthorpe wrote:
> On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen Bates wrote:
> No, locality matters. If you have a bunch of NICs and bunch of drives
> and the allocator chooses to put all P2P memory on a single drive your
> performance will suck horribly even if all the traffic is offloaded.
>
> Performance will suck if you have speed differences between the PCI-E
> devices. Eg a bunch of Gen 3 x8 NVMe cards paired with a Gen 4 x16 NIC
> will not reach full performance unless the p2p buffers are properly
> balanced between all cards.
This would be better solved by choosing the closest devices in the
hierarchy in the p2pmem_find function (i.e. preferring devices involved
in the transaction). Also, based on the current code, it's a bit of a
moot point seeing as it requires all devices to be on the same switch. So
unless you have a giant switch hidden somewhere, you're not going to get
too many NIC/NVMe pairs.
In any case, we are looking at changing both of those things so that
closer devices in the topology are preferred and spanning multiple
switches is allowed. We're also discussing changing it to "user picks",
which simplifies all of this.
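To make the "prefer the closest provider" idea concrete, here is a rough
sketch; the function and the p2p_upstream_distance() helper below are
hypothetical, not the API in this series:

struct pci_dev *pick_p2p_provider(struct pci_dev **clients, int nr_clients,
                                  struct pci_dev **providers, int nr_providers)
{
        struct pci_dev *best = NULL;
        int best_dist = INT_MAX;
        int i, j;

        for (i = 0; i < nr_providers; i++) {
                int dist = 0;

                /*
                 * Sum the hop count from this provider to every client;
                 * a provider that is itself one of the clients contributes
                 * zero against itself and so naturally wins ties.
                 */
                for (j = 0; j < nr_clients; j++)
                        dist += p2p_upstream_distance(providers[i], clients[j]);

                if (dist < best_dist) {
                        best_dist = dist;
                        best = providers[i];
                }
        }
        return best;
}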
Logan
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Logan Gunthorpe
On 01/03/18 03:45 PM, Jason Gunthorpe wrote:
> I can appreciate you might have some special use case for that, but it
> absolutely should require special configuration and not just magically
> happen.
Well, if a driver doesn't want someone doing p2p transfers with its memory,
it shouldn't publish that memory to be used for exactly that purpose.
> You bring up IB adaptor memory - if we put that into the allocator
> then what is to stop the NVMe driver just using it instead of the CMB
> buffer? That would be totally wrong in almost all cases.
If you mean the SQEs in the NVMe driver, look at the code: it
specifically allocates them from its own device. If you mean
nvmet-rdma, then it's using that memory exactly as it was meant to be used.
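For reference, the SQE path amounts to something like this (a simplified
sketch with approximate names, not a verbatim quote of the nvme-pci patch;
bus-address setup for the p2pmem case is elided):

/*
 * The queue's SQEs are carved out of the *local* device's CMB via the
 * p2pmem allocator, falling back to ordinary DMA memory if that fails.
 */
static void *nvme_alloc_sq_cmds(struct pci_dev *pdev, struct device *dmadev,
                                size_t bytes, dma_addr_t *dma_addr)
{
        void *sqes = pci_alloc_p2pmem(pdev, bytes);     /* local CMB only */

        if (sqes)
                return sqes;

        return dma_alloc_coherent(dmadev, bytes, dma_addr, GFP_KERNEL);
}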
Again, if the IB driver doesn't want someone to use that memory for P2P
transfers it shouldn't publish it as such.
> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?
It's not at all dangerous: the code specifically uses only P2P memory
that's local enough, and the majority of the code is there to make sure
it all works in every case.
Honestly, I'd love to go back to the case where the user selects
which p2pmem device to use, but that was very unpopular last year. It
would simplify a bunch of things, though.
Also, no, the reason to use P2P is not performance. Only with very
specific hardware do you get a performance bump, and even then it isn't
all that significant. The reason to use P2P is so you can design
performant systems with small CPUs, less (or slower) DRAM, fewer lanes
to the CPU, and so on.
Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
by Logan Gunthorpe
On 01/03/18 01:53 PM, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
>> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
>> on ppc64 (maybe via an arch hook as it might depend on the processor
>> family). Server powerpc cannot do cachable accesses on IO memory
>> (unless it's special OpenCAPI or nVlink, but not on PCIe).
>
> I think you are right on this - even on x86 we must not create
> cachable mappings of PCI BARs - there is no way that works the way
> anyone would expect.
On x86, even if I try to make a cachable mapping of a PCI BAR, it always
ends up being uncached. The x86 arch code always does the right thing
here... Other arches, not so much.
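For reference, the two mappings being compared are roughly these (plain
kernel APIs, nothing from this series):

static void map_bar_both_ways(struct pci_dev *pdev, int bar)
{
        void __iomem *mmio;
        void *wb;

        /* The normal, safe path: an uncached MMIO mapping of the BAR. */
        mmio = pci_iomap(pdev, bar, 0);

        /*
         * Asking for a write-back (cachable) mapping of the same range.
         * Per the discussion above, x86 hands back an uncached mapping
         * here anyway; other arches may not protect you.
         */
        wb = memremap(pci_resource_start(pdev, bar),
                      pci_resource_len(pdev, bar), MEMREMAP_WB);

        if (wb)
                memunmap(wb);
        if (mmio)
                pci_iounmap(pdev, mmio);
}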
Logan
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Logan Gunthorpe
On 01/03/18 11:42 AM, Jason Gunthorpe wrote:
> On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> This is also why I don't entirely understand why this series has a
> generic allocator for p2p mem, it makes little sense to me.
> Why wouldn't the nvme driver just claim the entire CMB of its local
> device for its own use? Why involve p2p core code in this?
We'd prefer to have a generic way to get p2pmem instead of restricting
ourselves to only using CMBs. We did work in the past where the P2P
memory was part of an IB adapter and not the NVMe card, so that work
wouldn't be possible with an NVMe-only interface.
As Stephen mentioned, we also use a couple of devices that only expose P2P
memory, and this isn't related to NVMe at all. So there's value in having
a generic interface and allocator that lets any device provide this
memory.
If there were a hypothetical situation where a driver wanted to use some
of the memory for P2P and some of it for other purposes, it would just
divide the memory itself and pass only a subset to
pci_p2pdma_add_resource(). However, as per our changes to the nvme-pci
code, it's really just easier to use the allocator for everything inside
the driver.
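For illustration, the "divide it yourself" case would look roughly like
this, assuming a (bar, size, offset) argument order for
pci_p2pdma_add_resource(); the BAR number and split are made up:

/*
 * Hypothetical driver dividing its BAR itself: only the upper half of
 * BAR 4 is published to the p2pdma allocator, the lower half stays
 * private to the driver.
 */
static int foo_publish_p2pmem(struct pci_dev *pdev)
{
        size_t half = pci_resource_len(pdev, 4) / 2;

        /* bar = 4, size = half, offset = half: publish the upper half. */
        return pci_p2pdma_add_resource(pdev, 4, half, half);
}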
Logan
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
by Stephen Bates
> I agree, I don't think this series should target anything other than
> using p2p memory located in one of the devices expected to participate
> in the p2p transaction for a first pass.
I disagree. There is definitely interest in using an NVMe CMB as a bounce buffer, and in deploying systems where only some of the NVMe SSDs below a switch have a CMB but P2P is used to access all of them. Also, there are some devices that only expose memory and whose entire purpose is to act as a p2p device; supporting these devices would be valuable.
> locality is super important for p2p, so I don't think things should
> start out in a way that makes specifying the desired locality hard.
Ensuring that the EPs engaged in p2p are all directly connected to the same PCIe switch guarantees locality and (for the switches we have tested) performance. I agree that solving the case where the namespace and CMB are on the same PCIe EP is valuable, but I don't see it as critical to initial acceptance of the series.
Stephen
[PATCH v4 0/3] minimal DAX support for XFS realtime device
by Dave Jiang
Darrick,
After reading the comments from you, Dave Chinner, and Dan, it looks like
dynamic S_DAX flag support won't be coming, at least not any time soon.
Here is the collection of patches so far to address your and Dave C's
comments for minimal support. Please let me know what else I am
missing. Thanks!
v3:
- Removed setting of error return in ext2 and ext4 per Ross's comments
- Rebased against 4.16-rc1 with updates
---
Darrick J. Wong (1):
fs: allow per-device dax status checking for filesystems
Dave Jiang (2):
dax: change bdev_dax_supported() to support boolean returns
xfs: reject removal of realtime flag when datadev doesn't support DAX
drivers/dax/super.c | 27 ++++++++++++++-------------
fs/ext2/super.c | 3 +--
fs/ext4/super.c | 3 +--
fs/xfs/xfs_ioctl.c | 18 +++++++++++++++++-
fs/xfs/xfs_iops.c | 30 +++++++++++++++++++++++++-----
fs/xfs/xfs_super.c | 11 +++++++++--
include/linux/dax.h | 18 +++++++++++++-----
7 files changed, 80 insertions(+), 30 deletions(-)
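For reviewers skimming the thread, the xfs-side check is roughly along
these lines (a sketch of the idea only, not the actual patch; helper and
field names are approximate, and bdev_dax_supported() is assumed in its
post-series form, taking a block device plus blocksize and returning a
boolean):

/*
 * If a DAX file currently lives on the realtime device, don't let the
 * realtime flag be cleared unless the data device can also do DAX.
 */
static int
xfs_ioctl_check_rt_dax(struct xfs_inode *ip, unsigned int new_xflags)
{
        struct xfs_mount *mp = ip->i_mount;

        if (!IS_DAX(VFS_I(ip)))
                return 0;

        if (XFS_IS_REALTIME_INODE(ip) &&
            !(new_xflags & FS_XFLAG_REALTIME) &&
            !bdev_dax_supported(mp->m_ddev_targp->bt_bdev,
                                mp->m_sb.sb_blocksize))
                return -EINVAL;

        return 0;
}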
--