Hi,
Looking at the NVMeF discovery specifications, I noticed that the port ID should also be
enumerated uniquely per NVMeF subsystem (NVMeF 1.4, section 1.5.2):
"Each NVM subsystem port has a 16-bit port identifier (Port ID). An NVM subsystem
port is identified by the NVM Subsystem NQN and Port ID."
Am I correct in my understanding that the Port ID should also be enumerated uniquely?
Currently the implementation enumerates ports according to the order in which they
appear in the discovery log page.
In a distributed application, what would be the best method of doing this? The simplest
option for me would be to add an optional parameter to the "add listener" RPC that
allows the user to assign the port an ID. If it isn't set, a default invalid value
(e.g. 0xFFFF) is used and we fall back to the old method; otherwise we use the user
defined port ID.
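To illustrate, the selection logic would be roughly the following; the names below are
made up for illustration and are not existing SPDK code:

#include <stdint.h>

/* Hypothetical sentinel meaning "the user did not pass a port ID to the add-listener RPC". */
#define PORT_ID_UNSET 0xFFFF

/*
 * Pick the Port ID for a new listener: prefer the user-supplied value from the
 * RPC; otherwise fall back to the current behavior of enumerating ports in the
 * order in which they appear in the discovery log page.
 */
static uint16_t
choose_port_id(uint16_t user_port_id, uint16_t next_enumerated_id)
{
        if (user_port_id != PORT_ID_UNSET) {
                return user_port_id;
        }
        return next_enumerated_id;
}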
WDYT?
Shahar
________________________________
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Shahar Salzman
<shahar.salzman(a)kaminario.com>
Sent: Thursday, July 11, 2019 12:51 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Sharing namespaces between subsystems
You are correct in your understanding of how a client sees both nodes as different ports
of the same subsystem.
Regarding your points below:
Reservations: we currently do not implement reservations for NVMeF, but for SCSI we use
our cluster management to synchronize the reservation state. I am still not sure how we
will implement this in SPDK. We will probably start dealing with this when supporting
ESX.
Subsystem synchronization issues: we have a cluster management layer that performs all of
the configuration operations via RPC, so all SPDK nodes contain the same state. We are
using the RPCs as-is, so we are not performing the quiesce in a unified manner, which may
leave a small window during which the nodes are inconsistent (until all RPCs return).
The discovery service is derived from the subsystem state, which is synchronized via
management, i.e. when a management operation completes, all active nodes expose the same
subsystems.
Namespace ID: we provide the same namespace ID for every exposed device. Actually, from
this perspective it is better for us to use the multiple-subsystem approach, since we
already generate a worldwide-unique identifier for these devices, as they may also be
exposed via SCSI (FC or iSCSI).
________________________________
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Walker, Benjamin
<benjamin.walker(a)intel.com>
Sent: Thursday, July 11, 2019 11:30 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Sharing namespaces between subsystems
-----Original Message-----
From: SPDK [mailto:spdk-bounces@lists.01.org] On Behalf Of Shahar Salzman
Sent: Thursday, July 11, 2019 9:32 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] Sharing namespaces between subsystems
Hi,
In order to support native NVMe multipathing on Linux, I am looking at the
specification requirements regarding controller IDs.
Our system is a distributed system exposing the same logical devices through
multiple physical hosts, each running its own SPDK instance.
So say you have node A and node B, and they both expose namespace '123' where
'123' is the namespace UUID. If I'm understanding your scenario correctly,
you'd like a client to see node A and node B as the same subsystem, with two valid
paths to get there. Is that correct?
Looking at the code (v19.04), I see that controller IDs are generated in the 0-
0xFFF0 range and verified to be unique within the subsystem before being assigned
to a new controller. This method of serially generating the controller ID means
that different nodes will probably generate the same controller ID, so the host
may identify a new controller as one that already exists.
This means I need to either limit the controller ID range per SPDK instance and
remain spec-aligned, or expose a different subsystem per physical host, which solves
the controller ID issue but does not conform to the spec...
The SPDK NVMe-oF target is not distributed in and of itself. To make it distributed,
you'll need changes so that the code correctly coordinates with all of the nodes that
compose the subsystem. Certainly selecting a unique controller identifier is one area of
coordination, but there are likely many more (basically anything stored in struct
spdk_nvmf_subsystem).
I looked at the namespace ID section in NVMe 1.4, and there doesn't seem to be
any mention of worldwide uniqueness, so it seems that the correct
implementation would be to limit the controller ID range. Would an API to limit
the controller ID range in SPDK be acceptable?
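To make the question concrete, this is roughly what I had in mind; the struct and
function below are made up for illustration and are not existing SPDK code. Each SPDK
instance would be configured with a disjoint slice of the 16-bit controller ID space
(for example node 0 owns 0x0001-0x3FFF and node 1 owns 0x4000-0x7FFF), and the existing
serial generator would simply stay inside that slice:

#include <stdint.h>

/* Hypothetical per-instance controller ID window (not existing SPDK code). */
struct cntlid_range {
        uint16_t min;   /* first controller ID this node may hand out */
        uint16_t max;   /* last controller ID this node may hand out */
        uint16_t next;  /* next candidate, starts at min */
};

/*
 * Serially allocate the next controller ID from the node-local window,
 * wrapping around inside the window. The subsystem would still verify
 * uniqueness of the returned value exactly as the current code does.
 */
static uint16_t
cntlid_range_alloc(struct cntlid_range *r)
{
        uint16_t cntlid = r->next;

        r->next = (r->next == r->max) ? r->min : (uint16_t)(r->next + 1);
        return cntlid;
}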
Do you know of any work being done on namespace sharing between
subsystems, and on worldwide-unique namespace IDs?
There has been some discussion of a new NMIC bit that indicates that a namespace can be
shared across two separate subsystems (there is already a bit that says whether it can be
shared across two controllers in the same subsystem). But I confirmed that is not in the
latest specification. I think sharing a namespace across two separate subsystems is
actually a more elegant solution to the problem, so we can hope they decide to move
forward with that.
I'd be fine with an API that lets the user provide a callback to generate controller
IDs for each subsystem (a rough sketch follows the list below). Then your application can
set it up to work however you want. My primary concern is that this may just be the tip
of the iceberg, so I'd like to hold off on going this route until we understand all of
the different pieces of data that are going to need coordination across the nodes. Just
from a quick glance, some problematic things are:
1) Reservations (which belong to a namespace and are emulated in software; I think you
just have to disable this)
2) Discovery services (I assume you have a separate discovery service that is
cluster-aware and the one in SPDK is turned off?)
3) Namespace IDs (which I think we already let the user pick)
4) Subsystem state (pause/resume). I don't think SPDK has RPCs to pause and resume
subsystems directly. The pause and resume just happens automatically when you do some
other management operation like add or remove a namespace. However, in a distributed
implementation you'd need the orchestrator to pause the subsystem on all nodes, then
do the management operation, then resume on all nodes. I think you'd need additional
RPCs for this. The pause and resume functions are part of the public nvmf API, so maybe
your application is calling those and this is all fine.
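For what it's worth, the callback idea above would look roughly like the following;
these are hypothetical declarations for illustration only, not part of the SPDK public
API:

#include <stdint.h>

struct spdk_nvmf_subsystem;     /* opaque subsystem handle, as in the public nvmf API */

/*
 * Hypothetical hook: the application returns the controller ID to use for a
 * new controller in the given subsystem. The target would still reject
 * duplicates within the subsystem.
 */
typedef uint16_t (*nvmf_cntlid_gen_fn)(struct spdk_nvmf_subsystem *subsystem,
                                       void *cb_arg);

/* Hypothetical setter, one callback per subsystem. */
void nvmf_subsystem_set_cntlid_gen(struct spdk_nvmf_subsystem *subsystem,
                                   nvmf_cntlid_gen_fn gen_fn, void *cb_arg);

A distributed application could then implement the callback on top of whatever
node-local range or cluster-wide allocator it already uses.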
Have you already thought through these sorts of problems? Is it just the controller ID
that you haven't solved?
Thanks,
Ben
Shahar
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk