Thank you very much Jim and Ben!
Maybe I have an incorrect mental image of the queue.
My understanding is that the queue depth is the capacity of the command
queue pair. Even though the queue size field supports up to 65,536 entries,
the actual space allocated for the queue pair is a far smaller number (e.g.
128), to match the SSD's internal parallelism (determined by the number of
independent components such as planes/dies).
The I/O requests in the queue pair are processed in FIFO order. I/O
threads put I/O commands in the submission queue and ring the doorbell,
while the SSD takes commands out of the submission queue, DMAs the data,
puts a completion in the completion queue, and finally rings the doorbell.
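To make sure I am describing that model precisely, here is a rough C sketch
of how I picture the submission side (the struct layout, field names, and
sizes are purely illustrative, not the actual spec structures or the SPDK
driver):

#include <stdint.h>
#include <string.h>

/* Illustrative only: a 64-byte NVMe-style submission queue entry. */
struct sq_entry {
    uint8_t opcode;
    uint8_t rest[63];
};

struct sq {
    struct sq_entry   *entries;   /* host memory backing the ring       */
    uint16_t           depth;     /* e.g. 128, far below the 64K limit  */
    uint16_t           tail;      /* next free slot, owned by the host  */
    volatile uint32_t *doorbell;  /* device register for this SQ's tail */
};

/* Host side: place a command in the next slot, then ring the doorbell. */
static void submit(struct sq *q, const struct sq_entry *cmd)
{
    memcpy(&q->entries[q->tail], cmd, sizeof(*cmd));
    q->tail = (q->tail + 1) % q->depth;   /* wrap around the ring          */
    *q->doorbell = q->tail;               /* tell the SSD new work arrived */
}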
A related question for me to understand is: in the single-thread case, why
does IOPS scale as the QD increases, before saturation?
I understand that when the SSD is saturated, it cannot keep up with the
speed at which the host generates I/Os, so the submission queue is always
full. The IOPS will equal the SSD's max IOPS. The average latency will grow
as the queue depth increases, since the requests are handled one by one by
the device: the more requests ahead in the queue, the longer the wait (or
latency).
However, when the SSD is not saturated, it is fast enough to process the
requests, i.e., to deplete the submission queue. Therefore, the submission
queue is empty most of the time, and the majority of the space (command
slots) allocated for the queue pair is wasted. So a queue depth of 128
should be equivalent to a queue depth of 1, and thus the IOPS should be the
same.
However, the data does show that IOPS increases as the QD grows. I am just
wondering at which point I go astray.
Or, in the first place, why does a large queue depth saturate the SSD while
a small QD does not, given that the host is always generating I/Os fast
enough?
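To make the question concrete, here is the back-of-the-envelope arithmetic
behind my confusion (the 100 us latency is a number I made up purely for
illustration). If the device truly handles requests strictly one at a time,
IOPS should stay pinned at 1/latency regardless of QD; if it can work on QD
requests concurrently, Little's Law gives roughly IOPS = QD / latency until
the device saturates:

#include <stdio.h>

int main(void)
{
    const double lat_s = 100e-6;            /* made-up 100 us per-I/O latency */
    const int    qds[] = { 1, 4, 32, 128 };

    for (int i = 0; i < 4; i++) {
        int qd = qds[i];
        double serial_iops  = 1.0 / lat_s;  /* strictly one-at-a-time model   */
        double overlap_iops = qd / lat_s;   /* Little's Law with QD in flight */
        printf("QD=%3d  serial=%8.0f IOPS  overlapped=%9.0f IOPS\n",
               qd, serial_iops, overlap_iops);
    }
    return 0;
}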
Thanks!
-Fenggang
On Wed, Jan 31, 2018 at 12:22 PM Walker, Benjamin <benjamin.walker(a)intel.com>
wrote:
On Wed, 2018-01-31 at 17:49 +0000, Fenggang Wu wrote:
> Hi All,
>
> I read from the SPDK doc "NVMe Driver Design -- Scaling Performance"
> (here), which says:
>
> "For example, if a device claims to be capable of 450,000 I/O per second
> at queue depth 128, in practice it does not matter if the driver is using
> 4 queue pairs each with queue depth 32, or a single queue pair with queue
> depth 128."
>
> Does this consider the queuing latency? I am guessing the latency in the
> two cases will be different (qp/qd = 4/32 vs. qp/qd = 1/128). In the
> 4-thread case, the latency would be 1/4 of that in the 1-thread case. Do I
> get it right?
Officially, it is entirely up to the internal design of the device. But for
the NVMe devices I've encountered on the market today, you can use as a
mental model a single thread inside the SSD processing incoming messages
that correspond to doorbell writes. It simply takes the doorbell write
message, does simple math to calculate where the command is located in host
memory, and then issues a DMA to pull it into device-local memory. It
doesn't matter which queue the I/O is on - the math is the same. So no, the
latency of 1 queue pair at 128 queue depth is the same as 4 queue pairs at
32 queue depth.
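Roughly, in C-like pseudocode (nothing below is actual firmware or the SPDK
driver; the names, layout, and DMA helper are invented just to show the
shape of the idea):

#include <stdint.h>

/* Invented names/layout, for illustration only. */
struct sq_state {
    uint64_t host_base_addr;   /* host address of the SQ ring            */
    uint16_t entry_size;       /* 64 bytes for an NVMe submission entry  */
    uint16_t head;             /* next entry the device will fetch       */
};

/* Stand-in for the device's DMA engine. */
static void dma_read_from_host(uint64_t host_addr, uint16_t len)
{
    (void)host_addr; (void)len;
}

/* The single internal thread handling a doorbell write for some queue:
 * the address math is identical no matter which queue the command is on. */
static void on_doorbell_write(struct sq_state *sq, uint16_t new_tail)
{
    while (sq->head != new_tail) {
        uint64_t cmd_addr = sq->host_base_addr +
                            (uint64_t)sq->head * sq->entry_size;
        dma_read_from_host(cmd_addr, sq->entry_size);
        sq->head++;                     /* ring wrap omitted for brevity */
    }
}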
> If so, then I got confused as the document also says:
>
> "In order to take full advantage of this scaling, applications should
> consider organizing their internal data structures such that data is
> assigned exclusively to a single thread."
>
> Please correct me if I get it wrong. I understand that if the dedicated
> I/O thread has total ownership of the I/O data structures, there is no
> lock contention to slow down the I/O. I believe that BlobFS is also
> designed with this philosophy in that only one thread is doing I/O.
>
> But considering the RocksDB case, if the shared data structures have
> already been largely taken care of by the RocksDB logic via locking
> (which is inevitable anyway), each RocksDB thread that sends I/O requests
> to BlobFS could also have its own queue pair to do I/O. More I/O threads
> mean a shorter queue depth and a smaller queuing delay. Even if there are
> some FS metadata operations that may require locking, I would guess such
> metadata operations make up only a small portion.
>
> Therefore, is it a viable idea to have more I/O threads in BlobFS to
> serve the multi-threaded RocksDB for a smaller delay? What would be the
> pitfalls, or challenges?
You're right that RocksDB has already worked out all of its internal data
sharing using locks. It then uses a thread pool to issue simultaneous
blocking I/O requests to the filesystem. That's where the SPDK RocksDB
backend intercepts. As you suspect, the filesystem itself (BlobFS, in this
case) has shared data structures that must be coordinated for some
operations (creating and deleting files, resizing files, etc. - but not
regular read/write). That's a small part of the reason why we elected, in
our first attempt at writing a RocksDB backend, to route all I/O from each
thread in the thread pool to a single thread doing asynchronous I/O.
The main reason we route all I/O to a single thread, however, is to minimize
CPU usage. RocksDB makes blocking calls on all threads in the thread pool.
We could implement that in SPDK by spinning in a tight loop, polling for the
I/O to complete. But that means every thread in the RocksDB thread pool
would be burning a full core. Instead, we send all I/O to a single thread
that is polling for completions, and put the threads in the pool to sleep on
a semaphore. When an I/O completes, we send a message back to the
originating thread and kick the semaphore to wake it up. This introduces
some latency (the rest of SPDK is more than fast enough to compensate for
that), but it saves a lot of CPU usage.
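In rough C-like pseudocode, the pattern looks like this (not the actual SPDK
RocksDB env code; the message-passing helper is a made-up stand-in that
completes immediately just to keep the sketch self-contained):

#include <semaphore.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the pattern described above; not the real SPDK RocksDB backend. */
struct blocking_io {
    sem_t done;      /* the RocksDB pool thread sleeps on this */
    int   result;
};

/* Runs on the single I/O thread when the asynchronous read completes. */
static void read_complete(struct blocking_io *io, int status)
{
    io->result = status;
    sem_post(&io->done);          /* wake the originating pool thread */
}

/* Stand-in for sending a message to the one thread that polls for
 * completions; here it completes immediately to keep the sketch
 * self-contained. */
static void send_to_io_thread(void *buf, size_t len, uint64_t offset,
                              struct blocking_io *io)
{
    (void)buf; (void)len; (void)offset;
    read_complete(io, 0);
}

/* Runs on a RocksDB thread-pool thread: looks like a blocking read. */
static int blocking_read(void *buf, size_t len, uint64_t offset)
{
    struct blocking_io io;
    sem_init(&io.done, 0, 0);
    send_to_io_thread(buf, len, offset, &io);
    sem_wait(&io.done);           /* sleep instead of spinning on a core */
    sem_destroy(&io.done);
    return io.result;
}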
In an ideal world, we'd be integrating with a fully asynchronous K/V
database, where the user could call Put() or Get() and have it return
immediately and call a callback when the data was actually inserted. But
that's just not how RocksDB works today. Even the background thread pool
doing compaction is designed to do blocking operations. It would integrate
with SPDK much better if it instead had a smaller set of threads each doing
asynchronous compaction operations on a whole set of files at once. Changing
RocksDB in this way is a huge lift, but would be an impressive project.
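Purely hypothetically, such an interface might look something like this
(this is not an existing RocksDB or SPDK API; all names are invented):

#include <stddef.h>

/* Hypothetical asynchronous K/V interface: Put()/Get() return immediately
 * and invoke a callback once the operation actually completes. */
typedef void (*kv_done_cb)(void *ctx, int status);

struct kv_store;   /* opaque handle, defined by the (hypothetical) engine */

int kv_put_async(struct kv_store *db,
                 const void *key, size_t key_len,
                 const void *val, size_t val_len,
                 kv_done_cb cb, void *cb_ctx);

int kv_get_async(struct kv_store *db,
                 const void *key, size_t key_len,
                 void *val_buf, size_t val_buf_len,
                 kv_done_cb cb, void *cb_ctx);

/* A caller would issue many of these from one thread and poll its own
 * completion path, instead of parking a pool thread per outstanding I/O. */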
>
>
>
> Any thoughts/comments are appreciated. Thank you very much!
>
> Best!
> -Fenggang
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk