Does BlobFS Asynchronous API support multi thread writing?
by chen.zhenghua@zte.com.cn
Hi everyone,
I did a simple test of the BlobFS asynchronous API, using the SPDK event framework to execute multiple tasks, each of which writes one file.
But it doesn't work: spdk_file_write_async() reported an error when resizing the file.
The call stack looks like this:
spdk_file_write_async() -> __readwrite() -> spdk_file_truncate_async() -> spdk_blob_resize()
The resize operation must be done on the metadata thread, i.e. the thread that invoked spdk_fs_load(), so only the task dispatched to the metadata CPU core works.
That is to say, only one thread can be used to write files. That is hard to use, and performance issues may arise.
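For illustration, here is a minimal sketch of funneling every write to the core that invoked spdk_fs_load(), using the event framework's spdk_event_allocate()/spdk_event_call(). The write_task struct and helper names are only illustrative, and the I/O channel is assumed to have been allocated on that metadata core:

#include <stdlib.h>
#include "spdk/event.h"
#include "spdk/blobfs.h"

/* Illustrative context for one asynchronous write. */
struct write_task {
	struct spdk_file *file;
	struct spdk_io_channel *channel;   /* allocated on the metadata core */
	void *payload;
	uint64_t offset;
	uint64_t length;
};

static void
write_done(void *ctx, int fserrno)
{
	free(ctx);   /* completion runs on the metadata core */
}

static void
do_write(void *arg1, void *arg2)
{
	struct write_task *t = arg1;

	(void)arg2;
	spdk_file_write_async(t->file, t->channel, t->payload,
			      t->offset, t->length, write_done, t);
}

/* Call from any reactor: forward the write to the metadata core. */
static void
submit_write(uint32_t metadata_lcore, struct write_task *t)
{
	spdk_event_call(spdk_event_allocate(metadata_lcore, do_write, t, NULL));
}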
Does anyone know more about this?
thanks very much
1 month
Best practices on driver binding for SPDK in production environments
by Lance Hartmann ORACLE
This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development many (most?) people rely on running SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from the NVMe controllers targeted for use by SPDK and then binding them to either uio_pci_generic or vfio-pci. This script is suitable for development environments, but it is not targeted for use in production systems employing SPDK.
I'd like to confer with my fellow SPDK community members on ideas, suggestions and best practices for handling this driver unbinding/binding. I wrote some udev rules, along with updates to some other Linux system conf files, for automatically loading either the uio_pci_generic or vfio-pci modules. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the needed driver for SPDK operation. And, as a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I might have overlooked on which I'd appreciate input. Further, there's the matter of how and whether to semi-automate this configuration via some kind of script, how that might vary according to Linux distro, and how to determine whether to employ uio_pci_generic vs. vfio-pci.
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I'm currently unaware of how and which configuration files might differ depending on the distro. Oracle Linux is RedHat-compatible, so I'm confident my implementation should run similarly on RedHat-based systems, but I've yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (abbreviated/edited output to avoid this becoming excessively long):
(Event 1)
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
ACTION=add
DEVNAME=/dev/nvme0
…
SUBSYSTEM=nvme
(Event 2)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
ACTION=bind
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0
DRIVER=nvme
…
SUBSYSTEM=pci
(Event 3)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
ACTION=add
DEVPATH=/devices/virtual/bdi/259:0
...
SUBSYSTEM=bdi
(Event 4)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
ACTION=add
DEVNAME=/dev/nvme0n1
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1
DEVTYPE=disk
...
SUBSYSTEM=block
3. My udev rule triggers on (Event 2) above, the bind action. Upon this action, my udev rule appends operations to the special udev RUN variable so that udev essentially mirrors what SPDK's scripts/setup.sh does: unbind from the nvme driver and bind to, in my case, the vfio-pci driver (a rough sketch of that rebind sequence follows these numbered points).
4. With my new udev rules in place, I was successful in getting specific NVMe controllers (based on bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF 0000:40:00.0, for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver's probe() was still running on this controller, and perhaps the unbind request came in before probe() completed, hence this "nvme1: failed to mark controller live". This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should try to trigger on the "last" udev event, an "add", where the NVMe namespaces are instantiated. Of course, I'd need to know ahead of time how many namespaces exist on that controller so that I'd trigger on the last one. I'm wondering if that may help to avoid what looks like a complaint in the middle of probe() for that particular controller. Then again, maybe I can just safely ignore that and not worry about it at all? Thoughts?
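For reference, here is a rough sketch (in C, standing in for the shell commands a udev RUN helper would issue) of the sysfs rebind sequence mentioned in point 3, using the kernel's driver_override mechanism. The BDF is just the example controller from the dmesg output above, and vfio-pci is assumed to be loaded already:

#include <stdio.h>

static int
write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f == NULL) {
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int
main(void)
{
	const char *bdf = "0000:40:00.0";
	char path[256];

	/* Unbind from whatever driver currently owns the device (nvme). */
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", bdf);
	write_sysfs(path, bdf);

	/* Ask the PCI core to bind this device to vfio-pci only. */
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver_override", bdf);
	write_sysfs(path, "vfio-pci");

	/* Trigger a (re)probe so vfio-pci picks the device up. */
	write_sysfs("/sys/bus/pci/drivers_probe", bdf);
	return 0;
}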
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
--
Lance Hartmann
lance.hartmann(a)oracle.com
2 years, 6 months
Chandler Build Pool Test Failures
by Howell, Seth
Hi all,
There has been a rash of failures on the test pool starting last night. I was able to root-cause the failures to a point in the NVMe-oF shutdown tests. The substance of the failure is that QAT and the DPDK framework don't always play well with secondary DPDK processes. In the interest of avoiding these failures on future builds, please rebase your changes on the following patch series, which fixes this by no longer running bdevperf as a secondary process in the NVMe-oF shutdown tests.
https://review.gerrithub.io/c/spdk/spdk/+/435937/6
Thanks,
Seth Howell
2 years, 7 months
Topic from last week's community meeting
by Luse, Paul E
Hi Shuhei,
I was out of town last week and missed the meeting but saw on Trello you had the topic below:
"a few idea: log structured data store , data store with compression, and metadata replication of Blobstore"
I'd be pretty interested in working on these with you, or at least hearing more about them. When you get a chance (no hurry), could you please expand a little on how the conversation went and what you're looking at specifically?
Thanks!
Paul
2 years, 7 months
Add py-spdk client for SPDK
by We We
Hi, all
I have submitted the py-spdk code at https://review.gerrithub.io/#/c/379741/; please take some time to review it. I would be very grateful.
py-spdk is a client that helps an upper-level application communicate with an SPDK-based application (such as nvmf_tgt, vhost, iscsi_tgt, etc.). Should I submit it to a separate repository that I create, rather than the SPDK repo? I think it is a relatively independent kit built on top of SPDK.
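For context, here is a minimal sketch of what such a client does under the hood: open the SPDK application's JSON-RPC socket and exchange one request/response. The socket path and method name below are assumptions for illustration only; adjust them to your target's RPC configuration.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int
main(void)
{
	const char *sock_path = "/var/tmp/spdk.sock";   /* assumed RPC socket path */
	const char *req =
	    "{\"jsonrpc\":\"2.0\",\"method\":\"get_rpc_methods\",\"id\":1}";
	char resp[4096];
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	ssize_t n;

	strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
		perror("connect");
		return 1;
	}
	write(fd, req, strlen(req));            /* send one JSON-RPC request */
	n = read(fd, resp, sizeof(resp) - 1);   /* read the JSON reply */
	if (n > 0) {
		resp[n] = '\0';
		printf("%s\n", resp);
	}
	close(fd);
	return 0;
}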
If you have some thoughts about the py-spdk, please share with me.
Regards,
Helloway
2 years, 7 months
nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
by JD Zheng
Hello,
When I run nvmf_tgt over RDMA using the latest SPDK code, I occasionally run into this error:
"rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"
After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check whether a data buffer sits on two 2 MB pages, and if that is the case, it reports this error.
The following commit changed the calculation to use the data buffer start address when computing the size between the buffer start address and the 2 MB boundary. The caller, nvmf_rdma_fill_buffers(), compares that size with the IO unit size (8 KB in my conf) to determine whether the buffer crosses a 2 MB boundary.
commit 37b7a308941b996f0e69049358a6119ed90d70a2
Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
Date: Tue Nov 13 17:43:46 2018 +0100
memory: fix contiguous memory calculation for unaligned buffers
In nvmf_tgt, the buffers are pre-allocated in a memory pool; a new request takes a free buffer from that pool, and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are neither 2 MB aligned nor IOUnitSize aligned (8 KB in my case); instead, they are 64-byte aligned, so some buffers fail the check, which leads to this problem.
The corresponding code snippet is as follows:
spdk_nvmf_transport_create()
{
	...
	transport->data_buf_pool = spdk_mempool_create(spdk_mempool_name,
				       opts->num_shared_buffers,
				       opts->io_unit_size +
				       NVMF_DATA_BUFFER_ALIGNMENT,
				       SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
				       SPDK_ENV_SOCKET_ID_ANY);
	...
}
Also, some debug prints I added show the start addresses of the buffers:
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019258800 0(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x2000192557c0 1(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019252780 2(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x20001924f740 3(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x20001924c700 4(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x2000192496c0 5(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019246680 6(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019243640 7(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019240600 8(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x20001923d5c0 9(32)
...
It looks like either the buffer allocation has an alignment issue or the check is not correct.
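To make the failing check concrete, here is a small stand-alone sketch of the arithmetic: a buffer is flagged when fewer than io_unit_size bytes remain before the next 2 MB boundary (the helper name below is illustrative, not the actual SPDK function):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HUGEPAGE_SIZE 0x200000ULL /* 2 MB */

/* Bytes left in the 2 MB page that contains vaddr. */
static uint64_t
bytes_to_2mb_boundary(uint64_t vaddr)
{
	return HUGEPAGE_SIZE - (vaddr & (HUGEPAGE_SIZE - 1));
}

int
main(void)
{
	uint64_t io_unit_size = 8192;          /* 8 KB, as in my conf */
	uint64_t buf = 0x2000192557c0ULL;      /* a 64-byte aligned buffer from the log */
	uint64_t remaining = bytes_to_2mb_boundary(buf);

	/* Mirrors the comparison made after spdk_mem_map_translate(): if fewer
	 * than io_unit_size bytes remain, the buffer spans two memory regions. */
	bool split = remaining < io_unit_size;

	printf("remaining = 0x%" PRIx64 ", split = %s\n", remaining, split ? "yes" : "no");
	return 0;
}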
Please advise on how to fix this problem.
Thanks,
JD Zheng
2 years, 9 months
Sharing namespaces between subsystems
by Shahar Salzman
Hi,
In order to support native NVMe multipathing on Linux, I am looking at the specification's requirements regarding controller IDs.
Our system is a distributed system exposing the same logical devices through multiple physical hosts, each running its own SPDK instance.
Looking at the code (v19.04), I see that controller IDs are generated in the 0-0xFFF0 range and verified to be unique within the subsystem before the value is returned to the host. This method of serially generating the controller ID means that different nodes will probably produce the same controller ID, which means that the host may identify a new controller as one which already exists.
This means I need to either limit the controller ID range per SPDK instance and remain spec-aligned, or expose a different subsystem per physical host, which solves the controller ID issue but does not conform to the spec...
I looked at the namespace ID section in NVMe 1.4, and there doesn't seem to be any mention of worldwide uniqueness, so it seems that the correct implementation would be to limit the controller ID range. Would an API to limit the controller ID range in SPDK be acceptable?
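To illustrate the kind of API I have in mind, here is a rough sketch; none of these names exist in SPDK today, they just show each instance allocating from its own disjoint slice of the cntlid space:

#include <stdint.h>

struct cntlid_range {
	uint16_t min;	/* first cntlid this instance may hand out */
	uint16_t max;	/* last cntlid this instance may hand out */
	uint16_t next;	/* next candidate */
};

/* e.g. instance 0: {0x0001, 0x0FFF}, instance 1: {0x1000, 0x1FFF}, ... */
static uint16_t
cntlid_alloc(struct cntlid_range *r)
{
	uint16_t cntlid = r->next;

	r->next = (r->next >= r->max) ? r->min : (uint16_t)(r->next + 1);
	/* The existing per-subsystem uniqueness check would still run on top. */
	return cntlid;
}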
Do you know of any work being done on namespace sharing between subsystems, and on worldwide-unique namespace IDs?
Shahar
2 years, 9 months
replay trace with bdev fio plugin
by Chang, Chun-kai
Hi all,
Does the bdev fio plugin support replaying an fio trace with the --read_iolog flag?
I encountered the following runtime error when using this feature:
fio-3.3
Starting 1 thread
fio: pid=9537, err=12/file:memory.c:333, func=iomem allocation, error=Cannot allocate memory
Segmentation fault
The trace file I tried to replay was generated by running the fio plugin with the --write_iolog flag and with <spdk dir>/examples/bdev/fio_plugin/example_config.fio
The target bdev is an NVMe drive, which is specified in bdev.conf.in as follows:
[Nvme]
TransportID "trtype:PCIe traddr:0000:0b:00.0" Nvme0
Trace replay works if I directly use fio v3.3 without the plugin.
I wonder if this is a limitation of the plugin. If so, how can I modify it to enable this feature?
Thank you,
Chun-Kai
2 years, 9 months
[Release] v19.07: NVMe-oF FC Transport, VMD, NVMe-oF Persistent reservations, Bdev I/O with separate metadata
by Zawadzki, Tomasz
On behalf of the SPDK community I'm pleased to announce the release of SPDK 19.07!
This release contains the following features:
- NVMe-oF FC Transport: A Fibre Channel transport that supports Broadcom HBAs has been added. This feature should be considered experimental.
- VMD: Added Intel Volume Management Device (VMD) driver. VMD is an integrated controller inside the CPU PCIe root complex. It enables virtual HBAs for the connected NVMe SSDs. This feature should be considered experimental.
- NVMe-oF Persistent reservation: Persistent reservation emulation has been added to the NVMe-oF target. Persistent reservation state is stored in a JSON file on the local filesystem between target restarts.
- Bdev: Added bdev layer functions allowing for I/O with metadata transferred in a separate buffer (a short sketch follows this list).
- Compression bdev: This feature should no longer be considered experimental.
- VPP: Added support for VPP 19.04.2. Socket abstraction layer now uses VPP session API.
- OCF: Added support for OCF 19.3.2. Added Write-Back mode support, persistent cache metadata, and caching of multiple devices on a single cache device.
- DPDK: Added support for DPDK 19.05. By default, SPDK will now rely on upstream DPDK's rte_vhost instead of its fork located inside SPDK repo.
- SCSI: A security vulnerability has been identified and fixed in SPDK Vhost-SCSI and iSCSI targets. A malicious client (e.g. a virtual machine or an iSCSI initiator) could send a carefully prepared, invalid I/O request to crash the entire SPDK process. All users of SPDK vhost and iSCSI targets are strongly recommended to update. All SPDK versions < 19.07 are affected.
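As a quick illustration of the separate-metadata bdev I/O mentioned above, here is a minimal sketch assuming the new write path follows the existing spdk_bdev_write_blocks() pattern via spdk_bdev_write_blocks_with_md(); the descriptor, channel and buffers are assumed to be set up elsewhere:

#include "spdk/bdev.h"

static void
write_md_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
	/* Completion runs on the thread that submitted the I/O. */
	spdk_bdev_free_io(bdev_io);
}

static int
write_one_block_with_md(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
			void *data_buf, void *md_buf)
{
	/* Data and metadata travel in separate buffers; write one block at LBA 0. */
	return spdk_bdev_write_blocks_with_md(desc, ch, data_buf, md_buf,
					      0 /* offset_blocks */, 1 /* num_blocks */,
					      write_md_done, NULL);
}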
The full changelog for this release is available at:
https://github.com/spdk/spdk/releases/tag/v19.07
This release contains 1192 commits from 56 authors with over 61k lines of code changed.
We'd especially like to say thank you to all of our first time contributors:
Amelia Blachuciak
Mike Carlin
Or Gerlitz
Jesse Grodman
Jacek Kalwas
Mateusz Kozlowski
Alexey Marchuk
Subash Rajaa
Orden Smith
Anil Veerabhadrappa
Maciej Wawryk
Huiming Xie
Tianyu Yang
Niu Yawei
Thanks to everyone for your contributions, participation, and effort!
Thanks,
Tomek
2 years, 9 months