Hello,
I am trying to run NVMe/TCP performance experiments with the SPDK initiator and target,
similar to the setup described in the SPDK NVMe/TCP performance report below.
https://ci.spdk.io/download/performance-reports/SPDK_tcp_perf_report_2101...
The SPDK initiator and target run on two different machines. The initiator machine
details are at the end of this mail.
I am running the SPDK initiator scaling experiment described in the document to measure
SPDK initiator performance. The IO size is 4K.
I have matched all the settings in the document for TCP tuning and for enabling zero copy
on the target, as well as all the fio settings for the initiator and target. The only
variation is that I am using a null bdev instead of real SSDs for the tests.
I see peak write performance close to 3M IOPS, as shown in the document for the
initiator with fio.
However, read performance is capped at around 1.5M IOPS, and I cannot scale beyond that
even after trying different values for the number of cores/numjobs, iodepth, the number
of TCP connections to the target, the number of subsystem NQNs (cnodes), etc.
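For scale, here is a rough sketch of what the two IOPS figures above mean in payload
bandwidth. It assumes 4K means 4096 bytes and ignores NVMe/TCP PDU headers and TCP/IP
overhead, so the real wire rate would be somewhat higher. The helper name iops_to_gbps
is only for illustration.

```python
def iops_to_gbps(iops: float, io_size_bytes: int = 4096) -> float:
    """Convert an IOPS figure to payload bandwidth in Gbit/s (10^9 bits per second).

    Assumption: counts data payload only, no protocol overhead.
    """
    return iops * io_size_bytes * 8 / 1e9


# Figures taken from this mail: ~3M IOPS for writes, ~1.5M IOPS for reads.
write_gbps = iops_to_gbps(3.0e6)
read_gbps = iops_to_gbps(1.5e6)

print(f"write: {write_gbps:.1f} Gbit/s payload")
print(f"read:  {read_gbps:.1f} Gbit/s payload")
```

This puts the read ceiling at roughly half the payload bandwidth of the write case, which
may be worth comparing against the NIC line rate in each direction.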
From initial debugging with the perf tool, it looks like there are many more L2/L3 cache
misses (thrashing) in the fio read test than in the fio write test. I am not sure whether
this is the only reason for the degraded read performance.
I was wondering whether the SPDK read path touches more data per IO, and whether the
resulting extra cache load and higher latencies are what lead to this.
Could you please shed more light on this?
Also, are there any tunings that would help reach read numbers similar to those reported
in the document for the NVMe/TCP fio initiator test?
Initiator machine details:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 1000.740
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
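In case it helps when reasoning about core placement, the NUMA CPU lists above can be
expanded into explicit CPU ids with a small sketch like the one below. The helper name
expand_cpu_list is hypothetical, and the code assumes the comma/range format that lscpu
prints.

```python
from typing import List


def expand_cpu_list(spec: str) -> List[int]:
    """Expand an lscpu-style CPU list such as "0-17,36-53" into CPU ids."""
    cpus: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "36-53"
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


# CPU lists copied from the lscpu output in this mail.
node0 = expand_cpu_list("0-17,36-53")
node1 = expand_cpu_list("18-35,54-71")

print(len(node0), len(node1))  # 36 36
```

With 2 threads per core, each node here has 18 physical cores and 36 logical CPUs.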
Thanks,
Vishwas