[RFC 0/2] virtio-pmem: Asynchronous flush
by Pankaj Gupta
Jeff reported preflush order issue with the existing implementation
of virtio pmem preflush. Dan suggested[1] to implement asynchronous flush
for virtio pmem using work queue as done in md/RAID. This patch series
intends to solve the preflush ordering issue and also makes the flush
asynchronous from the submitting thread POV.
Submitting this patch series for feeback and is in WIP. I have
done basic testing and currently doing more testing.
Pankaj Gupta (2):
pmem: make nvdimm_flush asynchronous
virtio_pmem: Async virtio-pmem flush
drivers/nvdimm/nd_virtio.c | 66 ++++++++++++++++++++++++++----------
drivers/nvdimm/pmem.c | 15 ++++----
drivers/nvdimm/region_devs.c | 3 +-
drivers/nvdimm/virtio_pmem.c | 9 +++++
drivers/nvdimm/virtio_pmem.h | 12 +++++++
5 files changed, 78 insertions(+), 27 deletions(-)
[1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2
--
2.20.1
1 year, 3 months
[PATCH v2] nvdimm: Avoid race between probe and reading device attributes
by Richard Palethorpe
It is possible to cause a division error and use-after-free by querying the
nmem device before the driver data is fully initialised in nvdimm_probe. E.g
by doing
(while true; do
cat /sys/bus/nd/devices/nmem*/available_slots 2>&1 > /dev/null
done) &
while true; do
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/bind
done
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/unbind
done
done
On 5.7-rc3 this causes:
[ 12.711578] divide error: 0000 [#1] SMP KASAN PTI
[ 12.712321] CPU: 0 PID: 231 Comm: cat Not tainted 5.7.0-rc3 #48
[ 12.713188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.714857] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.715772] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.718311] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.719030] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.720005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.720970] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.721889] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.722744] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.723602] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.724600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.725308] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.726268] Call Trace:
[ 12.726633] available_slots_show+0x4e/0x120 [libnvdimm]
[ 12.727380] dev_attr_show+0x42/0x80
[ 12.727891] ? memset+0x20/0x40
[ 12.728341] sysfs_kf_seq_show+0x218/0x410
[ 12.728923] seq_read+0x389/0xe10
[ 12.729415] vfs_read+0x101/0x2d0
[ 12.729891] ksys_read+0xf9/0x1d0
[ 12.730361] ? kernel_write+0x120/0x120
[ 12.730915] do_syscall_64+0x95/0x4a0
[ 12.731435] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.732163] RIP: 0033:0x7fd16f2fe4be
[ 12.732685] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 2e 12 0a 00 e8 69 e9 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[ 12.735207] RSP: 002b:00007ffd3177b838 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 12.736261] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fd16f2fe4be
[ 12.737233] RDX: 0000000000020000 RSI: 00007fd16f1ed000 RDI: 0000000000000003
[ 12.738203] RBP: 00007fd16f1ed000 R08: 00007fd16f1ec010 R09: 0000000000000000
[ 12.739172] R10: 00007fd16f3f4f70 R11: 0000000000000246 R12: 00007ffd3177ce23
[ 12.740144] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[ 12.741139] Modules linked in: nfit libnvdimm
[ 12.741783] ---[ end trace 99532e4b82410044 ]---
[ 12.742452] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.743167] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.745709] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.746340] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.747209] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.748081] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.748977] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.749849] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.750729] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.751708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.752441] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.821357] ==================================================================
[ 12.822284] BUG: KASAN: use-after-free in __mutex_lock+0x111c/0x11a0
[ 12.823084] Read of size 4 at addr ffff888065c26238 by task reproducer/218
[ 12.823968]
[ 12.824183] CPU: 2 PID: 218 Comm: reproducer Tainted: G D 5.7.0-rc3 #48
[ 12.825167] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.826595] Call Trace:
[ 12.826926] dump_stack+0x97/0xe0
[ 12.827362] print_address_description.constprop.0+0x1b/0x210
[ 12.828111] ? __mutex_lock+0x111c/0x11a0
[ 12.828645] __kasan_report.cold+0x37/0x92
[ 12.829179] ? __mutex_lock+0x111c/0x11a0
[ 12.829706] kasan_report+0x38/0x50
[ 12.830158] __mutex_lock+0x111c/0x11a0
[ 12.830666] ? ftrace_graph_stop+0x10/0x10
[ 12.831193] ? is_nvdimm_bus+0x40/0x40 [libnvdimm]
[ 12.831820] ? mutex_trylock+0x2b0/0x2b0
[ 12.832333] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.832975] ? mutex_trylock+0x2b0/0x2b0
[ 12.833500] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.834122] ? prepare_ftrace_return+0xa1/0xf0
[ 12.834724] ? ftrace_graph_caller+0x6b/0xa0
[ 12.835269] ? acpi_label_write+0x390/0x390 [nfit]
[ 12.835909] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.836558] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837179] nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837802] nvdimm_bus_probe+0x110/0x6b0 [libnvdimm]
[ 12.838470] really_probe+0x212/0x9a0
[ 12.838954] driver_probe_device+0x1cd/0x300
[ 12.839511] ? driver_probe_device+0x5/0x300
[ 12.840063] device_driver_attach+0xe7/0x120
[ 12.840623] bind_store+0x18d/0x230
[ 12.841075] kernfs_fop_write+0x200/0x420
[ 12.841606] vfs_write+0x154/0x450
[ 12.842047] ksys_write+0xf9/0x1d0
[ 12.842497] ? __ia32_sys_read+0xb0/0xb0
[ 12.843010] do_syscall_64+0x95/0x4a0
[ 12.843495] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.844140] RIP: 0033:0x7f5b235d3563
[ 12.844607] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[ 12.846877] RSP: 002b:00007fff1c3bc578 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 12.847822] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f5b235d3563
[ 12.848717] RDX: 0000000000000006 RSI: 000055f9576710d0 RDI: 0000000000000001
[ 12.849594] RBP: 000055f9576710d0 R08: 000000000000000a R09: 0000000000000000
[ 12.850470] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
[ 12.851333] R13: 00007f5b236a3500 R14: 0000000000000006 R15: 00007f5b236a3700
[ 12.852247]
[ 12.852466] Allocated by task 225:
[ 12.852893] save_stack+0x1b/0x40
[ 12.853310] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 12.853918] kmem_cache_alloc_node+0xef/0x270
[ 12.854475] copy_process+0x485/0x6130
[ 12.854945] _do_fork+0xf1/0xb40
[ 12.855353] __do_sys_clone+0xc3/0x100
[ 12.855843] do_syscall_64+0x95/0x4a0
[ 12.856302] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.856939]
[ 12.857140] Freed by task 0:
[ 12.857522] save_stack+0x1b/0x40
[ 12.857940] __kasan_slab_free+0x12c/0x170
[ 12.858464] kmem_cache_free+0xb0/0x330
[ 12.858945] rcu_core+0x55f/0x19f0
[ 12.859385] __do_softirq+0x228/0x944
[ 12.859869]
[ 12.860075] The buggy address belongs to the object at ffff888065c26200
[ 12.860075] which belongs to the cache task_struct of size 6016
[ 12.861638] The buggy address is located 56 bytes inside of
[ 12.861638] 6016-byte region [ffff888065c26200, ffff888065c27980)
[ 12.863084] The buggy address belongs to the page:
[ 12.863702] page:ffffea0001970800 refcount:1 mapcount:0 mapping:0000000021ee3712 index:0x0 head:ffffea0001970800 order:3 compound_mapcount:0 compound_pincount:0
[ 12.865478] flags: 0x80000000010200(slab|head)
[ 12.866039] raw: 0080000000010200 0000000000000000 0000000100000001 ffff888066c0f980
[ 12.867010] raw: 0000000000000000 0000000080050005 00000001ffffffff 0000000000000000
[ 12.867986] page dumped because: kasan: bad access detected
[ 12.868696]
[ 12.868900] Memory state around the buggy address:
[ 12.869514] ffff888065c26100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.870414] ffff888065c26180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.871318] >ffff888065c26200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.872238] ^
[ 12.872870] ffff888065c26280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.873754] ffff888065c26300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.874640]
==================================================================
This can be prevented by setting the driver data after initialisation is
complete.
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure")
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-kernel(a)vger.kernel.org
Cc: Coly Li <colyli(a)suse.com>
Signed-off-by: Richard Palethorpe <rpalethorpe(a)suse.com>
---
V2:
+ Reviewed by Coly and removed unecessary lock
drivers/nvdimm/dimm.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 7d4ddc4d9322..3d3988e1d9a0 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -43,7 +43,6 @@ static int nvdimm_probe(struct device *dev)
if (!ndd)
return -ENOMEM;
- dev_set_drvdata(dev, ndd);
ndd->dpa.name = dev_name(dev);
ndd->ns_current = -1;
ndd->ns_next = -1;
@@ -106,6 +105,8 @@ static int nvdimm_probe(struct device *dev)
if (rc)
goto err;
+ dev_set_drvdata(dev, ndd);
+
return 0;
err:
--
2.26.2
1 year, 4 months
[PATCH 1/1] ndctl/namespace: Fix disable-namespace accounting relative to seed devices
by Redhairer Li
Seed namespaces are included in "ndctl disable-namespace all". However
since the user never "creates" them it is surprising to see
"disable-namespace" report 1 more namespace relative to the number that
have been created. Catch attempts to disable a zero-sized namespace:
Before:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
{
"dev":"namespace1.2",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.2"
}
disabled 4 namespaces
After:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.3",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.3"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
disabled 3 namespaces
Signed-off-by: Redhairer Li <redhairer.li(a)intel.com>
---
ndctl/lib/libndctl.c | 11 ++++++++---
ndctl/region.c | 4 +++-
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index ee737cb..49f362b 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -4231,6 +4231,7 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
const char *bdev = NULL;
char path[50];
int fd;
+ unsigned long long size = ndctl_namespace_get_size(ndns);
if (pfn && ndctl_pfn_is_enabled(pfn))
bdev = ndctl_pfn_get_block_device(pfn);
@@ -4260,9 +4261,13 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
devname, bdev, strerror(errno));
return -errno;
}
- } else
- ndctl_namespace_disable_invalidate(ndns);
-
+ } else {
+ if (size == 0)
+ /* Don't try to disable idle namespace (no capacity allocated) */
+ return -ENXIO;
+ else
+ ndctl_namespace_disable_invalidate(ndns);
+ }
return 0;
}
diff --git a/ndctl/region.c b/ndctl/region.c
index 7945007..0014bb9 100644
--- a/ndctl/region.c
+++ b/ndctl/region.c
@@ -72,6 +72,7 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
{
struct ndctl_namespace *ndns;
int rc = 0;
+ unsigned long long size;
switch (mode) {
case ACTION_ENABLE:
@@ -80,7 +81,8 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
case ACTION_DISABLE:
ndctl_namespace_foreach(region, ndns) {
rc = ndctl_namespace_disable_safe(ndns);
- if (rc)
+ size = ndctl_namespace_get_size(ndns);
+ if (rc && size != 0)
return rc;
}
rc = ndctl_region_disable_invalidate(region);
--
2.20.1.windows.1
1 year, 5 months
[PATCH ndctl v1 0/8] daxctl: Add device align and range mapping allocation
by Joao Martins
Hey,
This series builds on top of this one[0] and does the following improvements
to the Soft-Reserved subdivision:
1) Support for {create,reconfigure}-device for selecting @align (hugepage size).
Here we add a '-a|--align 4K|2M|1G' option to the existing commands;
2) Listing improvements for device alignment and mappings;
Note: Perhaps it is better to hide the mappings by default, and only
print with -v|--verbose. This would align with ndctl, as the mappings
info can be quite large.
3) Allow creating devices from selecting ranges. This allows to keep the
same GPA->HPA mapping as before we kexec the hypervisor with running guests:
daxctl list -d dax0.1 > /var/log/dax0.1.json
kexec -d -l bzImage
systemctl kexec
daxctl create -u --restore /var/log/dax0.1.json
The JSON was what I though it would be easier for an user, given that it is
the data format daxctl outputs. Alternatives could be adding multiple:
--mapping <pgoff>:<start>-<end>
But that could end up in a gigantic line and a little more
unmanageable I think.
This series requires this series[0] on top of Dan's patches[1]:
[0] https://lore.kernel.org/linux-nvdimm/20200716172913.19658-1-joao.m.martin...
[1] https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
The only TODO here is docs and improving tests to validate mappings, and test
the restore path.
Suggestions/comments are welcome.
Joao
Joao Martins (8):
daxctl: add daxctl_dev_{get,set}_align()
util/json: Print device align
daxctl: add align support in reconfigure-device
daxctl: add align support in create-device
libdaxctl: add mapping iterator APIs
daxctl: include mappings when listing
libdaxctl: add daxctl_dev_set_mapping()
daxctl: Allow restore devices from JSON metadata
daxctl/device.c | 154 +++++++++++++++++++++++++++++++++++++++--
daxctl/lib/libdaxctl-private.h | 9 +++
daxctl/lib/libdaxctl.c | 152 +++++++++++++++++++++++++++++++++++++++-
daxctl/lib/libdaxctl.sym | 9 +++
daxctl/libdaxctl.h | 16 +++++
util/json.c | 63 ++++++++++++++++-
util/json.h | 3 +
7 files changed, 396 insertions(+), 10 deletions(-)
--
1.8.3.1
1 year, 6 months
[PATCH ndctl v2 00/10] daxctl: Support for sub-dividing soft-reserved regions
by Joao Martins
Changes since v1:
* Add a Documentation/daxctl/ entry for each patch that adds commands or new
option.
* Fix functional test suite to only change region 0 and not touch others
* Fix reconfigure-device -s changes (third patch) for better bisection.
v1: https://lore.kernel.org/linux-nvdimm/20200403205900.18035-1-joao.m.martin...
---
This series introduces the daxctl support for sub-dividing soft-reserved
regions created by EFI/HMAT/efi_fake_mem. It's the userspace counterpart
of this recent patch series [0].
These new 'dynamic' regions can be partitioned into multiple different devices
which its subdivisions can consist of one or more ranges. This
is in contrast to static dax regions -- created with ndctl-create-namespace
-m devdax -- which can't be subdivided neither discontiguous.
See also cover-letter of [0].
The daxctl changes in these patches are depicted as:
* {create,destroy,disable,enable}-device:
These orchestrate/manage the sub-division devices.
It mimmics the same as namespaces equivalent commands.
* Allow reconfigure-device to change the size of an existing *dynamic* dax
device.
* Add test coverage (Tried to cover all range allocation code paths).
v2 of kernel patches now passes this test suite.
* Documentation regarding the new command additions.
[0] "device-dax: Support sub-dividing soft-reserved ranges",
https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
Dan Williams (1):
daxctl: Cleanup whitespace
Joao Martins (9):
libdaxctl: add daxctl_dev_set_size()
daxctl: add resize support in reconfigure-device
daxctl: add command to disable devdax device
daxctl: add command to enable devdax device
libdaxctl: add daxctl_region_create_dev()
daxctl: add command to create device
libdaxctl: add daxctl_region_destroy_dev()
daxctl: add command to destroy device
daxctl/test: Add tests for dynamic dax regions
Documentation/daxctl/Makefile.am | 6 +-
Documentation/daxctl/daxctl-create-device.txt | 105 +++++++
Documentation/daxctl/daxctl-destroy-device.txt | 63 +++++
Documentation/daxctl/daxctl-disable-device.txt | 58 ++++
Documentation/daxctl/daxctl-enable-device.txt | 59 ++++
Documentation/daxctl/daxctl-reconfigure-device.txt | 16 ++
daxctl/builtin.h | 4 +
daxctl/daxctl.c | 4 +
daxctl/device.c | 310 ++++++++++++++++++++-
daxctl/lib/libdaxctl.c | 67 +++++
daxctl/lib/libdaxctl.sym | 7 +
daxctl/libdaxctl.h | 3 +
test/Makefile.am | 1 +
test/daxctl-create.sh | 294 +++++++++++++++++++
util/filter.c | 2 +-
15 files changed, 993 insertions(+), 6 deletions(-)
create mode 100644 Documentation/daxctl/daxctl-create-device.txt
create mode 100644 Documentation/daxctl/daxctl-destroy-device.txt
create mode 100644 Documentation/daxctl/daxctl-disable-device.txt
create mode 100644 Documentation/daxctl/daxctl-enable-device.txt
create mode 100755 test/daxctl-create.sh
--
1.8.3.1
1 year, 6 months
Feedback requested: Exposing NVDIMM performance statistics in a generic way
by Vaibhav Jain
Hello,
I am looking for some community feedback on these two Problem-statements:
1.How to expose NVDIMM performance statistics in an arch or nvdimm vendor
agnostic manner ?
2. Is there a common set of performance statistics for NVDIMMs that all
vendors should provide ?
Problem context
===============
While working on bring up of PAPR SCM based NVDIMMs[1] for arch/powerpc
we want to expose certain dimm performance statistics like "Media
Read/Write Counts", "Power-on Seconds" etc to user-space [2]. These
performance statistics are similar to what ipmctl[3] reports for Intel®
Optane™ persistent memory via the '-show performance' command line
arg. However the reported set of performance stats doesn't cover the
entirety of all performance stats supported by PAPR SCM based NVDimms.
For example here is a subset of performance stats which are specific to
PAPR SCM NVDimms and that not reported by ipmctl:
* Controller Reset Count
* Controller Reset Elapsed Time
* Power-on Seconds
* Cache Read Hit Count
* Cache Write Hit Count
Possibility of updating ipmctl to add support for these performance
statistics is greatly hampered by no support for ACPI on Powerpc
arch. Secondly vendors who dont support ACPI/NFIT command set
similar to Intel® Optane™ (Example MSFT) are also left out in
lurch. Problem-statement#1 points to this specific problem.
Additionally in absence of any pre-agreed set of performance statistics
which all vendors should support, adding support for such a
functionality in ipmctl may not bode well of other nvdimm vendors. For
example if support for reporting "Controller Reset Count" is added to
ipmctl then it may not be applicable to other vendors such as Intel®
Optane™. This issue is what Problem-statement#2 refers to.
Possible Solution for Problem#1
===============================
One possible solution to Problem#1 can to add support for reporting
NVDIMM performance statistics in 'ndtcl'. 'libndctl' already has a layer
that abstracts underlying NVDIMM vendors (via struct ndctl_dimm_ops),
making supporting different NVDIMM vendors fairly easy. Also ndctl is
more widely used compared to 'ipmctl', hence adding such a functionality
to ndctl would make it more widely used.
Above solution was implemented as RFC patch-set[2] that exposes these
performance statistics through a generic abstraction in libndctl and
added a presentation layer for this data in ndctl[4]. It added a new
command line flags '--stat' to ndctl to report *all* nvdimm vendor
reported performance stats. The output is similar to one below:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Power-on Seconds":603931,
"Media Read Count":0,
"Media Write Count":6313,
}
}
]
This was done by adding two new dimm-ops callbacks that were
implemented by the papr_scm implementation within libndctl. These
callbacks are invoked by newly introduce code in 'util/json-smart.c'
that format the returned stats from these new dimm-ops and transform
them into a json-object to later presentation. I would request you to
look at RFC patch-set[2] to understand the implementation details.
Possibled Solution for Problem#2
================================
Solution to Problem-statement#2 is what eludes me though. If there is a
minimal set of performance stats (similar to what ndctl enforces for
health-stats) then implementation of such a functionality in
ndctl/ipmctl would be easy to implement. But is it really possible to
have such a common set of performance stats that NVDIMM vendors can
expose.
Patch-set[2] though tries to bypass this problem by letting the vendor
descide which performance stats to expose. This opens up a possibility
of this functionality to abused by dimm vendors to reports arbirary data
through this flag that may not be performance-stats.
Summing-up
==========
In light of above, requesting your feedback as to how
problem-statements#{1, 2} can be addressed within ndctl subsystem. Also
are these problems even worth solving.
References
==========
[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/papr_...
[2] "[ndctl RFC-PATCH 0/4] Add support for reporting PAPR NVDIMM
Statistics"
https://lore.kernel.org/linux-nvdimm/20200518110814.145644-1-vaibhav@linu...
[3] https://docs.pmem.io/ipmctl-user-guide/instrumentation/show-device-perfor...
[4] "[RFC-PATCH 1/4] ndctl,libndctl: Implement new dimm-ops 'new_stats'
and 'get_stat'"
https://lore.kernel.org/linux-nvdimm/20200514225258.508463-2-vaibhav@linu...
Thanks,
~ Vaibhav
1 year, 7 months
[RFC PATCH 0/4] powerpc/papr_scm: Add support for reporting NVDIMM performance statistics
by Vaibhav Jain
The patch-set proposes to add support for fetching and reporting
performance statistics for PAPR compliant NVDIMMs as described in
documentation for H_SCM_PERFORMANCE_STATS hcall Ref[1]. The patch-set
also implements mechanisms to expose NVDIMM performance stats via
sysfs and newly introduced PDSMs[2] for libndctl.
This patch-set combined with corresponding ndctl and libndctl changes
proposed at Ref[3] should enable user to fetch PAPR compliant NVDIMMs
using following command:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Controller Reset Count":2,
"Controller Reset Elapsed Time":603331,
"Power-on Seconds":603931,
"Life Remaining":"100%",
"Critical Resource Utilization":"0%",
"Host Load Count":5781028,
"Host Store Count":8966800,
"Host Load Duration":975895365,
"Host Store Duration":716230690,
"Media Read Count":0,
"Media Write Count":6313,
"Media Read Duration":0,
"Media Write Duration":9679615,
"Cache Read Hit Count":5781028,
"Cache Write Hit Count":8442479,
"Fast Write Count":8969912
}
}
]
The patchset is dependent on existing patch-set "[PATCH v7 0/5]
powerpc/papr_scm: Add support for reporting nvdimm health" available
at Ref[2] that adds support for reporting PAPR compliant NVDIMMs in
'papr_scm' kernel module.
Structure of the patch-set
==========================
The patch-set starts with implementing functionality in papr_scm
module to issue H_SCM_PERFORMANCE_STATS hcall, fetch & parse dimm
performance stats and exposing them as a PAPR specific libnvdimm
attribute named 'perf_stats'
Patch-2 introduces a new PDSM named FETCH_PERF_STATS that can be
issued by libndctl asking papr_scm to issue the
H_SCM_PERFORMANCE_STATS hcall using helpers introduced earlier and
storing the results in a dimm specific perf-stats-buffer.
Patch-3 introduces a new PDSM named READ_PERF_STATS that can be
issued by libndctl to read the perf-stats-buffer in an incremental
manner to workaround the 256-bytes envelop limitation of libnvdimm.
Finally Patch-4 introduces a new PDSM named GET_PERF_STAT that can be
issued by libndctl to read values of a specific NVDIMM performance
stat like "Life Remaining".
References
==========
[1] Documentation/powerpc/papr_hcals.rst
[2] https://lore.kernel.org/linux-nvdimm/20200508104922.72565-1-vaibhav@linux...
[3] https://github.com/vaibhav92/ndctl/tree/papr_scm_stats_v1
Vaibhav Jain (4):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for PAPR_SCM_PDSM_FETCH_PERF_STATS
powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_READ_PERF_STATS
powerpc/papr_scm: Add support for PDSM GET_PERF_STAT
Documentation/ABI/testing/sysfs-bus-papr-scm | 27 ++
arch/powerpc/include/uapi/asm/papr_scm_pdsm.h | 60 +++
arch/powerpc/platforms/pseries/papr_scm.c | 391 ++++++++++++++++++
3 files changed, 478 insertions(+)
--
2.26.2
1 year, 8 months
[PATCH 1/2] ndctl/namespace: skip zero namespaces when processing all namespaces.
by Michal Suchanek
The kernel always creates zero length namespace with uuid 0 in each
region.
When processing all namespaces the user gets confusing errors from ndctl
trying to process this namespace. Skip it.
The user can still specify the namespace by name directly in case
processing it is desirable.
Fixes: #41
Reviewed-by: Santosh S <santosh(a)fossix.org>
Tested-by: Harish Sriram <harish(a)linux.ibm.com>
Signed-off-by: Michal Suchanek <msuchanek(a)suse.de>
---
ndctl/namespace.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index e734248c9752..3fabe4799d75 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -2171,9 +2171,19 @@ static int do_xaction_namespace(const char *namespace,
ndctl_namespace_foreach_safe(region, ndns, _n) {
ndns_name = ndctl_namespace_get_devname(ndns);
- if (strcmp(namespace, "all") != 0
- && strcmp(namespace, ndns_name) != 0)
- continue;
+ if (strcmp(namespace, "all") == 0) {
+ static const uuid_t zero_uuid;
+ uuid_t uuid;
+
+ ndctl_namespace_get_uuid(ndns, uuid);
+ if (!ndctl_namespace_get_size(ndns) &&
+ !memcmp(uuid, zero_uuid, sizeof(uuid_t)))
+ continue;
+ } else {
+ if (strcmp(namespace, ndns_name) != 0)
+ continue;
+ }
+
switch (action) {
case ACTION_DISABLE:
rc = ndctl_namespace_disable_safe(ndns);
--
2.26.2
1 year, 10 months
Can we change the S_DAX flag immediately on XFS without dropping
caches?
by Li, Hao
Hi,
I have noticed that we have to drop caches to make the changing of S_DAX
flag take effect after using chattr +x to turn on DAX for a existing
regular file. The related function is xfs_diflags_to_iflags, whose
second parameter determines whether we should set S_DAX immediately.
I can't figure out why we do this. Is this because the page caches in
address_space->i_pages are hard to deal with? I also wonder what will
happen if we set S_DAX unconditionally. Thanks!
Regards,
Hao Li
1 year, 10 months
[RFC 1/1] pmem: Add cond_resched() in bio_for_each_segment loop in pmem_make_request
by Ritesh Harjani
For systems which do not have CONFIG_PREEMPT set and
if there is a heavy multi-threaded load/store operation happening
on pmem + sometimes along with device latencies, softlockup warnings like
this could trigger. This was seen on Power where pagesize is 64K.
To avoid softlockup, this patch adds a cond_resched() in this path.
<...>
watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
<...>
CPU: 31 PID: 15627 <..> 5.3.18-20
<...>
NIP memcpy_power7+0x43c/0x7e0
LR memcpy_flushcache+0x28/0xa0
Call Trace:
memcpy_power7+0x274/0x7e0 (unreliable)
memcpy_flushcache+0x28/0xa0
write_pmem+0xa0/0x100 [nd_pmem]
pmem_do_bvec+0x1f0/0x420 [nd_pmem]
pmem_make_request+0x14c/0x370 [nd_pmem]
generic_make_request+0x164/0x400
submit_bio+0x134/0x2e0
submit_bio_wait+0x70/0xc0
blkdev_issue_zeroout+0xf4/0x2a0
xfs_zero_extent+0x90/0xc0 [xfs]
xfs_bmapi_convert_unwritten+0x198/0x230 [xfs]
xfs_bmapi_write+0x284/0x630 [xfs]
xfs_iomap_write_direct+0x1f0/0x3e0 [xfs]
xfs_file_iomap_begin+0x344/0x690 [xfs]
dax_iomap_pmd_fault+0x488/0xc10
__xfs_filemap_fault+0x26c/0x2b0 [xfs]
__handle_mm_fault+0x794/0x1af0
handle_mm_fault+0x12c/0x220
__do_page_fault+0x290/0xe40
do_page_fault+0x38/0xc0
handle_page_fault+0x10/0x30
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh(a)linux.ibm.com>
---
drivers/nvdimm/pmem.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 2df6994acf83..fcf7af13897e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -214,6 +214,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
bio->bi_status = rc;
break;
}
+ cond_resched();
}
if (do_acct)
nd_iostat_end(bio, start);
--
2.25.4
1 year, 10 months