[RFC 0/2] virtio-pmem: Asynchronous flush
by Pankaj Gupta
Jeff reported a preflush ordering issue with the existing implementation
of virtio pmem preflush. Dan suggested[1] implementing asynchronous flush
for virtio pmem using a work queue, as done in md/RAID. This patch series
intends to solve the preflush ordering issue and also makes the flush
asynchronous from the submitting thread's POV.
I am submitting this patch series for feedback; it is still WIP. I have
done basic testing and am currently doing more testing.
Pankaj Gupta (2):
pmem: make nvdimm_flush asynchronous
virtio_pmem: Async virtio-pmem flush
drivers/nvdimm/nd_virtio.c | 66 ++++++++++++++++++++++++++----------
drivers/nvdimm/pmem.c | 15 ++++----
drivers/nvdimm/region_devs.c | 3 +-
drivers/nvdimm/virtio_pmem.c | 9 +++++
drivers/nvdimm/virtio_pmem.h | 12 +++++++
5 files changed, 78 insertions(+), 27 deletions(-)
[1] https://marc.info/?l=linux-kernel&m=157446316409937&w=2
--
2.20.1
[PATCH v2] nvdimm: Avoid race between probe and reading device attributes
by Richard Palethorpe
It is possible to cause a division error and use-after-free by querying the
nmem device before the driver data is fully initialised in nvdimm_probe, e.g.
by running:
(while true; do
cat /sys/bus/nd/devices/nmem*/available_slots 2>&1 > /dev/null
done) &
while true; do
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/bind
done
for i in $(seq 0 4); do
echo nmem$i > /sys/bus/nd/drivers/nvdimm/unbind
done
done
On 5.7-rc3 this causes:
[ 12.711578] divide error: 0000 [#1] SMP KASAN PTI
[ 12.712321] CPU: 0 PID: 231 Comm: cat Not tainted 5.7.0-rc3 #48
[ 12.713188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.714857] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.715772] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.718311] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.719030] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.720005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.720970] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.721889] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.722744] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.723602] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.724600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.725308] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.726268] Call Trace:
[ 12.726633] available_slots_show+0x4e/0x120 [libnvdimm]
[ 12.727380] dev_attr_show+0x42/0x80
[ 12.727891] ? memset+0x20/0x40
[ 12.728341] sysfs_kf_seq_show+0x218/0x410
[ 12.728923] seq_read+0x389/0xe10
[ 12.729415] vfs_read+0x101/0x2d0
[ 12.729891] ksys_read+0xf9/0x1d0
[ 12.730361] ? kernel_write+0x120/0x120
[ 12.730915] do_syscall_64+0x95/0x4a0
[ 12.731435] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.732163] RIP: 0033:0x7fd16f2fe4be
[ 12.732685] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 2e 12 0a 00 e8 69 e9 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[ 12.735207] RSP: 002b:00007ffd3177b838 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 12.736261] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fd16f2fe4be
[ 12.737233] RDX: 0000000000020000 RSI: 00007fd16f1ed000 RDI: 0000000000000003
[ 12.738203] RBP: 00007fd16f1ed000 R08: 00007fd16f1ec010 R09: 0000000000000000
[ 12.739172] R10: 00007fd16f3f4f70 R11: 0000000000000246 R12: 00007ffd3177ce23
[ 12.740144] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[ 12.741139] Modules linked in: nfit libnvdimm
[ 12.741783] ---[ end trace 99532e4b82410044 ]---
[ 12.742452] RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm]
[ 12.743167] Code: ba 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 0f b6 14 11 84 d2 74 05 80 fa 03 7e 52 8b 73 08 31 d2 89 c1 48 83 c4 08 5b 5d <f7> f6 31 d2 41 5c 83 c0 07 c1 e8 03 48 8d 84 00 8e 02 00 00 25 00
[ 12.745709] RSP: 0018:ffffc9000046fd08 EFLAGS: 00010282
[ 12.746340] RAX: 0000000000000000 RBX: ffffffffc0073aa0 RCX: 0000000000000000
[ 12.747209] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888060931808
[ 12.748081] RBP: ffff88806609d018 R08: 0000000000000001 R09: ffffed100cc0a2b1
[ 12.748977] R10: ffff888066051587 R11: ffffed100cc0a2b0 R12: ffff888060931800
[ 12.749849] R13: ffff888064362000 R14: ffff88806609d018 R15: ffffffff8b1a2520
[ 12.750729] FS: 00007fd16f3d5580(0000) GS:ffff88806b400000(0000) knlGS:0000000000000000
[ 12.751708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.752441] CR2: 00007fd16f1ec000 CR3: 0000000064322006 CR4: 0000000000160ef0
[ 12.821357] ==================================================================
[ 12.822284] BUG: KASAN: use-after-free in __mutex_lock+0x111c/0x11a0
[ 12.823084] Read of size 4 at addr ffff888065c26238 by task reproducer/218
[ 12.823968]
[ 12.824183] CPU: 2 PID: 218 Comm: reproducer Tainted: G D 5.7.0-rc3 #48
[ 12.825167] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 12.826595] Call Trace:
[ 12.826926] dump_stack+0x97/0xe0
[ 12.827362] print_address_description.constprop.0+0x1b/0x210
[ 12.828111] ? __mutex_lock+0x111c/0x11a0
[ 12.828645] __kasan_report.cold+0x37/0x92
[ 12.829179] ? __mutex_lock+0x111c/0x11a0
[ 12.829706] kasan_report+0x38/0x50
[ 12.830158] __mutex_lock+0x111c/0x11a0
[ 12.830666] ? ftrace_graph_stop+0x10/0x10
[ 12.831193] ? is_nvdimm_bus+0x40/0x40 [libnvdimm]
[ 12.831820] ? mutex_trylock+0x2b0/0x2b0
[ 12.832333] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.832975] ? mutex_trylock+0x2b0/0x2b0
[ 12.833500] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.834122] ? prepare_ftrace_return+0xa1/0xf0
[ 12.834724] ? ftrace_graph_caller+0x6b/0xa0
[ 12.835269] ? acpi_label_write+0x390/0x390 [nfit]
[ 12.835909] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.836558] ? nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837179] nvdimm_probe+0x259/0x420 [libnvdimm]
[ 12.837802] nvdimm_bus_probe+0x110/0x6b0 [libnvdimm]
[ 12.838470] really_probe+0x212/0x9a0
[ 12.838954] driver_probe_device+0x1cd/0x300
[ 12.839511] ? driver_probe_device+0x5/0x300
[ 12.840063] device_driver_attach+0xe7/0x120
[ 12.840623] bind_store+0x18d/0x230
[ 12.841075] kernfs_fop_write+0x200/0x420
[ 12.841606] vfs_write+0x154/0x450
[ 12.842047] ksys_write+0xf9/0x1d0
[ 12.842497] ? __ia32_sys_read+0xb0/0xb0
[ 12.843010] do_syscall_64+0x95/0x4a0
[ 12.843495] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.844140] RIP: 0033:0x7f5b235d3563
[ 12.844607] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[ 12.846877] RSP: 002b:00007fff1c3bc578 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 12.847822] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f5b235d3563
[ 12.848717] RDX: 0000000000000006 RSI: 000055f9576710d0 RDI: 0000000000000001
[ 12.849594] RBP: 000055f9576710d0 R08: 000000000000000a R09: 0000000000000000
[ 12.850470] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
[ 12.851333] R13: 00007f5b236a3500 R14: 0000000000000006 R15: 00007f5b236a3700
[ 12.852247]
[ 12.852466] Allocated by task 225:
[ 12.852893] save_stack+0x1b/0x40
[ 12.853310] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 12.853918] kmem_cache_alloc_node+0xef/0x270
[ 12.854475] copy_process+0x485/0x6130
[ 12.854945] _do_fork+0xf1/0xb40
[ 12.855353] __do_sys_clone+0xc3/0x100
[ 12.855843] do_syscall_64+0x95/0x4a0
[ 12.856302] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 12.856939]
[ 12.857140] Freed by task 0:
[ 12.857522] save_stack+0x1b/0x40
[ 12.857940] __kasan_slab_free+0x12c/0x170
[ 12.858464] kmem_cache_free+0xb0/0x330
[ 12.858945] rcu_core+0x55f/0x19f0
[ 12.859385] __do_softirq+0x228/0x944
[ 12.859869]
[ 12.860075] The buggy address belongs to the object at ffff888065c26200
[ 12.860075] which belongs to the cache task_struct of size 6016
[ 12.861638] The buggy address is located 56 bytes inside of
[ 12.861638] 6016-byte region [ffff888065c26200, ffff888065c27980)
[ 12.863084] The buggy address belongs to the page:
[ 12.863702] page:ffffea0001970800 refcount:1 mapcount:0 mapping:0000000021ee3712 index:0x0 head:ffffea0001970800 order:3 compound_mapcount:0 compound_pincount:0
[ 12.865478] flags: 0x80000000010200(slab|head)
[ 12.866039] raw: 0080000000010200 0000000000000000 0000000100000001 ffff888066c0f980
[ 12.867010] raw: 0000000000000000 0000000080050005 00000001ffffffff 0000000000000000
[ 12.867986] page dumped because: kasan: bad access detected
[ 12.868696]
[ 12.868900] Memory state around the buggy address:
[ 12.869514] ffff888065c26100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.870414] ffff888065c26180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 12.871318] >ffff888065c26200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.872238] ^
[ 12.872870] ffff888065c26280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.873754] ffff888065c26300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 12.874640]
==================================================================
This can be prevented by setting the driver data after initialisation is
complete.
Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure")
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: linux-nvdimm(a)lists.01.org
Cc: linux-kernel(a)vger.kernel.org
Cc: Coly Li <colyli(a)suse.com>
Signed-off-by: Richard Palethorpe <rpalethorpe(a)suse.com>
---
V2:
+ Reviewed by Coly and removed unnecessary lock
drivers/nvdimm/dimm.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 7d4ddc4d9322..3d3988e1d9a0 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -43,7 +43,6 @@ static int nvdimm_probe(struct device *dev)
 	if (!ndd)
 		return -ENOMEM;
 
-	dev_set_drvdata(dev, ndd);
 	ndd->dpa.name = dev_name(dev);
 	ndd->ns_current = -1;
 	ndd->ns_next = -1;
@@ -106,6 +105,8 @@ static int nvdimm_probe(struct device *dev)
 	if (rc)
 		goto err;
 
+	dev_set_drvdata(dev, ndd);
+
 	return 0;
 
  err:
--
2.26.2
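The ordering change in the patch above can be pictured with a small userspace sketch. This is only an illustration under assumptions: struct drvdata, struct device, probe() and available_slots() are hypothetical stand-ins rather than the libnvdimm types and functions, and C11 atomics stand in for whatever ordering the driver core provides. It models only the division error, not the use-after-free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nvdimm_drvdata; not the kernel type. */
struct drvdata {
	unsigned int config_size;
	unsigned int nslabel_size; /* divisor used by the reader, 0 until set */
};

/* Hypothetical stand-in for struct device and its driver-data pointer. */
struct device {
	_Atomic(struct drvdata *) drvdata;
};

/* Probe side: fill in every field first, publish the pointer last. */
static void probe(struct device *dev, struct drvdata *ndd,
		  unsigned int config_size, unsigned int label_size)
{
	assert(label_size != 0);
	ndd->config_size = config_size;
	ndd->nslabel_size = label_size;
	/* Release ordering makes the stores above visible before the pointer. */
	atomic_store_explicit(&dev->drvdata, ndd, memory_order_release);
}

/* Reader side: returns -1 while the device is not (yet) bound. */
static long available_slots(struct device *dev)
{
	struct drvdata *ndd =
		atomic_load_explicit(&dev->drvdata, memory_order_acquire);

	if (!ndd)
		return -1;
	return ndd->config_size / ndd->nslabel_size;
}
```

Because the pointer is published only after the divisor is set, a reader either sees no driver data at all or sees a fully initialised structure; it never divides by an unset field.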
[PATCH 1/1] ndctl/namespace: Fix disable-namespace accounting relative to seed devices
by Redhairer Li
Seed namespaces are included in "ndctl disable-namespace all". However
since the user never "creates" them it is surprising to see
"disable-namespace" report 1 more namespace relative to the number that
have been created. Catch attempts to disable a zero-sized namespace:
Before:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
{
"dev":"namespace1.2",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.2"
}
disabled 4 namespaces
After:
{
"dev":"namespace1.0",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1"
}
{
"dev":"namespace1.3",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.3"
}
{
"dev":"namespace1.1",
"size":"492.00 MiB (515.90 MB)",
"blockdev":"pmem1.1"
}
disabled 3 namespaces
Signed-off-by: Redhairer Li <redhairer.li(a)intel.com>
---
ndctl/lib/libndctl.c | 11 ++++++++---
ndctl/region.c | 4 +++-
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index ee737cb..49f362b 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -4231,6 +4231,7 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
const char *bdev = NULL;
char path[50];
int fd;
+ unsigned long long size = ndctl_namespace_get_size(ndns);
if (pfn && ndctl_pfn_is_enabled(pfn))
bdev = ndctl_pfn_get_block_device(pfn);
@@ -4260,9 +4261,13 @@ NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
devname, bdev, strerror(errno));
return -errno;
}
- } else
- ndctl_namespace_disable_invalidate(ndns);
-
+ } else {
+ if (size == 0)
+ /* Don't try to disable idle namespace (no capacity allocated) */
+ return -ENXIO;
+ else
+ ndctl_namespace_disable_invalidate(ndns);
+ }
return 0;
}
diff --git a/ndctl/region.c b/ndctl/region.c
index 7945007..0014bb9 100644
--- a/ndctl/region.c
+++ b/ndctl/region.c
@@ -72,6 +72,7 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
{
struct ndctl_namespace *ndns;
int rc = 0;
+ unsigned long long size;
switch (mode) {
case ACTION_ENABLE:
@@ -80,7 +81,8 @@ static int region_action(struct ndctl_region *region, enum device_action mode)
case ACTION_DISABLE:
ndctl_namespace_foreach(region, ndns) {
rc = ndctl_namespace_disable_safe(ndns);
- if (rc)
+ size = ndctl_namespace_get_size(ndns);
+ if (rc && size != 0)
return rc;
}
rc = ndctl_region_disable_invalidate(region);
--
2.20.1.windows.1
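The accounting rule in the patch above can be sketched in isolation. This is a simplified illustration, not libndctl code: struct ns and disable_all() are hypothetical, and the sketch skips seed namespaces directly instead of returning -ENXIO and ignoring it in the caller as the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified namespace record; not the libndctl type. */
struct ns {
	const char *dev;
	unsigned long long size; /* 0 for an idle seed namespace */
	int enabled;
};

/* Disable every namespace with capacity; return how many were disabled. */
static int disable_all(struct ns *list, size_t n)
{
	int disabled = 0;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (list[i].size == 0)
			continue; /* seed: never created by the user, skip it */
		list[i].enabled = 0;
		disabled++;
	}
	return disabled;
}
```

With three 492 MiB namespaces and one zero-sized seed in the list, the function counts three, which matches the "After" output in the commit message.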
[PATCH ndctl v1 0/8] daxctl: Add device align and range mapping allocation
by Joao Martins
Hey,
This series builds on top of this one[0] and makes the following improvements
to the Soft-Reserved subdivision:
1) Support for {create,reconfigure}-device for selecting @align (hugepage size).
Here we add a '-a|--align 4K|2M|1G' option to the existing commands;
2) Listing improvements for device alignment and mappings;
Note: Perhaps it is better to hide the mappings by default, and only
print with -v|--verbose. This would align with ndctl, as the mappings
info can be quite large.
3) Allow creating devices from selected ranges. This allows keeping the
same GPA->HPA mapping as before we kexec the hypervisor with running guests:
daxctl list -d dax0.1 > /var/log/dax0.1.json
kexec -d -l bzImage
systemctl kexec
daxctl create -u --restore /var/log/dax0.1.json
JSON was what I thought would be easiest for a user, given that it is
the data format daxctl outputs. An alternative could be passing multiple:
--mapping <pgoff>:<start>-<end>
But that could end up as a gigantic command line and be a little less
manageable, I think.
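For comparison, parsing the alternative --mapping form could look roughly like the sketch below. It is purely illustrative: the option was only floated as an idea, so struct mapping, parse_mapping(), the decimal pgoff, the hex start/end and the inclusive end address are all assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical mapping record for the proposed "--mapping" argument. */
struct mapping {
	unsigned long long pgoff;
	unsigned long long start;
	unsigned long long end; /* assumed inclusive */
};

/* Parse "<pgoff>:<start>-<end>"; returns 0 on success, -1 on bad input. */
static int parse_mapping(const char *arg, struct mapping *m)
{
	struct mapping tmp;
	char trailing;
	int n;

	assert(arg != NULL && m != NULL);
	/* %llx also accepts an optional 0x prefix; %c catches trailing junk. */
	n = sscanf(arg, "%llu:%llx-%llx%c", &tmp.pgoff, &tmp.start, &tmp.end,
		   &trailing);
	if (n != 3 || tmp.end < tmp.start)
		return -1;
	*m = tmp;
	return 0;
}
```

A device with many ranges would need one such argument per range, which is the manageability concern raised above.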
This series requires this series[0] on top of Dan's patches[1]:
[0] https://lore.kernel.org/linux-nvdimm/20200716172913.19658-1-joao.m.martin...
[1] https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
The only TODO here is docs and improving tests to validate mappings, and test
the restore path.
Suggestions/comments are welcome.
Joao
Joao Martins (8):
daxctl: add daxctl_dev_{get,set}_align()
util/json: Print device align
daxctl: add align support in reconfigure-device
daxctl: add align support in create-device
libdaxctl: add mapping iterator APIs
daxctl: include mappings when listing
libdaxctl: add daxctl_dev_set_mapping()
daxctl: Allow restore devices from JSON metadata
daxctl/device.c | 154 +++++++++++++++++++++++++++++++++++++++--
daxctl/lib/libdaxctl-private.h | 9 +++
daxctl/lib/libdaxctl.c | 152 +++++++++++++++++++++++++++++++++++++++-
daxctl/lib/libdaxctl.sym | 9 +++
daxctl/libdaxctl.h | 16 +++++
util/json.c | 63 ++++++++++++++++-
util/json.h | 3 +
7 files changed, 396 insertions(+), 10 deletions(-)
--
1.8.3.1
[PATCH ndctl v2 00/10] daxctl: Support for sub-dividing soft-reserved regions
by Joao Martins
Changes since v1:
* Add a Documentation/daxctl/ entry for each patch that adds commands or new
option.
* Fix functional test suite to only change region 0 and not touch others
* Fix reconfigure-device -s changes (third patch) for better bisection.
v1: https://lore.kernel.org/linux-nvdimm/20200403205900.18035-1-joao.m.martin...
---
This series introduces the daxctl support for sub-dividing soft-reserved
regions created by EFI/HMAT/efi_fake_mem. It's the userspace counterpart
of this recent patch series [0].
These new 'dynamic' regions can be partitioned into multiple different devices,
each of which can consist of one or more ranges. This
is in contrast to static dax regions -- created with ndctl-create-namespace
-m devdax -- which can be neither subdivided nor discontiguous.
See also cover-letter of [0].
The daxctl changes in these patches are depicted as:
* {create,destroy,disable,enable}-device:
These orchestrate/manage the sub-division devices.
They mimic the equivalent namespace commands.
* Allow reconfigure-device to change the size of an existing *dynamic* dax
device.
* Add test coverage (Tried to cover all range allocation code paths).
v2 of kernel patches now passes this test suite.
* Documentation regarding the new command additions.
[0] "device-dax: Support sub-dividing soft-reserved ranges",
https://lore.kernel.org/linux-nvdimm/159457116473.754248.7879464730875147...
Dan Williams (1):
daxctl: Cleanup whitespace
Joao Martins (9):
libdaxctl: add daxctl_dev_set_size()
daxctl: add resize support in reconfigure-device
daxctl: add command to disable devdax device
daxctl: add command to enable devdax device
libdaxctl: add daxctl_region_create_dev()
daxctl: add command to create device
libdaxctl: add daxctl_region_destroy_dev()
daxctl: add command to destroy device
daxctl/test: Add tests for dynamic dax regions
Documentation/daxctl/Makefile.am | 6 +-
Documentation/daxctl/daxctl-create-device.txt | 105 +++++++
Documentation/daxctl/daxctl-destroy-device.txt | 63 +++++
Documentation/daxctl/daxctl-disable-device.txt | 58 ++++
Documentation/daxctl/daxctl-enable-device.txt | 59 ++++
Documentation/daxctl/daxctl-reconfigure-device.txt | 16 ++
daxctl/builtin.h | 4 +
daxctl/daxctl.c | 4 +
daxctl/device.c | 310 ++++++++++++++++++++-
daxctl/lib/libdaxctl.c | 67 +++++
daxctl/lib/libdaxctl.sym | 7 +
daxctl/libdaxctl.h | 3 +
test/Makefile.am | 1 +
test/daxctl-create.sh | 294 +++++++++++++++++++
util/filter.c | 2 +-
15 files changed, 993 insertions(+), 6 deletions(-)
create mode 100644 Documentation/daxctl/daxctl-create-device.txt
create mode 100644 Documentation/daxctl/daxctl-destroy-device.txt
create mode 100644 Documentation/daxctl/daxctl-disable-device.txt
create mode 100644 Documentation/daxctl/daxctl-enable-device.txt
create mode 100755 test/daxctl-create.sh
--
1.8.3.1
Feedback requested: Exposing NVDIMM performance statistics in a generic way
by Vaibhav Jain
Hello,
I am looking for some community feedback on these two Problem-statements:
1. How to expose NVDIMM performance statistics in an arch or nvdimm vendor
agnostic manner ?
2. Is there a common set of performance statistics for NVDIMMs that all
vendors should provide ?
Problem context
===============
While working on the bring-up of PAPR SCM based NVDIMMs[1] for arch/powerpc,
we want to expose certain dimm performance statistics like "Media
Read/Write Counts", "Power-on Seconds" etc. to user-space [2]. These
performance statistics are similar to what ipmctl[3] reports for Intel®
Optane™ persistent memory via the '-show performance' command line
arg. However, the reported set of performance stats doesn't cover the
entirety of all performance stats supported by PAPR SCM based NVDIMMs.
For example, here is a subset of performance stats which are specific to
PAPR SCM NVDIMMs and that are not reported by ipmctl:
* Controller Reset Count
* Controller Reset Elapsed Time
* Power-on Seconds
* Cache Read Hit Count
* Cache Write Hit Count
The possibility of updating ipmctl to add support for these performance
statistics is greatly hampered by the lack of ACPI support on the powerpc
arch. Secondly, vendors who don't support an ACPI/NFIT command set
similar to Intel® Optane™ (for example MSFT) are also left in the
lurch. Problem-statement#1 points to this specific problem.
Additionally, in the absence of any pre-agreed set of performance statistics
which all vendors should support, adding support for such
functionality in ipmctl may not bode well for other nvdimm vendors. For
example, if support for reporting "Controller Reset Count" is added to
ipmctl then it may not be applicable to other vendors such as Intel®
Optane™. This issue is what Problem-statement#2 refers to.
Possible Solution for Problem#1
===============================
One possible solution to Problem#1 can be to add support for reporting
NVDIMM performance statistics in 'ndctl'. 'libndctl' already has a layer
that abstracts underlying NVDIMM vendors (via struct ndctl_dimm_ops),
making supporting different NVDIMM vendors fairly easy. Also, ndctl is
more widely used than 'ipmctl', hence adding such functionality
to ndctl would make it more widely available.
The above solution was implemented as RFC patch-set[2], which exposes these
performance statistics through a generic abstraction in libndctl and
adds a presentation layer for this data in ndctl[4]. It adds a new
command line flag '--stats' to ndctl to report *all* nvdimm vendor
reported performance stats. The output is similar to the one below:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Power-on Seconds":603931,
"Media Read Count":0,
"Media Write Count":6313
}
}
]
This was done by adding two new dimm-ops callbacks that were
implemented by the papr_scm implementation within libndctl. These
callbacks are invoked by newly introduced code in 'util/json-smart.c'
that formats the stats returned from these new dimm-ops and transforms
them into a json-object for later presentation. I would request you to
look at RFC patch-set[2] to understand the implementation details.
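The general shape of such a vendor abstraction can be sketched as follows. This is not the code from patch-set[2]: struct stat_ops, its members and the demo backend are hypothetical names chosen for illustration, and the two values are copied from the sample output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-vendor callback table, loosely modelled on the idea of
 * struct ndctl_dimm_ops; the member names are illustrative only. */
struct stat_ops {
	size_t (*count)(void);
	const char *(*name)(size_t idx);
	unsigned long long (*value)(size_t idx);
};

/* Example "vendor" backend with fixed values from the sample output. */
static const char *const demo_names[] = { "Power-on Seconds", "Media Read Count" };
static const unsigned long long demo_values[] = { 603931, 0 };

static size_t demo_count(void) { return 2; }
static const char *demo_name(size_t i) { return demo_names[i]; }
static unsigned long long demo_value(size_t i) { return demo_values[i]; }

static const struct stat_ops demo_ops = { demo_count, demo_name, demo_value };

/* Vendor-agnostic presentation: format whatever the backend reports as a
 * JSON object. Returns 0 on success, -1 if buf is too small. */
static int stats_to_json(const struct stat_ops *ops, char *buf, size_t len)
{
	size_t off;
	int n;

	assert(ops != NULL && buf != NULL);
	n = snprintf(buf, len, "{");
	if (n < 0 || (size_t)n >= len)
		return -1;
	off = (size_t)n;
	for (size_t i = 0; i < ops->count(); i++) {
		n = snprintf(buf + off, len - off, "%s\"%s\":%llu",
			     i ? "," : "", ops->name(i), ops->value(i));
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	n = snprintf(buf + off, len - off, "}");
	if (n < 0 || (size_t)n >= len - off)
		return -1;
	return 0;
}
```

The presentation code never names a specific statistic, so the set of reported stats is entirely up to the backend. That is convenient for Problem#1 and is exactly the property that raises Problem#2.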
Possible Solution for Problem#2
===============================
A solution to Problem-statement#2 is what eludes me, though. If there were a
minimal set of performance stats (similar to what ndctl enforces for
health-stats) then such functionality would be easy to implement in
ndctl/ipmctl. But is it really possible to have such a common set of
performance stats that all NVDIMM vendors can expose?
Patch-set[2] tries to bypass this problem by letting the vendor
decide which performance stats to expose. This opens up the possibility
of this functionality being abused by dimm vendors to report arbitrary data
through this flag that may not be performance stats.
Summing-up
==========
In light of the above, I am requesting your feedback on how
problem-statements#{1, 2} can be addressed within the ndctl subsystem. Also,
are these problems even worth solving?
References
==========
[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/papr_...
[2] "[ndctl RFC-PATCH 0/4] Add support for reporting PAPR NVDIMM
Statistics"
https://lore.kernel.org/linux-nvdimm/20200518110814.145644-1-vaibhav@linu...
[3] https://docs.pmem.io/ipmctl-user-guide/instrumentation/show-device-perfor...
[4] "[RFC-PATCH 1/4] ndctl,libndctl: Implement new dimm-ops 'new_stats'
and 'get_stat'"
https://lore.kernel.org/linux-nvdimm/20200514225258.508463-2-vaibhav@linu...
Thanks,
~ Vaibhav
[PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas
by Mike Rapoport
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
This is an implementation of "secret" mappings backed by a file descriptor.
I've dropped the boot time reservation patch for now as it is not strictly
required for the basic usage and can be easily added later either with or
without CMA.
v6 changes:
* Silence the warning about missing syscall, thanks to Qian Cai
* Replace spaces with tabs in Kconfig additions, per Randy
* Add a selftest.
v5 changes:
* rebase on v5.9-rc5
* drop boot time memory reservation patch
v4 changes:
* rebase on v5.9-rc1
* Do not redefine PMD_PAGE_ORDER in fs/dax.c, thanks Kirill
* Make secret mappings exclusive by default and only require flags to
memfd_secret() system call for uncached mappings, thanks again Kirill :)
v3 changes:
* Squash kernel-parameters.txt update into the commit that added the
command line option.
* Make uncached mode explicitly selectable by architectures. For now enable
it only on x86.
v2 changes:
* Follow Michael's suggestion and name the new system call 'memfd_secret'
* Add kernel-parameters documentation about the boot option
* Fix i386-tinyconfig regression reported by the kbuild bot.
CONFIG_SECRETMEM now depends on !EMBEDDED to disable it on small systems
from one side and still make it available unconditionally on
architectures that support SET_DIRECT_MAP.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will have desired protection bits set in the user page
table. For instance, current implementation allows uncached mappings.
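A minimal userspace sketch of that flow is shown below. It makes several assumptions that this cover letter does not state: the syscall number 447 and the ftruncate() step follow the interface as it was later merged upstream, flags are left at 0, and secret_alloc() is a hypothetical helper name. On kernels without the call, or with it disabled, the helper simply returns NULL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 447 is the number under which memfd_secret was later merged
 * upstream; the series itself does not fix the value. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/* Returns a "secret" mapping of len bytes, or NULL if unsupported/failed. */
static void *secret_alloc(size_t len)
{
	void *p;
	int fd;

	assert(len > 0);
	fd = (int)syscall(SYS_memfd_secret, 0u);
	if (fd < 0)
		return NULL; /* e.g. ENOSYS on kernels without the call */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mapping keeps the backing file alive */
	return p == MAP_FAILED ? NULL : p;
}
```

A caller would store key material in the returned region and release it with munmap() when done.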
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.
Additionally, the secret mappings may be used as a means to protect guest
memory in a virtual machine host.
To demonstrate secret memory usage we've created a userspace library
[1] that does two things. First, it acts as a preloader for openssl that
redirects all the OPENSSL_malloc calls to secret memory, meaning any secret
keys get automatically protected this way. Second, it
exposes the API to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
I've hesitated whether to continue to use new flags to memfd_create() or to
add a new system call and I've decided to use a new system call after I've
started to look into man pages update. There would have been two completely
independent descriptions and I think it would have been very confusing.
Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
As the fragmentation of the direct map was one of the major concerns raised
during the previous postings, I've added an amortizing cache of PMD-size
pages to each file descriptor that is used as an allocation pool for the
secret memory areas.
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
Mike Rapoport (6):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
mm: introduce memfd_secret system call to create "secret" memory areas
arch, mm: wire up memfd_secret system call were relevant
mm: secretmem: use PMD-size pages to amortize direct map fragmentation
secretmem: test: add basic selftest for memfd_secret(2)
arch/Kconfig | 7 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/riscv/include/asm/unistd.h | 1 +
arch/x86/Kconfig | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/dax.c | 11 +-
include/linux/pgtable.h | 3 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
include/uapi/linux/secretmem.h | 8 +
kernel/sys_ni.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/internal.h | 3 +
mm/mmap.c | 5 +-
mm/secretmem.c | 333 ++++++++++++++++++++++
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 +++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 ++
25 files changed, 703 insertions(+), 13 deletions(-)
create mode 100644 include/uapi/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
--
2.28.0
[RFC PATCH 0/4] powerpc/papr_scm: Add support for reporting NVDIMM performance statistics
by Vaibhav Jain
The patch-set proposes to add support for fetching and reporting
performance statistics for PAPR compliant NVDIMMs as described in the
documentation for the H_SCM_PERFORMANCE_STATS hcall, Ref[1]. The patch-set
also implements mechanisms to expose NVDIMM performance stats via
sysfs and newly introduced PDSMs[2] for libndctl.
This patch-set, combined with the corresponding ndctl and libndctl changes
proposed at Ref[3], should enable a user to fetch performance stats for
PAPR compliant NVDIMMs using the following command:
# ndctl list -D --stats
[
{
"dev":"nmem0",
"stats":{
"Controller Reset Count":2,
"Controller Reset Elapsed Time":603331,
"Power-on Seconds":603931,
"Life Remaining":"100%",
"Critical Resource Utilization":"0%",
"Host Load Count":5781028,
"Host Store Count":8966800,
"Host Load Duration":975895365,
"Host Store Duration":716230690,
"Media Read Count":0,
"Media Write Count":6313,
"Media Read Duration":0,
"Media Write Duration":9679615,
"Cache Read Hit Count":5781028,
"Cache Write Hit Count":8442479,
"Fast Write Count":8969912
}
}
]
The patchset is dependent on the existing patch-set "[PATCH v7 0/5]
powerpc/papr_scm: Add support for reporting nvdimm health" available
at Ref[2], which adds support for reporting the health of PAPR compliant
NVDIMMs in the 'papr_scm' kernel module.
Structure of the patch-set
==========================
The patch-set starts with implementing functionality in the papr_scm
module to issue the H_SCM_PERFORMANCE_STATS hcall, fetch & parse dimm
performance stats and expose them as a PAPR specific libnvdimm
attribute named 'perf_stats'.
Patch-2 introduces a new PDSM named FETCH_PERF_STATS that can be
issued by libndctl asking papr_scm to issue the
H_SCM_PERFORMANCE_STATS hcall using helpers introduced earlier and
storing the results in a dimm specific perf-stats-buffer.
Patch-3 introduces a new PDSM named READ_PERF_STATS that can be
issued by libndctl to read the perf-stats-buffer in an incremental
manner to work around the 256-byte envelope limitation of libnvdimm.
Finally Patch-4 introduces a new PDSM named GET_PERF_STAT that can be
issued by libndctl to read values of a specific NVDIMM performance
stat like "Life Remaining".
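The fetch-then-read split described for Patch-2 and Patch-3 amounts to reading a larger buffer through a fixed-size window. A rough userspace sketch of the reading side follows; read_chunk() and read_all() are hypothetical names, and a plain memcpy() stands in for issuing the READ_PERF_STATS PDSM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ENVELOPE_SIZE 256 /* per-command payload limit named in the text */

/* Hypothetical stand-in for one READ_PERF_STATS request: copy at most
 * ENVELOPE_SIZE bytes of src, starting at offset, into out. Returns the
 * number of bytes copied (0 once offset reaches the end). */
static size_t read_chunk(const unsigned char *src, size_t src_len,
			 size_t offset, unsigned char *out)
{
	size_t n;

	if (offset >= src_len)
		return 0;
	n = src_len - offset;
	if (n > ENVELOPE_SIZE)
		n = ENVELOPE_SIZE;
	memcpy(out, src + offset, n);
	return n;
}

/* Caller side: reassemble the whole buffer by advancing the offset. */
static size_t read_all(const unsigned char *src, size_t src_len,
		       unsigned char *dst, size_t dst_len)
{
	unsigned char envelope[ENVELOPE_SIZE];
	size_t off = 0, n;

	assert(dst_len >= src_len);
	while ((n = read_chunk(src, src_len, off, envelope)) > 0) {
		memcpy(dst + off, envelope, n);
		off += n;
	}
	return off;
}
```

Fetching once and then reading in chunks means all chunks come from the same snapshot of the stats, which separate hcalls per chunk would not guarantee.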
References
==========
[1] Documentation/powerpc/papr_hcalls.rst
[2] https://lore.kernel.org/linux-nvdimm/20200508104922.72565-1-vaibhav@linux...
[3] https://github.com/vaibhav92/ndctl/tree/papr_scm_stats_v1
Vaibhav Jain (4):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for PAPR_SCM_PDSM_FETCH_PERF_STATS
powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_READ_PERF_STATS
powerpc/papr_scm: Add support for PDSM GET_PERF_STAT
Documentation/ABI/testing/sysfs-bus-papr-scm | 27 ++
arch/powerpc/include/uapi/asm/papr_scm_pdsm.h | 60 +++
arch/powerpc/platforms/pseries/papr_scm.c | 391 ++++++++++++++++++
3 files changed, 478 insertions(+)
--
2.26.2
[PATCH 0/4] Remove nrexceptional tracking
by Matthew Wilcox (Oracle)
We actually use nrexceptional for very little these days. It's a constant
source of pain with the THP patches because we don't know how large a
shadow entry is, so either we have to ask the xarray how many indices
it covers, or store that information in the shadow entry (and reduce
the amount of other information in the shadow entry proportionally).
While tracking down the most recent case of "evict tells me I've got
the accounting wrong again", I wondered if it might not be simpler to
just remove it. So here's a patch set to do just that. I think each
of these patches is an improvement in isolation, but the combination of
all four is larger than the sum of its parts.
I'm running xfstests on this patchset right now. If one of the DAX
people could try it out, that'd be fantastic.
Matthew Wilcox (Oracle) (4):
mm: Introduce and use page_cache_empty
mm: Stop accounting shadow entries
dax: Account DAX entries as nrpages
mm: Remove nrexceptional from inode
fs/block_dev.c | 2 +-
fs/dax.c | 8 ++++----
fs/inode.c | 2 +-
include/linux/fs.h | 2 --
include/linux/pagemap.h | 5 +++++
mm/filemap.c | 15 ---------------
mm/truncate.c | 19 +++----------------
mm/workingset.c | 1 -
8 files changed, 14 insertions(+), 40 deletions(-)
--
2.27.0
1 year, 7 months
[PATCH v4 0/8] selective merging of system ram resources
by David Hildenbrand
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (in contrast to, e.g., DIMMs, where
the boundaries exposed via /proc/iomem to user space might be relevant).
We really want to merge added resources in this scenario where possible.
Resources are effectively stored in a list-based tree. Having a lot of
resources not only wastes memory, it also makes traversing that tree more
expensive, and makes /proc/iomem explode in size (e.g., requiring
kexec-tools to manually merge resources when creating a kdump header. The
current kexec-tools resource count limit does not allow for more than
~100GB of memory with a memory block size of 128MB on x86-64).
Let's allow selectively merging system ram resources by specifying a
new flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
tested with virtio-mem.
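To make the effect of merging concrete, here is a simplified user-space
sketch. It is not the kernel's resource code: struct ram_range,
merge_ranges() and blocks_needed() are hypothetical names, the resource tree
is modelled as a sorted array, and the sketch merges every eligible neighbour
in one pass rather than only the single, specified resource.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a System RAM resource: inclusive [start, end]. */
struct ram_range {
	uint64_t start;
	uint64_t end;
	int mergeable;	/* models the opt-in flag given when memory is added */
};

/*
 * Merge directly adjacent, mergeable ranges in place. 'r' must be sorted by
 * start address and must not overlap. Returns the new number of ranges.
 */
static size_t merge_ranges(struct ram_range *r, size_t n)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		assert(r[i].start > r[out].end);
		if (r[out].mergeable && r[i].mergeable &&
		    r[out].end + 1 == r[i].start)
			r[out].end = r[i].end;	/* extend the previous range */
		else
			r[++out] = r[i];	/* keep as a separate range */
	}
	return out + 1;
}

/* Number of resources when every memory block is its own resource. */
static uint64_t blocks_needed(uint64_t total_bytes, uint64_t block_bytes)
{
	assert(block_bytes != 0);
	return (total_bytes + block_bytes - 1) / block_bytes;
}
```

With these helpers, blocks_needed(100ULL << 30, 128ULL << 20) evaluates to
800, which is the number of separate resources behind the ~100GB figure
above when nothing is merged.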
v3 -> v4:
- "mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG"
-- Fix configs without CONFIG_MEMORY_HOTPLUG with the new mhp_t type
-- Did a bunch of cross-compiles with different configs, hope there isn't
anything I missed.
v2 -> v3:
- "mm/memory_hotplug: prepare passing flags to add_memory() and friends"
-- Use proper __bitwise type for flags
-- Use "MHP_NONE" for empty flags
- Rebased to latest -next, added rb's
v1 -> v2:
- I had another look at v1 after vacation and didn't like it - it felt like
a hack. So I went forward and added a proper flag to add_memory*(), and
introduced a clean (non-racy) way to mark System RAM resources mergeable.
- "kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED"
-- Clean that flag up, felt wrong in the PnP section
- "mm/memory_hotplug: prepare passing flags to add_memory() and friends"
-- Previously sent in other context - decided to keep Wei's ack
- "mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System
RAM resources"
-- Cleaner approach to get the job done by using proper flags and only
merging the single, specified resource
- "virtio-mem: try to merge system ram resources"
"xen/balloon: try to merge system ram resources"
"hv_balloon: try to merge system ram resources"
-- Use the new flag MEMHP_MERGE_RESOURCE, much cleaner
RFC -> v1:
- Switch from the rather generic "merge_child_mem_resources()", where a
resource name has to be specified, to "merge_system_ram_resources()".
- Smaller comment/documentation/patch description changes/fixes
David Hildenbrand (8):
kernel/resource: make release_mem_region_adjustable() never fail
kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED
mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG
mm/memory_hotplug: prepare passing flags to add_memory() and friends
mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System
RAM resources
virtio-mem: try to merge system ram resources
xen/balloon: try to merge system ram resources
hv_balloon: try to merge system ram resources
arch/powerpc/platforms/powernv/memtrace.c | 2 +-
.../platforms/pseries/hotplug-memory.c | 2 +-
drivers/acpi/acpi_memhotplug.c | 3 +-
drivers/base/memory.c | 3 +-
drivers/dax/kmem.c | 2 +-
drivers/hv/hv_balloon.c | 2 +-
drivers/s390/char/sclp_cmd.c | 2 +-
drivers/virtio/virtio_mem.c | 3 +-
drivers/xen/balloon.c | 2 +-
include/linux/ioport.h | 12 +-
include/linux/memory_hotplug.h | 35 +++---
kernel/kexec_file.c | 2 +-
kernel/resource.c | 109 ++++++++++++++----
mm/memory_hotplug.c | 47 +++-----
mm/sparse.c | 2 +
15 files changed, 151 insertions(+), 77 deletions(-)
--
2.26.2
1 year, 7 months