[PATCH 1/2] libnvdimm, pfn: use size is enough
by Wei Yang
When checking whether the current nd_region intersects with others, we
have already calculated *size* expanded to the SECTION size.
So passing size alone is enough.
Signed-off-by: Wei Yang <richardw.yang(a)linux.intel.com>
---
drivers/nvdimm/pfn_devs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index becf0bb481b3..5eca050b3660 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -686,7 +686,7 @@ static void trim_pfn_device(struct nd_pfn *nd_pfn, u32 *start_pad, u32 *end_trun
if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
IORES_DESC_NONE) == REGION_MIXED
|| !IS_ALIGNED(end, nd_pfn->align)
- || nd_region_conflict(nd_region, start, size + adjust))
+ || nd_region_conflict(nd_region, start, size))
*end_trunc = end - phys_pmem_align_down(nd_pfn, end);
}
--
2.19.1
1 year, 11 months
[PATCH v3 0/5] kvm "virtio pmem" device
by Pankaj Gupta
This patch series implements "virtio pmem".
"virtio pmem" is fake persistent memory (nvdimm) in the guest
which allows bypassing the guest page cache. It also
implements a VIRTIO-based asynchronous flush mechanism.
This patchset shares the guest kernel driver with the
changes suggested in v2. Tested with the Qemu-side device
emulation for virtio-pmem [6].
Details of the project idea for the 'virtio pmem' flushing interface
are shared in [3] & [4].
Implementation is divided into two parts: a new virtio pmem guest
driver, and qemu code changes for the new virtio pmem paravirtualized
device.
1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- The 'nvdimm/pmem' driver uses this information to allocate a
persistent memory region and set up filesystem operations
on the allocated memory.
- The virtio pmem driver implements an asynchronous flush
interface from guest to host.
2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- At host side this is file backed memory which acts as
persistent memory.
- The Qemu-side flush uses the aio thread pool APIs and virtio for
asynchronous handling of multiple guest requests.
David Hildenbrand (CCed) also posted a modified version [6] of the
qemu virtio-pmem code based on the updated Qemu memory device API.
Virtio-pmem error handling:
----------------------------------------
Checked the behaviour of virtio-pmem for the types of errors below.
Suggestions are needed on the expected behaviour for handling these errors:
- Hardware errors: uncorrectable recoverable errors:
a] virtio-pmem:
- As per the current logic, if the error page belongs to the Qemu process,
the host MCE handler isolates (hwpoisons) that page and sends SIGBUS.
The Qemu SIGBUS handler injects an exception into the KVM guest.
- The KVM guest then isolates the page and sends SIGBUS to the guest
userspace process which has mapped the page.
b] Existing implementation in the ACPI pmem driver:
- Handles such errors with an MCE notifier and creates a list
of bad blocks. Read/direct-access DAX operations return EIO
if the accessed memory page falls in the bad block list.
- It also starts background scrubbing.
- Similar functionality can be reused in virtio-pmem with an MCE
notifier but without scrubbing (no ACPI/ARS)? Inputs are needed to
confirm whether this behaviour is OK or needs any change.
Changes from PATCH v2: [1]
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'
Changes from PATCH v1: [2]
- 0-day build test for build dependency on libnvdimm
Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don't move declarations to a common global header e.g. nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Replace BSD license text with SPDX license text
- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow
Changes from RFC v3:
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush - Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- Fix request leak, correct sizeof req - Stefan
- Move declaration to virtio_pmem.c
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Pankaj Gupta (5):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem
[2] https://lkml.org/lkml/2018/8/31/407
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
drivers/acpi/nfit/core.c | 4 -
drivers/dax/super.c | 17 +++++
drivers/nvdimm/claim.c | 6 +
drivers/nvdimm/nd.h | 1
drivers/nvdimm/pmem.c | 15 +++-
drivers/nvdimm/region_devs.c | 45 +++++++++++++-
drivers/nvdimm/virtio_pmem.c | 84 ++++++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
drivers/virtio/Makefile | 1
drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++++++++++
fs/ext4/file.c | 11 +++
fs/xfs/xfs_file.c | 8 ++
include/linux/dax.h | 9 ++
include/linux/libnvdimm.h | 11 +++
include/linux/virtio_pmem.h | 60 ++++++++++++++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 10 +++
17 files changed, 406 insertions(+), 12 deletions(-)
[ndctl PATCH 2/2] libndctl: NVDIMM_FAMILY_HYPERV: add .smart_get_shutdown_count (Function 2)
by Dexuan Cui
With this patch, "ndctl list --dimms --health --idle" can now show
"shutdown_count", e.g.
{
"dev":"nmem0",
"id":"04d5-01-1701-00000000",
"handle":0,
"phys_id":0,
"health":{
"health_state":"ok",
"shutdown_count":2
}
}
The patch has to call ndctl_cmd_submit() directly in
hyperv_cmd_smart_get_flags() and hyperv_cmd_smart_get_shutdown_count() to
get the needed info, because util_dimm_health_to_json() only submits one
command, and unfortunately for the Hyper-V virtual NVDIMM we need to call
both Function 1 and Function 2 to get the needed info.
My feeling is that calling ndctl_cmd_submit() directly is not ideal, but
doing so requires no change to the common code, and I'm unsure whether
it's better to change the common code just for Hyper-V.
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
---
ndctl/lib/hyperv.c | 62 ++++++++++++++++++++++++++++++++++++++++------
ndctl/lib/hyperv.h | 7 ++++++
2 files changed, 62 insertions(+), 7 deletions(-)
diff --git a/ndctl/lib/hyperv.c b/ndctl/lib/hyperv.c
index b303d50..e8ec142 100644
--- a/ndctl/lib/hyperv.c
+++ b/ndctl/lib/hyperv.c
@@ -22,7 +22,8 @@
#define CMD_HYPERV_STATUS(_c) (CMD_HYPERV(_c)->u.status)
#define CMD_HYPERV_SMART_DATA(_c) (CMD_HYPERV(_c)->u.smart.data)
-static struct ndctl_cmd *hyperv_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
+static struct ndctl_cmd *hyperv_dimm_cmd_new_cmd(struct ndctl_dimm *dimm,
+ unsigned int command)
{
struct ndctl_bus *bus = ndctl_dimm_get_bus(dimm);
struct ndctl_ctx *ctx = ndctl_bus_get_ctx(bus);
@@ -35,8 +36,7 @@ static struct ndctl_cmd *hyperv_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
return NULL;
}
- if (test_dimm_dsm(dimm, ND_HYPERV_CMD_GET_HEALTH_INFO) ==
- DIMM_DSM_UNSUPPORTED) {
+ if (test_dimm_dsm(dimm, command) == DIMM_DSM_UNSUPPORTED) {
dbg(ctx, "unsupported function\n");
return NULL;
}
@@ -54,7 +54,7 @@ static struct ndctl_cmd *hyperv_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
hyperv = CMD_HYPERV(cmd);
hyperv->gen.nd_family = NVDIMM_FAMILY_HYPERV;
- hyperv->gen.nd_command = ND_HYPERV_CMD_GET_HEALTH_INFO;
+ hyperv->gen.nd_command = command;
hyperv->gen.nd_fw_size = 0;
hyperv->gen.nd_size_in = offsetof(struct nd_hyperv_smart, status);
hyperv->gen.nd_size_out = sizeof(hyperv->u.smart);
@@ -65,34 +65,74 @@ static struct ndctl_cmd *hyperv_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
return cmd;
}
-static int hyperv_smart_valid(struct ndctl_cmd *cmd)
+static struct ndctl_cmd *hyperv_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
+{
+ return hyperv_dimm_cmd_new_cmd(dimm, ND_HYPERV_CMD_GET_HEALTH_INFO);
+}
+
+static int hyperv_cmd_valid(struct ndctl_cmd *cmd, unsigned int command)
{
if (cmd->type != ND_CMD_CALL ||
cmd->size != sizeof(*cmd) + sizeof(struct nd_pkg_hyperv) ||
CMD_HYPERV(cmd)->gen.nd_family != NVDIMM_FAMILY_HYPERV ||
- CMD_HYPERV(cmd)->gen.nd_command != ND_HYPERV_CMD_GET_HEALTH_INFO ||
+ CMD_HYPERV(cmd)->gen.nd_command != command ||
cmd->status != 0 ||
CMD_HYPERV_STATUS(cmd) != 0)
return cmd->status < 0 ? cmd->status : -EINVAL;
return 0;
}
+static int hyperv_smart_valid(struct ndctl_cmd *cmd)
+{
+ return hyperv_cmd_valid(cmd, ND_HYPERV_CMD_GET_HEALTH_INFO);
+}
+
static int hyperv_cmd_xlat_firmware_status(struct ndctl_cmd *cmd)
{
return CMD_HYPERV_STATUS(cmd) == 0 ? 0 : -EINVAL;
}
+static int hyperv_get_shutdown_count(struct ndctl_cmd *cmd,
+ unsigned int *count)
+{
+ unsigned int command = ND_HYPERV_CMD_GET_SHUTDOWN_INFO;
+ struct ndctl_cmd *cmd_get_shutdown_info;
+ int rc;
+
+ cmd_get_shutdown_info = hyperv_dimm_cmd_new_cmd(cmd->dimm, command);
+ if (!cmd_get_shutdown_info)
+ return -EINVAL;
+
+ if (ndctl_cmd_submit(cmd_get_shutdown_info) < 0 ||
+ hyperv_cmd_valid(cmd_get_shutdown_info, command) < 0) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ *count = CMD_HYPERV(cmd_get_shutdown_info)->u.shutdown_info.count;
+ rc = 0;
+out:
+ ndctl_cmd_unref(cmd_get_shutdown_info);
+ return rc;
+}
+
static unsigned int hyperv_cmd_smart_get_flags(struct ndctl_cmd *cmd)
{
int rc;
+ unsigned int count;
+ unsigned int flags = 0;
rc = hyperv_smart_valid(cmd);
if (rc < 0) {
errno = -rc;
return 0;
}
+ flags |= ND_SMART_HEALTH_VALID;
- return ND_SMART_HEALTH_VALID;
+ if (hyperv_get_shutdown_count(cmd, &count) == 0)
+ flags |= ND_SMART_SHUTDOWN_COUNT_VALID;
+
+ return flags;
}
static unsigned int hyperv_cmd_smart_get_health(struct ndctl_cmd *cmd)
@@ -121,9 +161,17 @@ static unsigned int hyperv_cmd_smart_get_health(struct ndctl_cmd *cmd)
return health;
}
+static unsigned int hyperv_cmd_smart_get_shutdown_count(struct ndctl_cmd *cmd)
+{
+ unsigned int count;
+
+ return hyperv_get_shutdown_count(cmd, &count) == 0 ? count : UINT_MAX;
+}
+
struct ndctl_dimm_ops * const hyperv_dimm_ops = &(struct ndctl_dimm_ops) {
.new_smart = hyperv_dimm_cmd_new_smart,
.smart_get_flags = hyperv_cmd_smart_get_flags,
.smart_get_health = hyperv_cmd_smart_get_health,
+ .smart_get_shutdown_count = hyperv_cmd_smart_get_shutdown_count,
.xlat_firmware_status = hyperv_cmd_xlat_firmware_status,
};
diff --git a/ndctl/lib/hyperv.h b/ndctl/lib/hyperv.h
index 8e55a97..5232d60 100644
--- a/ndctl/lib/hyperv.h
+++ b/ndctl/lib/hyperv.h
@@ -19,6 +19,7 @@ enum {
/* non-root commands */
ND_HYPERV_CMD_GET_HEALTH_INFO = 1,
+ ND_HYPERV_CMD_GET_SHUTDOWN_INFO = 2,
};
/*
@@ -38,9 +39,15 @@ struct nd_hyperv_smart {
};
} __attribute__((packed));
+struct nd_hyperv_shutdown_info {
+ __u32 status;
+ __u32 count;
+} __attribute__((packed));
+
union nd_hyperv_cmd {
__u32 status;
struct nd_hyperv_smart smart;
+ struct nd_hyperv_shutdown_info shutdown_info;
} __attribute__((packed));
struct nd_pkg_hyperv {
--
2.19.1
[ndctl PATCH 1/2] libndctl: add support for NVDIMM_FAMILY_HYPERV's _DSM Function 1
by Dexuan Cui
This patch retrieves the health info via the Hyper-V _DSM method
Function 1: Get Health Information (Function Index 1).
See http://www.uefi.org/RFIC_LIST ("Virtual NVDIMM 0x1901").
Now "ndctl list --dimms --health --idle" can show a line
"health_state":"ok", e.g.
{
"dev":"nmem0",
"id":"04d5-01-1701-00000000",
"handle":0,
"phys_id":0,
"health":{
"health_state":"ok"
}
}
If there is an error with the NVDIMM, the "ok" will be replaced with "unknown",
"fatal", "critical", or "non-critical".
Hyper-V also supports "Get Unsafe Shutdown Count (Function Index 2)", but
unfortunately util_dimm_health_to_json() only submits *one* command, so we
don't have a chance to call Function 2 here, and hence we can't show a
"shutdown_count" line in the output. If a user in a Linux virtual machine
running on Hyper-V is interested in the unsafe shutdown count, they can
directly run:
"cat /sys/bus/nd/devices/nmem*/nfit/dirty_shutdown" (there is a pending
patch submitted to the kernel for this sysfs node).
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
---
ndctl/lib/Makefile.am | 1 +
ndctl/lib/hyperv.c | 129 ++++++++++++++++++++++++++++++++++++++++++
ndctl/lib/hyperv.h | 51 +++++++++++++++++
ndctl/lib/libndctl.c | 2 +
ndctl/lib/private.h | 3 +
ndctl/ndctl.h | 1 +
6 files changed, 187 insertions(+)
create mode 100644 ndctl/lib/hyperv.c
create mode 100644 ndctl/lib/hyperv.h
diff --git a/ndctl/lib/Makefile.am b/ndctl/lib/Makefile.am
index 7797039..fb75fda 100644
--- a/ndctl/lib/Makefile.am
+++ b/ndctl/lib/Makefile.am
@@ -20,6 +20,7 @@ libndctl_la_SOURCES =\
intel.c \
hpe1.c \
msft.c \
+ hyperv.c \
ars.c \
firmware.c \
libndctl.c
diff --git a/ndctl/lib/hyperv.c b/ndctl/lib/hyperv.c
new file mode 100644
index 0000000..b303d50
--- /dev/null
+++ b/ndctl/lib/hyperv.c
@@ -0,0 +1,129 @@
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for
+ * more details.
+ */
+#include <stdlib.h>
+#include <limits.h>
+#include <util/bitmap.h>
+#include <util/log.h>
+#include <ndctl/libndctl.h>
+#include "private.h"
+#include "hyperv.h"
+
+#define CMD_HYPERV(_c) ((_c)->hyperv)
+#define CMD_HYPERV_STATUS(_c) (CMD_HYPERV(_c)->u.status)
+#define CMD_HYPERV_SMART_DATA(_c) (CMD_HYPERV(_c)->u.smart.data)
+
+static struct ndctl_cmd *hyperv_dimm_cmd_new_smart(struct ndctl_dimm *dimm)
+{
+ struct ndctl_bus *bus = ndctl_dimm_get_bus(dimm);
+ struct ndctl_ctx *ctx = ndctl_bus_get_ctx(bus);
+ struct ndctl_cmd *cmd;
+ size_t size;
+ struct nd_pkg_hyperv *hyperv;
+
+ if (!ndctl_dimm_is_cmd_supported(dimm, ND_CMD_CALL)) {
+ dbg(ctx, "unsupported cmd\n");
+ return NULL;
+ }
+
+ if (test_dimm_dsm(dimm, ND_HYPERV_CMD_GET_HEALTH_INFO) ==
+ DIMM_DSM_UNSUPPORTED) {
+ dbg(ctx, "unsupported function\n");
+ return NULL;
+ }
+
+ size = sizeof(*cmd) + sizeof(struct nd_pkg_hyperv);
+ cmd = calloc(1, size);
+ if (!cmd)
+ return NULL;
+
+ cmd->dimm = dimm;
+ ndctl_cmd_ref(cmd);
+ cmd->type = ND_CMD_CALL;
+ cmd->size = size;
+ cmd->status = 1;
+
+ hyperv = CMD_HYPERV(cmd);
+ hyperv->gen.nd_family = NVDIMM_FAMILY_HYPERV;
+ hyperv->gen.nd_command = ND_HYPERV_CMD_GET_HEALTH_INFO;
+ hyperv->gen.nd_fw_size = 0;
+ hyperv->gen.nd_size_in = offsetof(struct nd_hyperv_smart, status);
+ hyperv->gen.nd_size_out = sizeof(hyperv->u.smart);
+ hyperv->u.smart.status = 0;
+
+ cmd->firmware_status = &hyperv->u.smart.status;
+
+ return cmd;
+}
+
+static int hyperv_smart_valid(struct ndctl_cmd *cmd)
+{
+ if (cmd->type != ND_CMD_CALL ||
+ cmd->size != sizeof(*cmd) + sizeof(struct nd_pkg_hyperv) ||
+ CMD_HYPERV(cmd)->gen.nd_family != NVDIMM_FAMILY_HYPERV ||
+ CMD_HYPERV(cmd)->gen.nd_command != ND_HYPERV_CMD_GET_HEALTH_INFO ||
+ cmd->status != 0 ||
+ CMD_HYPERV_STATUS(cmd) != 0)
+ return cmd->status < 0 ? cmd->status : -EINVAL;
+ return 0;
+}
+
+static int hyperv_cmd_xlat_firmware_status(struct ndctl_cmd *cmd)
+{
+ return CMD_HYPERV_STATUS(cmd) == 0 ? 0 : -EINVAL;
+}
+
+static unsigned int hyperv_cmd_smart_get_flags(struct ndctl_cmd *cmd)
+{
+ int rc;
+
+ rc = hyperv_smart_valid(cmd);
+ if (rc < 0) {
+ errno = -rc;
+ return 0;
+ }
+
+ return ND_SMART_HEALTH_VALID;
+}
+
+static unsigned int hyperv_cmd_smart_get_health(struct ndctl_cmd *cmd)
+{
+ unsigned int health = 0;
+ __u32 num;
+ int rc;
+
+ rc = hyperv_smart_valid(cmd);
+ if (rc < 0) {
+ errno = -rc;
+ return UINT_MAX;
+ }
+
+ num = CMD_HYPERV_SMART_DATA(cmd)->health & 0x3F;
+
+ if (num & (BIT(0) | BIT(1)))
+ health |= ND_SMART_CRITICAL_HEALTH;
+
+ if (num & BIT(2))
+ health |= ND_SMART_FATAL_HEALTH;
+
+ if (num & (BIT(3) | BIT(4) | BIT(5)))
+ health |= ND_SMART_NON_CRITICAL_HEALTH;
+
+ return health;
+}
+
+struct ndctl_dimm_ops * const hyperv_dimm_ops = &(struct ndctl_dimm_ops) {
+ .new_smart = hyperv_dimm_cmd_new_smart,
+ .smart_get_flags = hyperv_cmd_smart_get_flags,
+ .smart_get_health = hyperv_cmd_smart_get_health,
+ .xlat_firmware_status = hyperv_cmd_xlat_firmware_status,
+};
diff --git a/ndctl/lib/hyperv.h b/ndctl/lib/hyperv.h
new file mode 100644
index 0000000..8e55a97
--- /dev/null
+++ b/ndctl/lib/hyperv.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for
+ * more details.
+ */
+#ifndef __NDCTL_HYPERV_H__
+#define __NDCTL_HYPERV_H__
+
+/* See http://www.uefi.org/RFIC_LIST ("Virtual NVDIMM 0x1901") */
+enum {
+ ND_HYPERV_CMD_QUERY = 0,
+
+ /* non-root commands */
+ ND_HYPERV_CMD_GET_HEALTH_INFO = 1,
+};
+
+/*
+ * This is actually Function 1's data,
+ * This is the closest I can find to match the "smart".
+ * Hyper-V _DSM methods don't have a smart function.
+ */
+struct nd_hyperv_smart_data {
+ __u32 health;
+} __attribute__((packed));
+
+struct nd_hyperv_smart {
+ __u32 status;
+ union {
+ __u8 buf[4];
+ struct nd_hyperv_smart_data data[0];
+ };
+} __attribute__((packed));
+
+union nd_hyperv_cmd {
+ __u32 status;
+ struct nd_hyperv_smart smart;
+} __attribute__((packed));
+
+struct nd_pkg_hyperv {
+ struct nd_cmd_pkg gen;
+ union nd_hyperv_cmd u;
+} __attribute__((packed));
+
+#endif /* __NDCTL_HYPERV_H__ */
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index c9e2875..48bdb27 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -1492,6 +1492,8 @@ static void *add_dimm(void *parent, int id, const char *dimm_base)
dimm->ops = hpe1_dimm_ops;
if (dimm->cmd_family == NVDIMM_FAMILY_MSFT)
dimm->ops = msft_dimm_ops;
+ if (dimm->cmd_family == NVDIMM_FAMILY_HYPERV)
+ dimm->ops = hyperv_dimm_ops;
sprintf(path, "%s/nfit/dsm_mask", dimm_base);
if (sysfs_read_attr(ctx, path, buf) == 0)
diff --git a/ndctl/lib/private.h b/ndctl/lib/private.h
index a387b0b..a9d35c5 100644
--- a/ndctl/lib/private.h
+++ b/ndctl/lib/private.h
@@ -31,6 +31,7 @@
#include "intel.h"
#include "hpe1.h"
#include "msft.h"
+#include "hyperv.h"
struct nvdimm_data {
struct ndctl_cmd *cmd_read;
@@ -270,6 +271,7 @@ struct ndctl_cmd {
struct nd_cmd_pkg pkg[0];
struct ndn_pkg_hpe1 hpe1[0];
struct ndn_pkg_msft msft[0];
+ struct nd_pkg_hyperv hyperv[0];
struct nd_pkg_intel intel[0];
struct nd_cmd_get_config_size get_size[0];
struct nd_cmd_get_config_data_hdr get_data[0];
@@ -344,6 +346,7 @@ struct ndctl_dimm_ops {
struct ndctl_dimm_ops * const intel_dimm_ops;
struct ndctl_dimm_ops * const hpe1_dimm_ops;
struct ndctl_dimm_ops * const msft_dimm_ops;
+struct ndctl_dimm_ops * const hyperv_dimm_ops;
static inline struct ndctl_bus *cmd_to_bus(struct ndctl_cmd *cmd)
{
diff --git a/ndctl/ndctl.h b/ndctl/ndctl.h
index c6aaa4c..008f81c 100644
--- a/ndctl/ndctl.h
+++ b/ndctl/ndctl.h
@@ -262,6 +262,7 @@ struct nd_cmd_pkg {
#define NVDIMM_FAMILY_HPE1 1
#define NVDIMM_FAMILY_HPE2 2
#define NVDIMM_FAMILY_MSFT 3
+#define NVDIMM_FAMILY_HYPERV 4
#define ND_IOCTL_CALL _IOWR(ND_IOCTL, ND_CMD_CALL,\
struct nd_cmd_pkg)
--
2.19.1
[PATCH v3 0/5] Optimize writecache when using pmem as cache
by Huaisheng Ye
From: Huaisheng Ye <yehs1(a)lenovo.com>
This patch set can be used for dm-writecache when using persistent
memory as the cache data device.
Patches 1 and 2 remove an unused parameter and code which
doesn't actually work.
Patches 3 and 4 are targeted at solving the problem that the ctr
function fails due to an invalid magic or version, which is caused by
stale data stored in the pmem super block.
Patch 5 is used for getting the status of seq_count.
Changes since v2:
- seq_count is important for flush operations; output it within status
for debugging and analyzing code behaviour.
[1]: https://lkml.org/lkml/2019/1/3/43
[2]: https://lkml.org/lkml/2019/1/9/6
Huaisheng Ye (5):
dm-writecache: remove unused size to writecache_flush_region
dm-writecache: get rid of memory_data flush to writecache_flush_entry
dm-writecache: expand pmem_reinit for struct dm_writecache
Documentation/device-mapper: add optional parameter reinit
dm-writecache: output seq_count within status
Documentation/device-mapper/writecache.txt | 4 ++++
drivers/md/dm-writecache.c | 23 +++++++++++++----------
2 files changed, 17 insertions(+), 10 deletions(-)
--
1.8.3.1
Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
by Dan Williams
On Wed, Feb 6, 2019 at 5:57 PM Doug Ledford <dledford(a)redhat.com> wrote:
[..]
> > > > Dave, you said the FS is responsible to arbitrate access to the
> > > > physical pages..
> > > >
> > > > Is it possible to have a filesystem for DAX that is more suited to
> > > > this environment? Ie designed to not require block reallocation (no
> > > > COW, no reflinks, different approach to ftruncate, etc)
> > >
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> >
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> >
> > https://youtu.be/ywKPPIE8JfQ?t=395
>
> Thanks for the link, I'll review the panel.
>
> > Currently the only way to get this to work is to use ODP capable
> > hardware, or Device-DAX. Device-DAX is a facility to map persistent
> > memory statically through device-file. It's great for statically
> > allocated use cases, but loses all the nice things (provisioning,
> > permissions, naming) that a filesystem gives you. This debate is what
> > to do about non-ODP capable hardware and Filesystem-DAX facility. The
> > current answer is "no RDMA for you".
> >
> > > Are DAX users demanding xfs, or is it just the
> > > filesystem of convenience?
> >
> > xfs is the only Linux filesystem that supports DAX and reflink.
>
> Is it going to be clear from the link above why reflink + DAX + RDMA is
> a good/desirable thing?
>
No, unfortunately it will only clarify the DAX + RDMA use case, but
you don't need to look very far to see that the trend for storage
management is more COW / reflink / thin-provisioning etc in more
places. Users want the flexibility to be able delay, change, and
consolidate physical storage allocation decisions, otherwise
device-dax would have solved all these problems and we would not be
having this conversation.
> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question?
>
> I did a little googling and research before I asked that question.
> According to the documentation, other FSes can work with DAX too (namely
> ext2 and ext4). The question was more or less pondering whether or not
> ext2 or ext4 + RDMA + DAX would solve people's problems without the
> issues that xfs brings.
No, ext4 also supports hole punch, and the ext2 support is a toy. We
went through quite a bit of work to solve this problem for the
O_DIRECT pinned page case.
6b2bb7265f0b sched/wait: Introduce wait_var_event()
d6dc57e251a4 xfs, dax: introduce xfs_break_dax_layouts()
69eb5fa10eb2 xfs: prepare xfs_break_layouts() for another layout type
c63a8eae63d3 xfs: prepare xfs_break_layouts() to be called with
XFS_MMAPLOCK_EXCL
5fac7408d828 mm, fs, dax: handle layout changes to pinned dax mappings
b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()
430657b6be89 ext4: handle layout changes to pinned DAX mappings
cdbf8897cb09 dax: dax_layout_busy_page() warn on !exceptional
So the fs is prepared to notify RDMA applications of the need to
evacuate a mapping (layout change), and the timeout to respond to that
notification can be configured by the administrator. The debate is
about what to do when the platform owner needs to get a mapping out of
the way in bounded time.
> > This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data. I don't see it as an xfs specific problem. Rather,
> > xfs is taking the lead in this space because it has already deployed
> > and demonstrated that leases work for the pnfs4 block-server case, so
> > it seems logical to attempt to extend that case for non-ODP-RDMA.
> >
> > > Are they
> > > really trying to do COW backed mappings for the RDMA targets? Or do
> > > they want a COW backed FS but are perfectly happy if the specific RDMA
> > > targets are *not* COW and are statically allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
>
> If that's the case, then we are back to EBUSY *could* work (despite the
> objections made so far).
I linked it in my response to Jason [1], but the entire reason ext2,
ext4, and xfs scream "experimental" when DAX is enabled is because DAX
makes typical flows fail that used to work in the page-cache backed
mmap case. The failure of a data space management command like
fallocate(punch_hole) is more risky than just not allowing the memory
registration to happen in the first place. Leases result in a system
that has a chance at making forward progress.
The current state of disallowing RDMA for FS-DAX is one of the "if
(dax) goto fail;" conditions that needs to be solved before filesystem
developers graduate DAX from experimental status.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html
Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
by Dan Williams
On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe <jgg(a)ziepe.ca> wrote:
[..]
> > You're describing the current situation, i.e. Linux already implements
> > this, it's called Device-DAX and some users of RDMA find it
> > insufficient. The choices are to continue to tell them "no", or say
> > "yes, but you need to submit to lease coordination".
>
> Device-DAX is not what I'm imagining when I say XFS--.
>
> I mean more like XFS with all features that require rellocation of
> blocks disabled.
>
> Forbidding hold punch, reflink, cow, etc, doesn't devolve back to
> device-dax.
True, not all the way, but the distinction loses significance as you
lose fs features.
Filesystems mark DAX functionality experimental [1] precisely because
it forbids otherwise typical operations that work in the nominal page
cache case. An approach that says "lets cement the list of things a
filesystem or a core-memory-mangement facility can't do because RDMA
finds it awkward" is bad precedent. It's bad precedent because it
abdicates core kernel functionality to userspace and weakens the api
contract in surprising ways.
EBUSY is a horrible status code especially if an administrator is
presented with an emergency situation that a filesystem needs to free
up storage capacity and get established memory registrations out of
the way. The motivation for the current status quo of failing memory
registration for DAX mappings is to help ensure the system does not
get into this situation where forward progress cannot be guaranteed.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html
question about mmap MAP_PRIVATE on PMEM/DAX/fs files
by Larry Bassel
Is mmapping a PMEM/DAX/fs file MAP_PRIVATE supported? Is it something
that people are likely to want to do?
If it is supported, suppose I open a file on a PMEM/DAX/fs, mmap it
MAP_PRIVATE, read from the memory-mapped file (with memory accesses,
not the read syscall) and take a page fault which the kernel satisfies.
At this point, do my page tables for the privately mmapped page(s) point
to the PMEM corresponding to the file, with the kernel waiting until
the page(s) are altered (either by me or someone else) to
copy on write and give me a different page/mapping?
Or does the kernel avoid this by always mapping a copy of the
page(s) involved in the private mmap in the first place?
In either case, is my private copy going to come from PMEM or an
"ordinary" page, or is this "random"? Does the program have
any choice in this (e.g. suppose I want to make sure my copied
page is persistent)?
Thanks.
Larry
[PATCH] libnvdimm: Fix altmap reservation size calculation
by Oliver O'Halloran
Libnvdimm reserves the first 8K of pfn and devicedax namespaces to
store a superblock describing the namespace. This 8K reservation
is contained within the altmap area which the kernel uses for the
vmemmap backing for the pages within the namespace. The altmap
allows for some pages at the start of the altmap area to be reserved
and that mechanism is used to protect the superblock from being
re-used as vmemmap backing.
The number of PFNs to reserve is calculated using:
PHYS_PFN(SZ_8K)
Which is implemented as:
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))
So on systems where PAGE_SIZE is greater than 8K the reservation
size is truncated to zero and the superblock area is re-used as
vmemmap backing. As a result all the namespace information stored
in the superblock (i.e. if it's a PFN or DAX namespace) is lost
and the namespace needs to be re-created to get access to the
contents.
This patch fixes this by using PFN_UP() rather than PHYS_PFN() to ensure
that at least one page is reserved. On systems with a 4K page size this
patch should have no effect.
Cc: stable(a)vger.kernel.org
Cc: Dan Williams <dan.j.williams(a)intel.com>
Fixes: ac515c084be9 ("libnvdimm, pmem, pfn: move pfn setup to the core")
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
---
drivers/nvdimm/pfn_devs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6f22272e8d80..9b9be83da0e7 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -593,7 +593,7 @@ static unsigned long init_altmap_base(resource_size_t base)
static unsigned long init_altmap_reserve(resource_size_t base)
{
- unsigned long reserve = PHYS_PFN(SZ_8K);
+ unsigned long reserve = PFN_UP(SZ_8K);
unsigned long base_pfn = PHYS_PFN(base);
reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);
--
2.20.1