[PATCH] dax: add a 'modalias' attribute to DAX 'bus' devices
by Vishal Verma
Add a 'modalias' attribute to devices under the DAX bus so that userspace
is able to dynamically load modules as needed. The modalias itself already
existed; only the sysfs attribute to export it was missing.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
drivers/dax/bus.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 28c3324271ac..2109cfe80219 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -295,6 +295,17 @@ static ssize_t target_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(target_node);
+static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ /*
+ * We only ever expect to handle device-dax instances, i.e. the
+ * @type argument to MODULE_ALIAS_DAX_DEVICE() is always zero
+ */
+ return sprintf(buf, DAX_DEVICE_MODALIAS_FMT "\n", 0);
+}
+static DEVICE_ATTR_RO(modalias);
+
static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
{
struct device *dev = container_of(kobj, struct device, kobj);
@@ -306,6 +317,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
}
static struct attribute *dev_dax_attributes[] = {
+ &dev_attr_modalias.attr,
&dev_attr_size.attr,
&dev_attr_target_node.attr,
NULL,
--
2.20.1
[RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
by Darrick J. Wong
Hi all!
Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
on pmem, and they've observed that one has to do a fair amount of
legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem,
so the PMD mappings are much more efficient.
I started poking around w.r.t. what mkfs.xfs was doing and realized that
if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
set up all the parameters automatically. Below is my ham-handed attempt
to teach the kernel to do this.
Comments, flames, "WTF is this guy smoking?" are all welcome. :)
--D
---
Configure pmem devices to advertise the default page alignment when said
block device supports fsdax. Certain filesystems use these iomin/ioopt
hints to try to create aligned file extents, which makes it much easier
for mmaps to take advantage of huge page table entries.
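As a sanity check, a small userspace sketch that reads back the hints this
patch sets; the queue sysfs attributes are standard block-layer files, but
the device name "pmem0" is an assumption:

#include <stdio.h>

/* Sketch: print the io_min/io_opt hints for a pmem block device so one
 * can confirm they report 2M (PFN_DEFAULT_ALIGNMENT) after this patch.
 */
static long read_queue_attr(const char *attr)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/pmem0/queue/%s", attr);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	printf("minimum_io_size: %ld\n", read_queue_attr("minimum_io_size"));
	printf("optimal_io_size: %ld\n", read_queue_attr("optimal_io_size"));
	return 0;
}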
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
drivers/nvdimm/pmem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..3eeb9dd117d5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_logical_block_size(q, pmem_sector_size(ndns));
blk_queue_max_hw_sectors(q, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
- if (pmem->pfn_flags & PFN_MAP)
+ if (pmem->pfn_flags & PFN_MAP) {
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+ blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
+ blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
+ }
q->queuedata = pmem;
disk = alloc_disk_node(0, nid);
[PATCH 0/7] libnvdimm/pfn: Fix section-alignment padding
by Dan Williams
Lately Linux has encountered platforms where Persistent Memory regions
collide with each other, specifically cases where ->start_pad needed to
be non-zero. This led to commit ae86cbfef381 "libnvdimm, pfn: Pad
pfn namespaces relative to other regions". That commit allowed
namespaces to be mapped with devm_memremap_pages(). However, dax
operations on those configurations currently fail if attempted within the
->start_pad range, because pmem_device->data_offset was still relative to
the raw resource base, not to the section-aligned resource range mapped by
devm_memremap_pages().
Luckily __bdev_dax_supported() caught these failures and simply disabled
dax. However, to fix this situation, a non-backwards-compatible change
needs to be made to the interpretation of the nd_pfn info-block:
->start_pad needs to be accounted in ->map.map_offset (formerly
->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be
adjusted to the section-aligned resource base used to establish
->map.map (formerly ->virt_addr).
See patch 7 "libnvdimm/pfn: Fix 'start_pad' implementation" for more
details, and the ndctl patch series "Improve support + testing for
labels + info-blocks" for the corresponding regression test.
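As a rough illustration of the re-interpretation (field names follow the
cover letter; this is a conceptual sketch, not code from the series --
patch 7 has the authoritative version):

/* Conceptual sketch only. With the new interpretation, dax translation
 * works from the section-aligned base that devm_memremap_pages()
 * actually mapped, with map_offset accounting for start_pad as well as
 * the metadata reservation.
 */
struct pfn_map_info {
	unsigned long long map_base;	/* section-aligned resource base */
	unsigned long long map_offset;	/* start_pad + metadata offset */
};

static unsigned long long sketch_pgoff_to_phys(const struct pfn_map_info *mi,
		unsigned long long pgoff, unsigned long long page_size)
{
	return mi->map_base + mi->map_offset + pgoff * page_size;
}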
---
Dan Williams (7):
libnvdimm/pfn: Account for PAGE_SIZE > info-block-size in nd_pfn_init()
libnvdimm/pmem: Honor force_raw for legacy pmem regions
dax: Check the end of the block-device capacity with dax_direct_access()
libnvdimm/pfn: Introduce super-block minimum version requirements
libnvdimm/pfn: Remove dax_label_reserve
libnvdimm/pfn: Introduce 'struct pfn_map_info'
libnvdimm/pfn: Fix 'start_pad' implementation
drivers/dax/pmem.c | 9 +-
drivers/dax/super.c | 39 ++++++--
drivers/nvdimm/namespace_devs.c | 4 +
drivers/nvdimm/nd.h | 15 +++
drivers/nvdimm/pfn.h | 4 +
drivers/nvdimm/pfn_devs.c | 181 ++++++++++++++++++++++++++++-----------
drivers/nvdimm/pmem.c | 111 +++++++++++-------------
drivers/nvdimm/pmem.h | 12 ---
tools/testing/nvdimm/pmem-dax.c | 15 ++-
9 files changed, 244 insertions(+), 146 deletions(-)
question about page tables in DAX/FS/PMEM case
by Larry Bassel
I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
If multiple processes were to use the identical page of PMDs corresponding
to a 1 GiB address range of DAX/XFS/PMEM, presumably one could, instead
of populating a new PUD, just atomically increment a refcount and point
to the same PUD from the next level above.
i.e.
OLD:
process 1:
VA -> levels of page tables -> PUD1 -> page of PMDs1
process 2:
VA -> levels of page tables -> PUD2 -> page of PMDs2
NEW:
process 1:
VA -> levels of page tables -> PUD1 -> page of PMDs1
process 2:
VA -> levels of page tables -> PUD1 -> page of PMDs1 (refcount 2)
There are several cases to consider:
1. New mapping
OLD:
make a new PUD, populate the associated page of PMDs
(at least partially) with PMD entries.
NEW:
same
2. A process creates a mapping identical (same VA->PA, size, protections,
etc.) to one that already exists
OLD:
make a new PUD, populate the associated page of PMDs
(at least partially) with PMD entries.
NEW:
use the same PUD and increase the refcount (potentially even if this
mapping is private, in which case there may eventually be a
copy-on-write -- see #5 below)
3. Unmapping of a mapping which is the same as that from another process
OLD:
destroy the process's copy of mapping, free PUD, etc.
NEW:
decrease the refcount; only if it is now 0 do we destroy the mapping, etc.
4. Unmapping of a mapping which is unique (refcount 1)
OLD:
destroy the process's copy of mapping, free PUD, etc.
NEW:
same
5. Mapping was private (but same as another process), process writes
OLD:
break the PMD into PTEs, destroy the PMD mapping, free the PUD, etc.
NEW:
decrease the refcount; only if it is now 0 do we destroy the mapping, etc.
We still break the PMD into PTEs.
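To make the refcount lifecycle above concrete, here is a self-contained
toy model in plain userspace C. It is purely illustrative; all names are
hypothetical and nothing here is existing kernel API:

#include <stdatomic.h>
#include <stdlib.h>

/* Toy model: a shared "page of PMDs" with a refcount, illustrating
 * cases #1-#4 above.
 */
struct pmd_page {
	atomic_int refcount;
	/* 512 PMD entries would live here in the real thing */
};

/* cases #1/#2: first mapper allocates, later identical mappers share */
static struct pmd_page *pmd_page_get(struct pmd_page *shared)
{
	if (shared) {			/* case #2: identical mapping exists */
		atomic_fetch_add(&shared->refcount, 1);
		return shared;
	}
	shared = calloc(1, sizeof(*shared));	/* case #1: new mapping */
	atomic_init(&shared->refcount, 1);
	return shared;
}

/* cases #3/#4: only the last unmapper (refcount hits 0) tears down */
static void pmd_page_put(struct pmd_page *p)
{
	if (atomic_fetch_sub(&p->refcount, 1) == 1)
		free(p);
}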
If I have an mmap of a DAX/FS/PMEM file and I take a page fault (either
PTE- or PMD-sized) on access to this file, the page table(s) are set up
in dax_iomap_fault() in fs/dax.c (correct?).
If the process later munmaps this file or exits, but there are still
other users of the shared page of PMDs, I would need to detect that this
has happened and act accordingly (#3 above).
Where will these page table entries be torn down?
In the same code where any other page table is torn down?
If this is the case, what would be the cleanest way of telling that
these page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping (look
at the physical address pointed to?) so that I can do the right thing
here?
I understand that I may have missed something obvious here.
Thanks.
Larry
[PATCH v3 1/2] nfit, mce: only handle uncorrectable machine checks
by Vishal Verma
The mce handler for 'nfit' devices is called for memory errors on a
Non-Volatile DIMM, and adds the error location to a 'badblocks' list.
This list is used by the various NVDIMM drivers to avoid consuming known
poison locations during IO.
The mce handler gets called for both corrected and uncorrectable errors.
Until now, both kinds of errors have been added to the badblocks list.
However, corrected memory errors indicate that the problem has already
been fixed by hardware, and the resulting interrupt is merely a
notification to Linux. As far as future accesses are concerned, the
location is perfectly fine to use, and thus doesn't need to be included
in the badblocks list.
Add a check in the nfit mce handler to filter out corrected mce events,
and only process uncorrectable errors.
Reported-by: Omar Avelar <omar.avelar@intel.com>
Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
Cc: stable@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
arch/x86/include/asm/mce.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 3 ++-
drivers/acpi/nfit/mce.c | 4 ++--
3 files changed, 5 insertions(+), 3 deletions(-)
v3: Unchanged from v2
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 3a17107594c8..3111b3cee2ee 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -216,6 +216,7 @@ static inline int umc_normaddr_to_sysaddr(u64 norm_addr, u16 nid, u8 umc, u64 *s
int mce_available(struct cpuinfo_x86 *c);
bool mce_is_memory_error(struct mce *m);
+bool mce_is_correctable(struct mce *m);
DECLARE_PER_CPU(unsigned, mce_exception_count);
DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 953b3ce92dcc..27015948bc41 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -534,7 +534,7 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);
-static bool mce_is_correctable(struct mce *m)
+bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
return false;
@@ -544,6 +544,7 @@ static bool mce_is_correctable(struct mce *m)
return true;
}
+EXPORT_SYMBOL_GPL(mce_is_correctable);
static bool cec_add_mce(struct mce *m)
{
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index e9626bf6ca29..7a51707f87e9 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -25,8 +25,8 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
struct acpi_nfit_desc *acpi_desc;
struct nfit_spa *nfit_spa;
- /* We only care about memory errors */
- if (!mce_is_memory_error(mce))
+ /* We only care about uncorrectable memory errors */
+ if (!mce_is_memory_error(mce) || mce_is_correctable(mce))
return NOTIFY_DONE;
/*
--
2.17.1
[PATCH] libnvdimm, region: use struct_size() in kzalloc()
by Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
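For reference, a rough userspace approximation of what the helper guards
against (the real struct_size() lives in include/linux/overflow.h; this
sketch only mirrors its saturate-on-overflow behavior):

#include <stdint.h>
#include <stddef.h>

/* Sketch: saturating size calculation in the spirit of struct_size().
 * On overflow it returns SIZE_MAX, so a subsequent allocation fails
 * instead of silently returning a too-small buffer.
 */
static size_t struct_size_sketch(size_t base, size_t elem_size, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(count, elem_size, &bytes) ||
	    __builtin_add_overflow(bytes, base, &bytes))
		return SIZE_MAX;
	return bytes;
}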
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
drivers/nvdimm/region_devs.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e2818f94f292..d36cb5df9683 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1020,10 +1020,9 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
}
region_buf = ndbr;
} else {
- nd_region = kzalloc(sizeof(struct nd_region)
- + sizeof(struct nd_mapping)
- * ndr_desc->num_mappings,
- GFP_KERNEL);
+ nd_region = kzalloc(struct_size(nd_region, mapping,
+ ndr_desc->num_mappings),
+ GFP_KERNEL);
region_buf = nd_region;
}
--
2.20.1
[PATCH] acpi/nfit: Fix bus command validation
by Dan Williams
Commit 11189c1089da "acpi/nfit: Fix command-supported detection" broke
ND_CMD_CALL for bus-level commands. The "func = cmd" assumption is only
valid for:
ND_CMD_ARS_CAP
ND_CMD_ARS_START
ND_CMD_ARS_STATUS
ND_CMD_CLEAR_ERROR
The function number otherwise needs to be pulled from the command
payload for:
NFIT_CMD_TRANSLATE_SPA
NFIT_CMD_ARS_INJECT_SET
NFIT_CMD_ARS_INJECT_CLEAR
NFIT_CMD_ARS_INJECT_GET
Update cmd_to_func() for the bus case and call it in the common path.
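To illustrate where the function number lives for those bus commands, a
hedged sketch of how userspace builds an ND_CMD_CALL payload (struct
nd_cmd_pkg is the ndctl UAPI passthrough envelope; the particular
function value is illustrative):

#include <string.h>
#include <linux/ndctl.h>	/* struct nd_cmd_pkg */

/* Sketch: for the bus-level passthrough commands listed above, the
 * function number travels in the payload, so cmd_to_func() must read
 * call_pkg->nd_command instead of assuming func == cmd.
 */
static void fill_bus_call(struct nd_cmd_pkg *pkg, unsigned long long func)
{
	memset(pkg, 0, sizeof(*pkg));
	/* e.g. func = NFIT_CMD_TRANSLATE_SPA from the list above */
	pkg->nd_command = func;
}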
Fixes: 11189c1089da ("acpi/nfit: Fix command-supported detection")
Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.verma@intel.com>
Reported-by: Grzegorz Burzynski <grzegorz.burzynski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/acpi/nfit/core.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index e18ade5d74e9..c34c595d6bb0 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -415,7 +415,7 @@ static int cmd_to_func(struct nfit_mem *nfit_mem, unsigned int cmd,
if (call_pkg) {
int i;
- if (nfit_mem->family != call_pkg->nd_family)
+ if (nfit_mem && nfit_mem->family != call_pkg->nd_family)
return -ENOTTY;
for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++)
@@ -424,6 +424,10 @@ static int cmd_to_func(struct nfit_mem *nfit_mem, unsigned int cmd,
return call_pkg->nd_command;
}
+ /* In the !call_pkg case, bus commands == bus functions */
+ if (!nfit_mem)
+ return cmd;
+
/* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */
if (nfit_mem->family == NVDIMM_FAMILY_INTEL)
return cmd;
@@ -454,17 +458,18 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
if (cmd_rc)
*cmd_rc = -EINVAL;
+ if (cmd == ND_CMD_CALL)
+ call_pkg = buf;
+ func = cmd_to_func(nfit_mem, cmd, call_pkg);
+ if (func < 0)
+ return func;
+
if (nvdimm) {
struct acpi_device *adev = nfit_mem->adev;
if (!adev)
return -ENOTTY;
- if (cmd == ND_CMD_CALL)
- call_pkg = buf;
- func = cmd_to_func(nfit_mem, cmd, call_pkg);
- if (func < 0)
- return func;
dimm_name = nvdimm_name(nvdimm);
cmd_name = nvdimm_cmd_name(cmd);
cmd_mask = nvdimm_cmd_mask(nvdimm);
@@ -475,12 +480,9 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
} else {
struct acpi_device *adev = to_acpi_dev(acpi_desc);
- func = cmd;
cmd_name = nvdimm_bus_cmd_name(cmd);
cmd_mask = nd_desc->cmd_mask;
- dsm_mask = cmd_mask;
- if (cmd == ND_CMD_CALL)
- dsm_mask = nd_desc->bus_dsm_mask;
+ dsm_mask = nd_desc->bus_dsm_mask;
desc = nd_cmd_bus_desc(cmd);
guid = to_nfit_uuid(NFIT_DEV_BUS);
handle = adev->handle;
[PATCH] device-dax: Add a 'target_node' attribute
by Dan Williams
The target-node attribute is the Linux numa-node that a device-dax
instance may create when it is onlined. Prior to being online, the
device's 'numa_node' property reflects the closest online cpu node, which
is the typical expectation for a device's 'numa_node'. Once onlined, the
instance becomes its own distinct numa node, i.e. 'target_node'.
Export the 'target_node' property to give userspace tooling the ability
to predict the effective numa-node from a device-dax instance configured
to provide 'System RAM' capacity.
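A small sketch of the kind of userspace consumption this enables; the
device name "dax0.0" is an assumption:

#include <stdio.h>

/* Sketch: predict which numa node a device-dax instance will become
 * when onlined as System RAM, by reading the new attribute.
 */
int main(void)
{
	int node = -1;
	FILE *f = fopen("/sys/bus/dax/devices/dax0.0/target_node", "r");

	if (f) {
		if (fscanf(f, "%d", &node) != 1)
			node = -1;
		fclose(f);
	}
	printf("target_node: %d\n", node);
	return 0;
}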
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/dax/bus.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index a410154d75fb..28c3324271ac 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -279,13 +279,41 @@ static ssize_t size_show(struct device *dev,
}
static DEVICE_ATTR_RO(size);
+static int dev_dax_target_node(struct dev_dax *dev_dax)
+{
+ struct dax_region *dax_region = dev_dax->region;
+
+ return dax_region->target_node;
+}
+
+static ssize_t target_node_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ return sprintf(buf, "%d\n", dev_dax_target_node(dev_dax));
+}
+static DEVICE_ATTR_RO(target_node);
+
+static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+ struct device *dev = container_of(kobj, struct device, kobj);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0)
+ return 0;
+ return a->mode;
+}
+
static struct attribute *dev_dax_attributes[] = {
&dev_attr_size.attr,
+ &dev_attr_target_node.attr,
NULL,
};
static const struct attribute_group dev_dax_attribute_group = {
.attrs = dev_dax_attributes,
+ .is_visible = dev_dax_visible,
};
static const struct attribute_group *dax_attribute_groups[] = {