KASAN vs ZONE_DEVICE (was: Re: [PATCH v2 2/7] dax: change bdev_dax_supported()...)
by Dan Williams
On Mon, Jun 4, 2018 at 8:32 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
> [ adding KASAN devs...]
>
> On Mon, Jun 4, 2018 at 4:40 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
>> On Sun, Jun 3, 2018 at 6:48 PM, Dan Williams <dan.j.williams(a)intel.com> wrote:
>>> On Sun, Jun 3, 2018 at 5:25 PM, Dave Chinner <david(a)fromorbit.com> wrote:
>>>> On Mon, Jun 04, 2018 at 08:20:38AM +1000, Dave Chinner wrote:
>>>>> On Thu, May 31, 2018 at 09:02:52PM -0700, Dan Williams wrote:
>>>>> > On Thu, May 31, 2018 at 7:24 PM, Dave Chinner <david(a)fromorbit.com> wrote:
>>>>> > > On Thu, May 31, 2018 at 06:57:33PM -0700, Dan Williams wrote:
>>>>> > >> > FWIW, XFS+DAX used to just work on this setup (I hadn't even
>>>>> > >> > installed ndctl until this morning!) but after changing the kernel
>>>>> > >> > it no longer works. That would make it a regression, yes?
>>>>>
>>>>> [....]
>>>>>
>>>>> > >> I suspect your kernel does not have CONFIG_ZONE_DEVICE enabled which
>>>>> > >> has the following dependencies:
>>>>> > >>
>>>>> > >> depends on MEMORY_HOTPLUG
>>>>> > >> depends on MEMORY_HOTREMOVE
>>>>> > >> depends on SPARSEMEM_VMEMMAP
>>>>> > >
>>>>> > > Filesystem DAX now has a dependency on memory hotplug?
>>>>>
>>>>> [....]
>>>>>
>>>>> > > OK, works now I've found the magic config incantations to turn
>>>>> > > everything I now need on.
>>>>>
>>>>> By enabling these options, my test VM now has a ~30s pause in the
>>>>> boot very soon after the nvdimm subsystem is initialised.
>>>>>
>>>>> [ 1.523718] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
>>>>> [ 1.550353] 00:05: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
>>>>> [ 1.552175] Non-volatile memory driver v1.3
>>>>> [ 2.332045] tsc: Refined TSC clocksource calibration: 2199.909 MHz
>>>>> [ 2.333280] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb5dcd4620, max_idle_ns: 440795264143 ns
>>>>> [ 37.217453] brd: module loaded
>>>>> [ 37.225423] loop: module loaded
>>>>> [ 37.228441] virtio_blk virtio2: [vda] 10485760 512-byte logical blocks (5.37 GB/5.00 GiB)
>>>>> [ 37.245418] virtio_blk virtio3: [vdb] 146800640 512-byte logical blocks (75.2 GB/70.0 GiB)
>>>>> [ 37.255794] virtio_blk virtio4: [vdc] 1073741824000 512-byte logical blocks (550 TB/500 TiB)
>>>>> [ 37.265403] nd_pmem namespace1.0: unable to guarantee persistence of writes
>>>>> [ 37.265618] nd_pmem namespace0.0: unable to guarantee persistence of writes
>>>>>
>>>>> The system does not appear to be consuming CPU, but it is blocking
>>>>> NMIs so I can't get a CPU trace. For a VM that I rely on booting in
>>>>> a few seconds because I reboot it tens of times a day, this is a
>>>>> problem....
>>>>
>>>> And when I turn on KASAN, the kernel fails to boot to a login prompt
>>>> because:
>>>
>>> What's your qemu and kernel command line? I'll take a look at this first
>>> thing tomorrow.
>>
>> I was able to reproduce this crash by just turning on KASAN...
>> investigating. It would still help to have your config for our own
>> regression testing purposes; it makes sense for us to prioritize
>> "Dave's test config", similar to the priority of not breaking Linus'
>> laptop.
>
> I believe this is a bug in KASAN, or a bug in devm_memremap_pages(),
> depending on your point of view. At the very least it is a mismatch of
> assumptions. KASAN learns of hot added memory via the memory hotplug
> notifier. However, the devm_memremap_pages() implementation is
> intentionally limited to the "first half" of the memory hotplug
> procedure. I.e. it does just enough to set up the linear map for
> pfn_to_page() and initialize the "struct page" memmap, but then stops
> short of onlining the pages. This is why we are getting a NULL ptr
> deref and not a KASAN report, because KASAN has no shadow area setup
> for the linearly mapped pmem range.
>
> In terms of solving it we could refactor kasan_mem_notifier() so that
> devm_memremap_pages() can call it outside of the notifier... I'll give
> this a shot.
Well, the attached patch got me slightly further, but only slightly...
[ 14.998394] BUG: KASAN: unknown-crash in pmem_do_bvec+0x19e/0x790 [nd_pmem]
[ 15.000006] Read of size 4096 at addr ffff880200000000 by task systemd-udevd/915
[ 15.001991]
[ 15.002590] CPU: 15 PID: 915 Comm: systemd-udevd Tainted: G OE 4.17.0-rc5+ #1982
[ 15.004783] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[ 15.007652] Call Trace:
[ 15.008339] dump_stack+0x9a/0xeb
[ 15.009344] print_address_description+0x73/0x280
[ 15.010524] kasan_report+0x258/0x380
[ 15.011528] ? pmem_do_bvec+0x19e/0x790 [nd_pmem]
[ 15.012747] memcpy+0x1f/0x50
[ 15.013659] pmem_do_bvec+0x19e/0x790 [nd_pmem]
...I've exhausted my limited kasan internals knowledge; any ideas what
it's missing?
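A minimal sketch of the refactor idea described above: devm_memremap_pages() would populate the KASAN shadow for the newly mapped range itself, rather than relying on the memory-hotplug onlining path it deliberately skips. kasan_populate_shadow_for_range() below is a hypothetical helper, not an existing kernel API; it stands in for whatever directly callable routine kasan_mem_notifier() would be refactored into.

#include <linux/ioport.h>
#include <linux/mm.h>

/*
 * Hypothetical sketch only: kasan_populate_shadow_for_range() represents
 * the MEM_GOING_ONLINE half of kasan_mem_notifier() made directly
 * callable, so the linearly mapped pmem range has a shadow before any
 * instrumented access touches it.
 */
static int memremap_pages_kasan_init(struct resource *res)
{
	unsigned long start = (unsigned long)__va(res->start);
	unsigned long size = resource_size(res);

	/* would be called from devm_memremap_pages() after arch_add_memory() */
	return kasan_populate_shadow_for_range(start, size);
}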
[ndctl PATCH v3] ndctl, list: display the 'map' location in listings
by Vishal Verma
For 'fsdax' and 'devdax' namespaces, a 'map' location may be specified
for page-structure storage. This can be 'mem', for system RAM, or 'dev',
for using pmem itself as the backing storage. Once set, there was no way,
using ndctl, to tell which of the two locations a namespace was
configured for. Add this to util_namespace_to_json so that all
namespace listings contain the map location (an illustrative example
follows the patch below).
Reported-by: "Yigal Korman" <yigal.korman(a)netapp.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
util/json.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
v3: Further clean up by only setting 'loc' in the switch statement, and
creating the 'jobj' for 'map' later (Dan)
diff --git a/util/json.c b/util/json.c
index c606e1c..94ed948 100644
--- a/util/json.c
+++ b/util/json.c
@@ -666,7 +666,13 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
unsigned long flags)
{
struct json_object *jndns = json_object_new_object();
+ enum ndctl_pfn_loc loc = NDCTL_PFN_LOC_NONE;
struct json_object *jobj, *jbbs = NULL;
+ const char *locations[] = {
+ [NDCTL_PFN_LOC_NONE] = "none",
+ [NDCTL_PFN_LOC_RAM] = "mem",
+ [NDCTL_PFN_LOC_PMEM] = "dev",
+ };
unsigned long long size = ULLONG_MAX;
unsigned int sector_size = UINT_MAX;
enum ndctl_namespace_mode mode;
@@ -693,10 +699,13 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
mode = ndctl_namespace_get_mode(ndns);
switch (mode) {
case NDCTL_NS_MODE_MEMORY:
- if (pfn) /* dynamic memory mode */
+ if (pfn) { /* dynamic memory mode */
size = ndctl_pfn_get_size(pfn);
- else /* native/static memory mode */
+ loc = ndctl_pfn_get_location(pfn);
+ } else { /* native/static memory mode */
size = ndctl_namespace_get_size(ndns);
+ loc = NDCTL_PFN_LOC_RAM;
+ }
jobj = json_object_new_string("fsdax");
break;
case NDCTL_NS_MODE_DAX:
@@ -704,6 +713,7 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
goto err;
size = ndctl_dax_get_size(dax);
jobj = json_object_new_string("devdax");
+ loc = ndctl_dax_get_location(dax);
break;
case NDCTL_NS_MODE_SAFE:
if (!btt)
@@ -721,6 +731,12 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
if (jobj)
json_object_object_add(jndns, "mode", jobj);
+ if ((mode != NDCTL_NS_MODE_SAFE) && (mode != NDCTL_NS_MODE_RAW)) {
+ jobj = json_object_new_string(locations[loc]);
+ if (jobj)
+ json_object_object_add(jndns, "map", jobj);
+ }
+
if (size < ULLONG_MAX) {
jobj = util_json_object_size(size, flags);
if (jobj)
--
2.17.0
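An illustrative example of what a namespace listing could look like with this change applied; the device name and size below are made up, and only the 'map' field is what this patch adds:

# ndctl list
[
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":8589934592,
    "blockdev":"pmem0"
  }
]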
[ndctl PATCH v2] ndctl, list: display the 'map' location in listings
by Vishal Verma
For 'fsdax' and 'devdax' namespaces, a 'map' location may be specified
for page-structure storage. This can be 'mem', for system RAM, or 'dev',
for using pmem itself as the backing storage. Once set, there was no way,
using ndctl, to tell which of the two locations a namespace was
configured for. Add this to util_namespace_to_json so that all
namespace listings contain the map location.
Reported-by: "Yigal Korman" <yigal.korman(a)netapp.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
util/json.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
v2: Also account for memmap=ss!nn or legacy-e820 namespaces. (Dan)
diff --git a/util/json.c b/util/json.c
index c606e1c..b020300 100644
--- a/util/json.c
+++ b/util/json.c
@@ -667,11 +667,17 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
{
struct json_object *jndns = json_object_new_object();
struct json_object *jobj, *jbbs = NULL;
+ const char *locations[] = {
+ [NDCTL_PFN_LOC_NONE] = "none",
+ [NDCTL_PFN_LOC_RAM] = "mem",
+ [NDCTL_PFN_LOC_PMEM] = "dev",
+ };
unsigned long long size = ULLONG_MAX;
unsigned int sector_size = UINT_MAX;
enum ndctl_namespace_mode mode;
const char *bdev = NULL, *name;
unsigned int bb_count = 0;
+ enum ndctl_pfn_loc loc;
struct ndctl_btt *btt;
struct ndctl_pfn *pfn;
struct ndctl_dax *dax;
@@ -693,33 +699,49 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
mode = ndctl_namespace_get_mode(ndns);
switch (mode) {
case NDCTL_NS_MODE_MEMORY:
- if (pfn) /* dynamic memory mode */
+ jobj = json_object_new_string("fsdax");
+ if (jobj)
+ json_object_object_add(jndns, "mode", jobj);
+ loc = ndctl_pfn_get_location(pfn);
+ if (pfn) { /* dynamic memory mode */
size = ndctl_pfn_get_size(pfn);
- else /* native/static memory mode */
+ jobj = json_object_new_string(locations[loc]);
+ } else { /* native/static memory mode */
size = ndctl_namespace_get_size(ndns);
- jobj = json_object_new_string("fsdax");
+ jobj = json_object_new_string("mem");
+ }
+ if (jobj)
+ json_object_object_add(jndns, "map", jobj);
break;
case NDCTL_NS_MODE_DAX:
if (!dax)
goto err;
size = ndctl_dax_get_size(dax);
jobj = json_object_new_string("devdax");
+ if (jobj)
+ json_object_object_add(jndns, "mode", jobj);
+ loc = ndctl_dax_get_location(dax);
+ jobj = json_object_new_string(locations[loc]);
+ if (jobj)
+ json_object_object_add(jndns, "map", jobj);
break;
case NDCTL_NS_MODE_SAFE:
if (!btt)
goto err;
jobj = json_object_new_string("sector");
+ if (jobj)
+ json_object_object_add(jndns, "mode", jobj);
size = ndctl_btt_get_size(btt);
break;
case NDCTL_NS_MODE_RAW:
size = ndctl_namespace_get_size(ndns);
jobj = json_object_new_string("raw");
+ if (jobj)
+ json_object_object_add(jndns, "mode", jobj);
break;
default:
jobj = NULL;
}
- if (jobj)
- json_object_object_add(jndns, "mode", jobj);
if (size < ULLONG_MAX) {
jobj = util_json_object_size(size, flags);
--
2.17.0
[RFC v2 0/2] kvm "fake DAX" device flushing
by Pankaj Gupta
This is RFC v2 of the 'fake DAX' flushing interface, shared for
review. This patchset has two main parts:
- Guest virtio-pmem driver
The guest driver reads the persistent memory range from the paravirt
device and registers it with 'nvdimm_bus'. The 'nvdimm/pmem'
driver uses this information to allocate the persistent
memory range. We have also implemented the guest side of the
VIRTIO flushing interface (see the illustrative sketch at the
end of this cover letter).
- Qemu virtio-pmem device
It exposes a persistent memory range to the KVM guest which
on the host side is file-backed memory and works as a persistent
memory device. In addition to this it provides virtio device
handling of the flushing interface. The KVM guest performs a
Qemu-side asynchronous sync using this interface.
Changes from previous RFC[1]:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Details of the project idea for the 'fake DAX' flushing interface
are shared in [2] & [3].
Pankaj Gupta (2):
Add virtio-pmem guest driver
pmem: device flush over VIRTIO
[1] https://marc.info/?l=linux-mm&m=150782346802290&w=2
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html
drivers/nvdimm/region_devs.c | 7 ++
drivers/virtio/Kconfig | 12 +++
drivers/virtio/Makefile | 1
drivers/virtio/virtio_pmem.c | 118 +++++++++++++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 4 +
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 58 +++++++++++++++++++
7 files changed, 201 insertions(+)
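As referenced above, a purely illustrative sketch of the guest-side flush path: the pmem driver hands a request to the virtio-pmem device and waits for the host to complete an asynchronous sync of the backing file. The request layout and names below (virtio_pmem_request, virtio_pmem_flush) are hypothetical and are not the interface these patches define.

#include <linux/completion.h>
#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* hypothetical request; the device completes host_ack from its virtqueue callback */
struct virtio_pmem_request {
	struct completion host_ack;
	int ret;
};

static int virtio_pmem_flush(struct virtqueue *vq)
{
	struct virtio_pmem_request req;
	struct scatterlist sg;

	init_completion(&req.host_ack);
	sg_init_one(&sg, &req, sizeof(req));

	/* queue the flush request for the host and notify it */
	if (virtqueue_add_outbuf(vq, &sg, 1, &req, GFP_KERNEL) < 0)
		return -EIO;
	virtqueue_kick(vq);

	/* the host performs the equivalent of fsync() on the backing file */
	wait_for_completion(&req.host_ack);
	return req.ret;
}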
[PATCH] acpi, nfit: Remove ecc_unit_size
by Dan Williams
The "Clear Error Unit" may be smaller than the ECC unit size on some
devices. For example, poison may be tracked at 64-byte alignment even
though the ECC unit is larger. Unless / until the ACPI specification
provides an unambiguous way to communicate this property, do not expose
it to userspace.
Software that had been using this property must already be prepared for
the case where the property is not provided on older kernels, so it is
safe to remove this attribute.
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
Documentation/ABI/removed/sysfs-bus-nfit | 17 +++++++++++++++++
Documentation/ABI/testing/sysfs-bus-nfit | 19 -------------------
drivers/acpi/nfit/core.c | 11 -----------
3 files changed, 17 insertions(+), 30 deletions(-)
create mode 100644 Documentation/ABI/removed/sysfs-bus-nfit
diff --git a/Documentation/ABI/removed/sysfs-bus-nfit b/Documentation/ABI/removed/sysfs-bus-nfit
new file mode 100644
index 000000000000..ae8c1ca53828
--- /dev/null
+++ b/Documentation/ABI/removed/sysfs-bus-nfit
@@ -0,0 +1,17 @@
+What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
+Date: Aug, 2017
+KernelVersion: v4.14 (Removed v4.18)
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Size of a write request to a DIMM that will not incur a
+ read-modify-write cycle at the memory controller.
+
+ When the nfit driver initializes it runs an ARS (Address Range
+ Scrub) operation across every pmem range. Part of that process
+ involves determining the ARS capabilities of a given address
+ range. One of the capabilities that is reported is the 'Clear
+ Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
+ section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
+ This property indicates the boundary at which the NVDIMM may
+ need to perform read-modify-write cycles to maintain ECC (Error
+ Correcting Code) blocks.
diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit
index 619eb8ca0f99..a1cb44dcb908 100644
--- a/Documentation/ABI/testing/sysfs-bus-nfit
+++ b/Documentation/ABI/testing/sysfs-bus-nfit
@@ -212,22 +212,3 @@ Description:
range. Used by NVDIMM Region Mapping Structure to uniquely refer
to this structure. Value of 0 is reserved and not used as an
index.
-
-
-What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
-Date: Aug, 2017
-KernelVersion: v4.14
-Contact: linux-nvdimm(a)lists.01.org
-Description:
- (RO) Size of a write request to a DIMM that will not incur a
- read-modify-write cycle at the memory controller.
-
- When the nfit driver initializes it runs an ARS (Address Range
- Scrub) operation across every pmem range. Part of that process
- involves determining the ARS capabilities of a given address
- range. One of the capabilities that is reported is the 'Clear
- Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
- section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
- This property indicates the boundary at which the NVDIMM may
- need to perform read-modify-write cycles to maintain ECC (Error
- Correcting Code) blocks.
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index e2235ed3e4be..b87252bf4571 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1978,19 +1978,8 @@ static ssize_t range_index_show(struct device *dev,
}
static DEVICE_ATTR_RO(range_index);
-static ssize_t ecc_unit_size_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct nd_region *nd_region = to_nd_region(dev);
- struct nfit_spa *nfit_spa = nd_region_provider_data(nd_region);
-
- return sprintf(buf, "%d\n", nfit_spa->clear_err_unit);
-}
-static DEVICE_ATTR_RO(ecc_unit_size);
-
static struct attribute *acpi_nfit_region_attributes[] = {
&dev_attr_range_index.attr,
- &dev_attr_ecc_unit_size.attr,
NULL,
};
Re: [patch 4/4] dm-writecache: use new API for flushing
by Mike Snitzer
On Fri, May 25 2018 at 2:17am -0400,
Mikulas Patocka <mpatocka(a)redhat.com> wrote:
>
>
> On Thu, 24 May 2018, Dan Williams wrote:
>
> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka(a)redhat.com> wrote:
> > > Use new API for flushing persistent memory.
> > >
> > > The problem is this:
> > > * on X86-64, non-temporal stores have the best performance
> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> > > should flush cache as late as possible, because it performs better this
> > > way.
> > >
> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > > data persistently, all three functions must be called.
> > >
> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > >
> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > > pmem_commit is wmb.
> > >
> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > > pmem_commit is empty.
> >
> > I don't want to grow driver-local wrappers for pmem. You should use
> > memcpy_flushcache directly() and if an architecture does not define
> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> > see a need to add a standalone flush operation if all relevant archs
> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> > directly since all archs define it. Alternatively we could introduce
> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> > routine and memcpy_flushcache() would imply a wmb().
>
> But memcpy_flushcache() on ARM64 is slow.
Yes, Dan, can you please take some time to fully appreciate what this
small wrapper API is providing (maybe you've done that, but your recent
reply is mixed-message)? It seems you're keeping the innovation it
provides at arm's length. Basically the PMEM APIs you've helped
construct are lacking; forcing DM developers to own fixing them is an
inversion that only serves to stonewall.
Please, less time on stonewalling and more time lifting this wrapper
API; otherwise the dm-writecache local wrapper API is a near-term means
to an end.
I revised the dm-writecache patch header yesterday, please review:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.gi...
(the last paragraph in particular; I intend to move forward with this
wrapper unless someone on the PMEM side of the house steps up and lifts
it up between now and when the 4.18 merge window opens)
dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.
The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.
The following describes the approach used to provide the most efficient
flushing of persistent memory on X86_64 vs ARM64:
* On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster
than cached writes and cache flushing.
* The ARM64 architecture doesn't have non-temporal stores. So,
memcpy_flushcache on ARM does memcpy (that writes data to the cache)
and then flushes the cache. But this eager cache flushing is slower
than late cache flushing.
The optimal code sequence on ARM to write to persistent memory is to
call memcpy, then do something else, and then call arch_wb_cache_pmem as
late as possible. And this ARM-optimized code sequence is just horribly
slow on X86.
This issue can't be "fixed" in ARM-specific source code. The ARM
processor has characteristics such that eager cache flushing is slower
than late cache flushing - and that's it - you can't change processor
behavior.
We introduce a wrapper API for flushing persistent memory with functions
pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently,
all three functions must be called.
The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
(unlike pmem_memcpy) guarantees that 8-byte values are written
atomically.
On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
pmem_commit is wmb.
On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
pmem_commit is empty.
It is clear that this wrapper API for flushing persistent memory needs
to be elevated out of this dm-writecache driver. But that can happen
later without requiring DM developers to blaze new trails on pmem
specific implementation details/quirks (pmem developers need to clean up
their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API
and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take
into account the duality of the different programming models needed to
achieve optimal cross-architecture use of persistent memory).
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <msnitzer(a)redhat.com>
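To make the X86_64/ARM64 mapping in the header above concrete, here is a rough sketch of how the wrappers could be spelled. This is a paraphrase of the description, not the actual dm-writecache code:

#ifdef CONFIG_X86_64
/* non-temporal stores: no per-range flush needed, only a final fence */
#define pmem_memcpy(dst, src, n)	memcpy_flushcache((dst), (src), (n))
#define pmem_flush(addr, n)		do { } while (0)
#define pmem_commit()			wmb()
#else /* e.g. ARM64: plain copy now, write back the cache as late as possible */
#define pmem_memcpy(dst, src, n)	memcpy((dst), (src), (n))
#define pmem_flush(addr, n)		arch_wb_cache_pmem((addr), (n))
#define pmem_commit()			do { } while (0)
#endif

/*
 * pmem_assign: route scalar assignment through the copy routine so that
 * 8-byte values are stored atomically, per the description above.
 */
#define pmem_assign(dest, src)					\
	do {							\
		typeof(dest) __uniq = (src);			\
		pmem_memcpy(&(dest), &__uniq, sizeof(dest));	\
	} while (0)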
[PATCH] libnvdimm, e820: Register all pmem resources
by Dan Williams
There is currently a mismatch between the resources that will trigger
the e820_pmem driver to register/load and the resources that will
actually be surfaced as pmem ranges. register_e820_pmem() uses
walk_iomem_res_desc() which includes children and siblings. In contrast,
e820_pmem_probe() only considers top-level resources. For example, the
following resource tree results in the driver being loaded, but no
resources being registered:
398000000000-39bfffffffff : PCI Bus 0000:ae
39be00000000-39bf07ffffff : PCI Bus 0000:af
39be00000000-39beffffffff : 0000:af:00.0
39be10000000-39beffffffff : Persistent Memory (legacy)
Fix this up to allow definitions of "legacy" pmem ranges anywhere in
system-physical address space. Not that it is recommended or safe to
define a pmem range in PCI space, but it is useful for debug /
experimentation, and the restriction on being a top-level resource was
arbitrary.
Cc: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/nvdimm/e820.c | 41 ++++++++++++++++++++++-------------------
kernel/resource.c | 1 +
2 files changed, 23 insertions(+), 19 deletions(-)
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index 6f9a6ffd7cde..521eaf53a52a 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -38,12 +38,27 @@ static int e820_range_to_nid(resource_size_t addr)
}
#endif
+static int e820_register_one(struct resource *res, void *data)
+{
+ struct nd_region_desc ndr_desc;
+ struct nvdimm_bus *nvdimm_bus = data;
+
+ memset(&ndr_desc, 0, sizeof(ndr_desc));
+ ndr_desc.res = res;
+ ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
+ ndr_desc.numa_node = e820_range_to_nid(res->start);
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+ if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+ return -ENXIO;
+ return 0;
+}
+
static int e820_pmem_probe(struct platform_device *pdev)
{
static struct nvdimm_bus_descriptor nd_desc;
struct device *dev = &pdev->dev;
struct nvdimm_bus *nvdimm_bus;
- struct resource *p;
+ int rc = -ENXIO;
nd_desc.attr_groups = e820_pmem_attribute_groups;
nd_desc.provider_name = "e820";
@@ -53,27 +68,15 @@ static int e820_pmem_probe(struct platform_device *pdev)
goto err;
platform_set_drvdata(pdev, nvdimm_bus);
- for (p = iomem_resource.child; p ; p = p->sibling) {
- struct nd_region_desc ndr_desc;
-
- if (p->desc != IORES_DESC_PERSISTENT_MEMORY_LEGACY)
- continue;
-
- memset(&ndr_desc, 0, sizeof(ndr_desc));
- ndr_desc.res = p;
- ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
- ndr_desc.numa_node = e820_range_to_nid(p->start);
- set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
- if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
- goto err;
- }
-
+ rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
+ IORESOURCE_MEM, 0, -1, nvdimm_bus, e820_register_one);
+ if (rc)
+ goto err;
return 0;
-
- err:
+err:
nvdimm_bus_unregister(nvdimm_bus);
dev_err(dev, "failed to register legacy persistent memory ranges\n");
- return -ENXIO;
+ return rc;
}
static struct platform_driver e820_pmem_driver = {
diff --git a/kernel/resource.c b/kernel/resource.c
index 2af6c03858b9..b85f59e8a4b8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -448,6 +448,7 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
return __walk_iomem_res_desc(&res, desc, false, arg, func);
}
+EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
/*
* This function calls the @func callback against all memory ranges of type