[PATCH] tools/testing/nvdimm: fix SIGTERM vs hotplug crash
by Dan Williams
The unit tests crash when hotplug races the previous probe. This race
requires that the loading of the nfit_test module be terminated with
SIGTERM, and that the module be unloaded while the ARS scan is still
running.
In contrast to the normal nfit driver, the unit test calls
acpi_nfit_init() twice to simulate hotplug, whereas the nominal case
goes through the acpi_nfit_notify() event handler. The
acpi_nfit_notify() path is careful to flush the previous region
registration before servicing the hotplug event. The unit test was
missing this guarantee.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810cdce7>] pwq_activate_delayed_work+0x47/0x170
[..]
Call Trace:
[<ffffffff810ce186>] pwq_dec_nr_in_flight+0x66/0xa0
[<ffffffff810ce490>] process_one_work+0x2d0/0x680
[<ffffffff810ce331>] ? process_one_work+0x171/0x680
[<ffffffff810ce88e>] worker_thread+0x4e/0x480
[<ffffffff810ce840>] ? process_one_work+0x680/0x680
[<ffffffff810ce840>] ? process_one_work+0x680/0x680
[<ffffffff810d5343>] kthread+0xf3/0x110
[<ffffffff8199846f>] ret_from_fork+0x1f/0x40
[<ffffffff810d5250>] ? kthread_create_on_node+0x230/0x230
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
tools/testing/nvdimm/test/nfit.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 5404efa578a3..dd48f421844c 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -13,6 +13,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/platform_device.h>
#include <linux/dma-mapping.h>
+#include <linux/workqueue.h>
#include <linux/libnvdimm.h>
#include <linux/vmalloc.h>
#include <linux/device.h>
@@ -1474,6 +1475,7 @@ static int nfit_test_probe(struct platform_device *pdev)
if (nfit_test->setup != nfit_test0_setup)
return 0;
+ flush_work(&acpi_desc->work);
nfit_test->setup_hotplug = 1;
nfit_test->setup(nfit_test);
[PATCH 1/3] radix-tree: 'slot' can be NULL in radix_tree_next_slot()
by Ross Zwisler
There are four cases I can see where we could end up with a NULL 'slot' in
radix_tree_next_slot(). Yet radix_tree_next_slot() never actually checks
whether 'slot' is NULL. It just happens that for the cases where 'slot' is
NULL, some other combination of factors prevents us from dereferencing it.
It would be very easy for someone to unwittingly change one of these
factors without realizing that we are implicitly depending on it to save us
from a NULL pointer dereference.
So, explicitly account for the fact that that 'slot' can be NULL in
radix_tree_next_slot() and save ourselves from future crashes and debugging
efforts.
Here are details on the four cases:
1) radix_tree_iter_retry() via a non-tagged iteration like
radix_tree_for_each_slot(). In this case we currently aren't seeing a bug
because radix_tree_iter_retry() sets
iter->next_index = iter->index;
which means that in the else case in radix_tree_next_slot(), 'count' is
zero, so we skip over the while() loop and effectively just return NULL
without ever dereferencing 'slot'.
2) radix_tree_iter_retry() via tagged iteration like
radix_tree_for_each_tagged(). This case was giving us NULL pointer
dereferences in testing, and was fixed with this commit:
commit 3cb9185c6730 ("radix-tree: fix radix_tree_iter_retry() for tagged
iterators.")
This fix doesn't explicitly check for 'slot' being NULL, though, it works
around the NULL pointer dereference by instead zeroing iter->tags in
radix_tree_iter_retry(), which makes us bail out of the if() case in
radix_tree_next_slot() before we dereference 'slot'.
3) radix_tree_iter_next() via a non-tagged iteration like
radix_tree_for_each_slot(). This currently happens in shmem_tag_pins()
and shmem_partial_swap_usage().
As in case 1, 'count' in the else case of
radix_tree_next_slot() is zero, so we skip over the while() loop and
effectively just return NULL without ever dereferencing 'slot'.
4) radix_tree_iter_next() via tagged iteration like
radix_tree_for_each_tagged(). This happens in shmem_wait_for_pins().
radix_tree_iter_next() zeros out iter->tags, so we end up exiting
radix_tree_next_slot() here:
if (flags & RADIX_TREE_ITER_TAGGED) {
void *canon = slot;
iter->tags >>= 1;
if (unlikely(!iter->tags))
return NULL;
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
include/linux/radix-tree.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4c45105..1bf16ed 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -465,6 +465,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
static __always_inline void **
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
+ if (unlikely(!slot))
+ return NULL;
+
if (flags & RADIX_TREE_ITER_TAGGED) {
void *canon = slot;
--
2.9.0
[PATCH 0/15 v2] dax: Clear dirty bits after flushing caches
by Jan Kara
Hello,
this is a second revision of my patches to clear dirty bits from radix tree of
DAX inodes when caches for corresponding pfns have been flushed. This patch set
is significantly larger than the previous version because I'm changing how
->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
fault so that we don't have to leak details about DAX locking into the generic
code. In principle, these patches enable handlers to easily update PTEs and do
other work necessary to finish the fault without duplicating the functionality
present in the generic code. I'd be really interested in feedback from mm
folks whether such changes to fault handling code are fine or what they'd do
differently...
Changes since v1:
* make sure all PTE updates happen under radix tree entry lock to protect
against races between faults & write-protecting code
* remove information about DAX locking from mm/memory.c
* smaller updates based on Ross' feedback
----
Background information regarding the motivation:
Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) flushes all the dirty pfns again and again. This patch set
implements clearing of the dirty tag in the radix tree so that we issue a
flush only when needed.
The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page fault and clearing of the dirty tag
and write-protecting PTEs (so that we get another pagefault when pfn is written
to again and we have to set the dirty tag again).
The effect of the patch set is easily visible:
Writing 1 GB of data via mmap, then fsync twice.
Before this patch set both fsyncs take ~205 ms on my test machine, after the
patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
clearing dirty bits etc. is very noticeable), the second fsync takes below
1 us.
As a bonus, these patches make filesystem freezing for DAX filesystems
reliable because mappings are now properly writeprotected while freezing the
fs.
Patches have passed xfstests for both xfs and ext4.
Honza
Subtle races between DAX mmap fault and write path
by Jan Kara
Hi,
when testing my latest changes to DAX fault handling code I have hit the
following interesting race between the fault and the write path (I'll show
function names for ext4 but xfs has the same issue AFAICT).
We have a file 'f' which has a hole at offset 0.
Process 0                               Process 1

data = mmap('f');
read data[0]
 -> fault, we map a hole page
                                        pwrite('f', buf, len, 0)
                                         -> ext4_file_write_iter
                                            inode_lock(inode);
                                            __generic_file_write_iter()
                                             generic_file_direct_write()
                                              invalidate_inode_pages2_range()
                                               - drops hole page from
                                                 the radix tree
                                             ext4_direct_IO()
                                              dax_do_io()
                                               - allocates block for
                                                 offset 0
data[0] = 1
 -> page_mkwrite fault
  -> ext4_dax_fault()
     down_read(&EXT4_I(inode)->i_mmap_sem);
     __dax_fault()
      grab_mapping_entry()
       - creates locked radix tree entry
      - maps block into PTE
      put_locked_mapping_entry()
                                        invalidate_inode_pages2_range()
                                         - removes dax entry from
                                           the radix tree
So we have just lost the information that block 0 is mapped and needs its
caches flushed.
Also the fact that the consistency of data as viewed by mmap and
dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
unexpected to me and we should document it somewhere.
The question is how to best fix this. I see three options:
1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
harsh but should work - we call filemap_write_and_wait() in
generic_file_direct_write() so we flush out all caches for the relevant
area before dropping radix tree entries.
2) Call filemap_write_and_wait() after we return from ->direct_IO before we
call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
for those two calls. Lock hold time will be shorter than 1) but it will
require additional flush and we'd probably have to stop using
generic_file_direct_write() for DAX writes to allow for all this special
hackery.
3) Remodel dax_do_io() to work more like buffered IO and use radix tree
entry locks to protect against similar races. That has likely better
scalability than 1) but may be actually slower in the uncontended case (due
to all the radix tree operations).
Any opinions on this?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
[PATCH] libnvdimm, nd_blk: mask off reserved status bits
by Ross Zwisler
The "NVDIMM Block Window Driver Writer's Guide":
http://pmem.io/documents/
http://pmem.io/documents/NVDIMM_DriverWritersGuide-July-2016.pdf
defines the layout of the block window status register. For the July 2016
version of the spec linked to above, this happens in Figure 4 on page 26.
The only bits defined in this spec are bits 31, 5, 4, 2, 1 and 0. The rest
of the bits in the status register are reserved, and there is a warning
following the diagram that says:
Note: The driver cannot assume the value of the RESERVED bits in the
status register are zero. These reserved bits need to be masked off, and
the driver must avoid checking the state of those bits.
This change ensures that for hardware implementations that set these
reserved bits in the status register, the driver won't incorrectly fail the
block I/Os.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: stable@vger.kernel.org
---
drivers/acpi/nfit.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 1f0e060..375c10f 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1396,11 +1396,12 @@ static u32 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
{
struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+ const u32 STATUS_MASK = 0x80000037;
if (mmio->num_lines)
offset = to_interleave_offset(offset, mmio);
- return readl(mmio->addr.base + offset);
+ return readl(mmio->addr.base + offset) & STATUS_MASK;
}
static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
--
2.9.0
[ndctl PATCH v3] ndctl, create-namespace: trap and report invalid sector-size values
by Dan Williams
While the kernel will prevent invalid configurations, the user
experience of a kernel error message and disabling of the namespace is
too harsh. Trap and report these attempts to make create-namespace more
user friendly.
# ndctl create-namespace -e namespace0.0 -m sector -l 513 -f -v
validate_namespace_options:437: region0: does not support btt sector_size 513
failed to reconfigure namespace
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Change in v3: fix a compile warning
ndctl/builtin-xaction-namespace.c | 51 +++++++++++++++++++++++++++++++------
1 file changed, 43 insertions(+), 8 deletions(-)
diff --git a/ndctl/builtin-xaction-namespace.c b/ndctl/builtin-xaction-namespace.c
index 8304203750d8..9c7a64ef38d9 100644
--- a/ndctl/builtin-xaction-namespace.c
+++ b/ndctl/builtin-xaction-namespace.c
@@ -210,8 +210,13 @@ static int set_defaults(enum namespace_action mode)
error("'pmem' namespaces do not support setting 'sector size'\n");
rc = -EINVAL;
}
- } else if (!param.reconfig)
- param.sector_size = "4096";
+ } else if (!param.reconfig
+ && ((param.type && strcmp(param.type, "blk") == 0)
+ || (param.mode
+ && strcmp(param.mode, "safe") == 0))) {
+ /* default sector size for blk-type or safe-mode */
+ param.sector_size = "4096";
+ }
return rc;
}
@@ -415,14 +420,44 @@ static int validate_namespace_options(struct ndctl_region *region,
p->mode = ndctl_namespace_get_mode(ndns);
if (param.sector_size) {
- p->sector_size = parse_size64(param.sector_size);
+ struct ndctl_btt *btt;
+ int num, i;
- if (ndns && p->mode != NDCTL_NS_MODE_SAFE
- && ndctl_namespace_get_type(ndns)
- == ND_DEVICE_NAMESPACE_PMEM) {
- debug("%s: does not support sector_size modification\n",
- ndctl_namespace_get_devname(ndns));
+ p->sector_size = parse_size64(param.sector_size);
+ btt = ndctl_region_get_btt_seed(region);
+ if (p->mode == NDCTL_NS_MODE_SAFE) {
+ if (!btt) {
+ debug("%s: does not support 'sector' mode\n",
+ ndctl_region_get_devname(region));
+ return -EINVAL;
+ }
+ num = ndctl_btt_get_num_sector_sizes(btt);
+ for (i = 0; i < num; i++)
+ if (ndctl_btt_get_supported_sector_size(btt, i)
+ == p->sector_size)
+ break;
+ if (i >= num) {
+ debug("%s: does not support btt sector_size %lu\n",
+ ndctl_region_get_devname(region),
+ p->sector_size);
+ return -EINVAL;
+ }
+ } else {
+ struct ndctl_namespace *seed = ndns;
+
+ if (!seed)
+ seed = ndctl_region_get_namespace_seed(region);
+ num = ndctl_namespace_get_num_sector_sizes(seed);
+ for (i = 0; i < num; i++)
+ if (ndctl_namespace_get_supported_sector_size(seed, i)
+ == p->sector_size)
+ break;
+ if (i >= num) {
+ debug("%s: does not support namespace sector_size %lu\n",
+ ndctl_region_get_devname(region),
+ p->sector_size);
return -EINVAL;
+ }
}
} else if (ndns) {
struct ndctl_btt *btt = ndctl_namespace_get_btt(ndns);