Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
by Andiry Xu
On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger <adilger(a)dilger.ca> wrote:
> On Jan 17, 2017, at 3:15 PM, Andiry Xu <andiry(a)gmail.com> wrote:
>> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma <vishal.l.verma(a)intel.com> wrote:
>>> On 01/16, Darrick J. Wong wrote:
>>>> On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
>>>>> On 01/14, Slava Dubeyko wrote:
>>>>>>
>>>>>> ---- Original Message ----
>>>>>> Subject: [LSF/MM TOPIC] Badblocks checking/representation in filesystems
>>>>>> Sent: Jan 13, 2017 1:40 PM
>>>>>> From: "Verma, Vishal L" <vishal.l.verma(a)intel.com>
>>>>>> To: lsf-pc(a)lists.linux-foundation.org
>>>>>> Cc: linux-nvdimm(a)lists.01.org, linux-block(a)vger.kernel.org, linux-fsdevel(a)vger.kernel.org
>>>>>>
>>>>>>> The current implementation of badblocks, where we consult the
>>>>>>> badblocks list for every IO in the block driver, works and is a
>>>>>>> last-option failsafe, but from a user perspective, it isn't the
>>>>>>> easiest interface to work with.
>>>>>>
>>>>>> As I remember, the FAT and HFS+ specifications contain a description of a bad
>>>>>> blocks (physical sectors) table. I believe that this table was used for the
>>>>>> case of floppy media. But, finally, this table became a completely obsolete
>>>>>> artefact because most storage devices are reliable enough. Why do you need
>>>> ext4 has a badblocks inode to own all the bad spots on disk, but ISTR it
>>>> doesn't support(??) extents or 64-bit filesystems, and might just be a
>>>> vestigial organ at this point. XFS doesn't have anything to track bad
>>>> blocks currently....
>>>>
>>>>>> to expose the bad blocks at the file system level? Do you expect that the next
>>>>>> generation of NVM memory will be so unreliable that the file system needs to manage
>>>>>> bad blocks? What about erasure coding schemes? Does the file system really need to
>>>>>> suffer from the bad block issue?
>>>>>>
>>>>>> Usually, we are using LBAs and it is the responsibility of storage device to map
>>>>>> a bad physical block/page/sector into valid one. Do you mean that we have
>>>>>> access to physical NVM memory address directly? But it looks like that we can
>>>>>> have a "bad block" issue even we will access data into page cache's memory
>>>>>> page (if we will use NVM memory for page cache, of course). So, what do you
>>>>>> imply by "bad block" issue?
>>>>>
>>>>> We don't have direct physical access to the device's address space, in
>>>>> the sense the device is still free to perform remapping of chunks of NVM
>>>>> underneath us. The problem is that when a block or address range (as
>>>>> small as a cache line) goes bad, the device maintains a poison bit for
>>>>> every affected cache line. Behind the scenes, it may have already
>>>>> remapped the range, but the cache line poison has to be kept so that
>>>>> there is a notification to the user/owner of the data that something has
>>>>> been lost. Since NVM is byte addressable memory sitting on the memory
>>>>> bus, such a poisoned cache line results in memory errors and SIGBUSes,
>>>>> as compared to traditional storage, where an app will get nice and friendly
>>>>> (relatively speaking..) -EIOs. The whole badblocks implementation was
>>>>> done so that the driver can intercept IO (i.e. reads) to _known_ bad
>>>>> locations, and short-circuit them with an EIO. If the driver doesn't
>>>>> catch these, the reads will turn into a memory bus access, and the
>>>>> poison will cause a SIGBUS.
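
(A minimal sketch of the kind of check being described, as it might look in a
pmem-style block driver's read path. badblocks_check() is the kernel's existing
badblocks helper; the surrounding function and its parameters are illustrative.)

#include <linux/badblocks.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>

/*
 * Illustrative only: before touching pmem over the memory bus, check
 * whether the sector range intersects the known-bad list and, if so,
 * short-circuit the read with -EIO instead of letting the access hit a
 * poisoned cache line and raise a machine check / SIGBUS.
 */
static int pmem_read_checked(struct badblocks *bb, void *dst,
                             void *pmem_addr, sector_t sector,
                             unsigned int nr_sectors)
{
        sector_t first_bad;
        int num_bad;

        if (badblocks_check(bb, sector, nr_sectors, &first_bad, &num_bad))
                return -EIO;            /* known-poisoned range */

        memcpy(dst, pmem_addr, nr_sectors * 512);       /* range is clear */
        return 0;
}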
>>>>
>>>> "driver" ... you mean XFS? Or do you mean the thing that makes pmem
>>>> look kind of like a traditional block device? :)
>>>
>>> Yes, the thing that makes pmem look like a block device :) --
>>> drivers/nvdimm/pmem.c
>>>
>>>>
>>>>> This effort is to try and make this badblock checking smarter - and try
>>>>> and reduce the penalty on every IO to a smaller range, which only the
>>>>> filesystem can do.
>>>>
>>>> Though... now that XFS merged the reverse mapping support, I've been
>>>> wondering if there'll be a resubmission of the device errors callback?
>>>> It still would be useful to be able to inform the user that part of
>>>> their fs has gone bad, or, better yet, if the buffer is still in memory
>>>> someplace else, just write it back out.
>>>>
>>>> Or I suppose if we had some kind of raid1 set up between memories we
>>>> could read one of the other copies and rewrite it into the failing
>>>> region immediately.
>>>
>>> Yes, that is kind of what I was hoping to accomplish via this
>>> discussion. How much would filesystems want to be involved in this sort
>>> of badblocks handling, if at all. I can refresh my patches that provide
>>> the fs notification, but that's the easy bit, and a starting point.
>>>
>>
>> I have some questions. Why does moving badblock handling to the file
>> system level avoid the checking phase? At the file system level I still
>> have to check the badblock list for each I/O, right? Or do you mean that
>> during mount it can go through the pmem device, locate all the data
>> structures mangled by badblocks, and handle them accordingly, so that
>> during normal operation the badblocks will never be accessed? Or, if there
>> is replication/snapshot support, use a copy to recover the badblocks?
>
> With ext4 badblocks, the main outcome is that the bad blocks would be
> permanently marked in the allocation bitmap as being used, and they would
> never be allocated to a file, so they should never be accessed unless
> doing a full device scan (which ext4 and e2fsck never do). That would
> avoid the need to check every I/O against the bad blocks list, if the
> driver knows that the filesystem will handle this.
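
(Illustrative sketch only, not actual ext4 code: the idea is to fold the bad
block numbers into the block allocation bitmap once, e.g. at mkfs/fsck time,
so the allocator never hands them out and no per-I/O lookup is needed.)

#include <linux/bitops.h>
#include <linux/types.h>

/*
 * Mark each bad block as allocated in the block bitmap.  From then on
 * the allocator treats those blocks as permanently in use.  The 32-bit
 * block numbers match the ext4 badblocks caveat noted below.
 */
static void reserve_bad_blocks(unsigned long *block_bitmap,
                               const u32 *bad_blocks, unsigned int count)
{
        unsigned int i;

        for (i = 0; i < count; i++)
                set_bit(bad_blocks[i], block_bitmap);
}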
>
Thank you for the explanation. However, this only works for free blocks,
right? What about allocated blocks, like file data and metadata?
Thanks,
Andiry
> The one caveat is that ext4 only allows 32-bit block numbers in the
> badblocks list, since this feature hasn't been used in a long time.
> This is good for up to 16TB filesystems (2^32 blocks of 4KB), but if there
> were a demand to use this feature again it would be possible to allow
> 64-bit block numbers.
>
> Cheers, Andreas
>
>
>
>
>
[PATCH v3 00/12] mm: sub-section memory hotplug support
by Dan Williams
Changes since v2 [1]:
1/ Fixed a bug inserting multi-order entries into pgmap_radix. The
insert index needs to be 'order' aligned.
2/ Fixed a __meminit section mismatch warning for section_activate()
3/ Forward ported to v4.10-rc4
[1]: https://lwn.net/Articles/708627/
---
The initial motivation for this change is persistent memory platforms
that, unfortunately, align the pmem range on a boundary less than a full
section (64M vs 128M), and may change the alignment from one boot to the
next. A secondary motivation is the arrival of prospective ZONE_DEVICE
users that want devm_memremap_pages() to map PCI-E device memory ranges
to enable peer-to-peer DMA. There is a range of possible physical
address alignments of PCI-E BARs that are less than 128M.
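For concreteness, a sketch of the alignment check such ranges fail today
(SECTION_SIZE_BITS is 27 on x86_64, i.e. 128M sections; the helper name is
illustrative):

#include <linux/kernel.h>
#include <linux/mmzone.h>

/*
 * With 128M sections, a pmem range or PCI-E BAR that is only 64M (or
 * less) aligned cannot be hot-added at section granularity without
 * padding -- hence the sub-section support in this series.
 */
static bool range_is_section_aligned(u64 start, u64 size)
{
        const u64 section_size = 1ULL << SECTION_SIZE_BITS;

        return IS_ALIGNED(start, section_size) &&
               IS_ALIGNED(size, section_size);
}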
Currently the libnvdimm core injects padding when 'pfn' (struct page
mapping configuration) instances are created. However, not all users of
devm_memremap_pages() have the opportunity to inject such padding. Users
of the memmap=ss!nn kernel command line option can trigger the following
failure with unaligned parameters like "memmap=0xfc000000!8G":
WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300
devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
[..]
Call Trace:
[<ffffffff814c0393>] dump_stack+0x86/0xc3
[<ffffffff810b173b>] __warn+0xcb/0xf0
[<ffffffff810b17bf>] warn_slowpath_fmt+0x5f/0x80
[<ffffffff811eb105>] devm_memremap_pages+0x3b5/0x4c0
[<ffffffffa006f308>] __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
[<ffffffffa00e231a>] pmem_attach_disk+0x19a/0x440 [nd_pmem]
Without this change a user could inadvertently lose access to nvdimm
namespaces after a configuration change. The act of adding, removing, or
rearranging DIMMs in the platform could lead to the BIOS changing the
base alignment of the namespace in an incompatible fashion. With this
support we can accommodate a BIOS changing the namespace to any
alignment provided it is >= SECTION_ACTIVE_SIZE.
In other words, we are protecting against misalignment injected by the
BIOS after the libnvdimm sub-system already recorded that the namespace
does not need alignment padding. In that case the user would need to
figure out how to undo the configuration change to regain access to
their nvdimm capacity.
---
The patches have received a build success notification from the
0day-kbuild robot across 142 configs and pass the latest libnvdimm/ndctl
unit test suite. They merge cleanly on top of current -next (test merge
with next-20170118).
---
Dan Williams (12):
mm: fix type width of section to/from pfn conversion macros
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm: introduce struct mem_section_usage to track partial population of a section
mm: introduce common definitions for the size and mask of a section
mm: cleanup sparse_init_one_section() return value
mm: track active portions of a section at boot
mm: fix register_new_memory() zone type detection
mm: convert kmalloc_section_memmap() to populate_section_memmap()
mm: prepare for hot-{add,remove} of sub-section ranges
mm: support section-unaligned ZONE_DEVICE memory ranges
mm: enable section-unaligned devm_memremap_pages()
libnvdimm, pfn, dax: stop padding pmem namespaces to section alignment
arch/x86/mm/init_64.c | 15 +
drivers/base/memory.c | 26 +-
drivers/nvdimm/pfn_devs.c | 42 +---
include/linux/memory.h | 4
include/linux/memory_hotplug.h | 6 -
include/linux/mm.h | 3
include/linux/mmzone.h | 30 ++-
kernel/memremap.c | 74 ++++---
mm/Kconfig | 1
mm/memory_hotplug.c | 95 ++++----
mm/page_alloc.c | 6 -
mm/sparse-vmemmap.c | 24 +-
mm/sparse.c | 454 +++++++++++++++++++++++++++++-----------
13 files changed, 511 insertions(+), 269 deletions(-)
[LSF/MM TOPIC] Future direction of DAX
by Ross Zwisler
This past year has seen a lot of new DAX development. We have added support
for fsync/msync, moved to the new iomap I/O data structure, introduced radix
tree based locking, re-enabled PMD support (twice!), and have fixed a bunch of
bugs.
We still have a lot of work to do, though, and I'd like to propose a discussion
around what features people would like to see enabled in the coming year as
well as what use cases their customers have that we might not be aware of.
Here are a few topics to start the conversation:
- The current plan to allow users to safely flush dirty data from userspace is
built around the PMEM_IMMUTABLE feature [1]. I'm hoping that by LSF/MM we
will have at least started work on PMEM_IMMUTABLE, but I'm guessing there
will be more to discuss.
- The DAX fsync/msync model was built for platforms that need to flush dirty
processor cache lines in order to make data durable on NVDIMMs. There exist
platforms, however, that are set up so that the processor caches are
effectively part of the ADR safe zone. This means that dirty data can be
assumed to be durable even in the processor cache, obviating the need to
manually flush the cache during fsync/msync. These platforms still need to
call fsync/msync to ensure that filesystem metadata updates are properly
written to media. Our first idea on how to properly support these platforms
would be for DAX to be made aware that in some cases it doesn't need to keep
metadata about dirty cache lines. A similar issue exists for volatile uses
of DAX such as with BRD or with PMEM and the memmap command line parameter,
and we'd like a solution that covers them all (see the flush sketch after this list).
- If I recall correctly, at one point Dave Chinner suggested that we change
DAX so that I/O would use cached stores instead of the non-temporal stores
that it currently uses. We would then track pages that were written to by
DAX in the radix tree so that they would be flushed later during
fsync/msync. Does this sound like a win? Also, assuming that we can find a
solution for platforms where the processor cache is part of the ADR safe
zone (above topic) this would be a clear improvement, moving us from using
non-temporal stores to faster cached stores with no downside (the sketch after
this list contrasts the two store styles).
- Jan suggested [2] that we could use the radix tree as a cache to service DAX
faults without needing to call into the filesystem. Are there any issues
with this approach, and should we move forward with it as an optimization?
- Whenever you mount a filesystem with DAX, it spits out a message that says
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk". What criteria
need to be met for DAX to no longer be considered experimental?
- When we msync() a huge page, if the range is less than the entire huge page,
should we flush the entire huge page and mark it clean in the radix tree, or
should we only flush the requested range and leave the radix tree entry
dirty?
- Should we enable 1 GiB huge pages in filesystem DAX? Does anyone have any
specific customer requests for this or performance data suggesting it would
be a win? If so, what work needs to be done to get 1 GiB sized and aligned
filesystem block allocations, to get the required enabling in the MM layer,
etc?
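Referring to the cache-flush and store-style items above, a rough userspace
sketch of the work in question, using the CLWB and non-temporal store
intrinsics (the kernel uses its own wrappers; the 64-byte cache-line size and
the helper names are assumptions):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64       /* assumption: 64-byte cache lines */

/*
 * Write back every cache line covering [addr, addr + len) and fence.
 * This is the per-fsync/msync work that platforms whose CPU caches sit
 * inside the ADR safe zone would not need to do at all.
 */
static void flush_pmem_range(const void *addr, size_t len)
{
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHELINE_SIZE)
                _mm_clwb((void *)p);
        _mm_sfence();
}

/* Current DAX I/O style: a non-temporal store bypasses the CPU cache,
 * so no later flush is needed, only a fence before claiming durability. */
static void store_nt(uint64_t *dst, uint64_t val)
{
        _mm_stream_si64((long long *)dst, (long long)val);
}

/* Proposed style: a plain cached store is faster, but leaves a dirty
 * line that must be tracked and flushed later via flush_pmem_range(). */
static void store_cached(uint64_t *dst, uint64_t val)
{
        *dst = val;
}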
Thanks,
- Ross
[1] https://lkml.org/lkml/2016/12/19/571
[2] https://lkml.org/lkml/2016/10/12/70
[PATCH v7 1/2] mm, dax: make pmd_fault() and friends to be the same as fault()
by Dave Jiang
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
---
drivers/dax/dax.c | 16 +++++++---------
fs/dax.c | 28 +++++++++++-----------------
fs/ext4/file.c | 9 ++++-----
fs/xfs/xfs_file.c | 10 ++++------
include/linux/dax.h | 7 +++----
include/linux/mm.h | 3 +--
include/trace/events/fs_dax.h | 15 +++++++--------
mm/memory.c | 6 ++----
8 files changed, 39 insertions(+), 55 deletions(-)
v7:
Xiong found an issue where xfstests stalls when the DAX option is turned on. It
seems the issue has to do with dax_iomap_pmd_fault(). The pgoff needs
to be linear_page_index(vma, vmf->address & PMD_MASK) in this function.
However, the pgoff passed in via the vmf is linear_page_index(vma, vmf->address),
i.e. based on the PTE and not the PMD. The vmf->pgoff needs to be what was
passed in when we return. Therefore we go back to having a local pgoff variable
in dax_iomap_pmd_fault() that is based on address & PMD_MASK.
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index ed758b7..a8833cc 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -473,10 +473,9 @@ static int dax_dev_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
- struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd,
- unsigned int flags)
+ struct vm_area_struct *vma, struct vm_fault *vmf)
{
- unsigned long pmd_addr = addr & PMD_MASK;
+ unsigned long pmd_addr = vmf->address & PMD_MASK;
struct device *dev = &dax_dev->dev;
struct dax_region *dax_region;
phys_addr_t phys;
@@ -508,23 +507,22 @@ static int __dax_dev_pmd_fault(struct dax_dev *dax_dev,
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
- return vmf_insert_pfn_pmd(vma, addr, pmd, pfn,
- flags & FAULT_FLAG_WRITE);
+ return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+ vmf->flags & FAULT_FLAG_WRITE);
}
-static int dax_dev_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, unsigned int flags)
+static int dax_dev_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int rc;
struct file *filp = vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
dev_dbg(&dax_dev->dev, "%s: %s: %s (%#lx - %#lx)\n", __func__,
- current->comm, (flags & FAULT_FLAG_WRITE)
+ current->comm, (vmf->flags & FAULT_FLAG_WRITE)
? "write" : "read", vma->vm_start, vma->vm_end);
rcu_read_lock();
- rc = __dax_dev_pmd_fault(dax_dev, vma, addr, pmd, flags);
+ rc = __dax_dev_pmd_fault(dax_dev, vma, vmf);
rcu_read_unlock();
return rc;
diff --git a/fs/dax.c b/fs/dax.c
index 4671530..5bed549 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1332,18 +1332,17 @@ static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
return VM_FAULT_FALLBACK;
}
-int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ struct iomap_ops *ops)
{
struct address_space *mapping = vma->vm_file->f_mapping;
- unsigned long pmd_addr = address & PMD_MASK;
- bool write = flags & FAULT_FLAG_WRITE;
+ unsigned long pmd_addr = vmf->address & PMD_MASK;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
unsigned int iomap_flags = (write ? IOMAP_WRITE : 0) | IOMAP_FAULT;
struct inode *inode = mapping->host;
int result = VM_FAULT_FALLBACK;
struct iomap iomap = { 0 };
pgoff_t max_pgoff, pgoff;
- struct vm_fault vmf;
void *entry;
loff_t pos;
int error;
@@ -1356,7 +1355,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
pgoff = linear_page_index(vma, pmd_addr);
max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
- trace_dax_pmd_fault(inode, vma, address, flags, pgoff, max_pgoff, 0);
+ trace_dax_pmd_fault(inode, vma, vmf, max_pgoff, 0);
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
@@ -1400,21 +1399,17 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if (IS_ERR(entry))
goto finish_iomap;
- vmf.pgoff = pgoff;
- vmf.flags = flags;
- vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
-
switch (iomap.type) {
case IOMAP_MAPPED:
- result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
- &iomap, pos, write, &entry);
+ result = dax_pmd_insert_mapping(vma, vmf->pmd, vmf,
+ vmf->address, &iomap, pos, write, &entry);
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (WARN_ON_ONCE(write))
goto unlock_entry;
- result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
- &entry);
+ result = dax_pmd_load_hole(vma, vmf->pmd, vmf, vmf->address,
+ &iomap, &entry);
break;
default:
WARN_ON_ONCE(1);
@@ -1440,12 +1435,11 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}
fallback:
if (result == VM_FAULT_FALLBACK) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, vmf->pmd, vmf->address);
count_vm_event(THP_FAULT_FALLBACK);
}
out:
- trace_dax_pmd_fault_done(inode, vma, address, flags, pgoff, max_pgoff,
- result);
+ trace_dax_pmd_fault_done(inode, vma, vmf, max_pgoff, result);
return result;
}
EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d663d3d..10b64ba 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -275,21 +275,20 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return result;
}
-static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, unsigned int flags)
+static int
+ext4_dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
struct inode *inode = file_inode(vma->vm_file);
struct super_block *sb = inode->i_sb;
- bool write = flags & FAULT_FLAG_WRITE;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
}
down_read(&EXT4_I(inode)->i_mmap_sem);
- result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
- &ext4_iomap_ops);
+ result = dax_iomap_pmd_fault(vma, vmf, &ext4_iomap_ops);
up_read(&EXT4_I(inode)->i_mmap_sem);
if (write)
sb_end_pagefault(sb);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0f77519..332e194 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1432,9 +1432,7 @@ xfs_filemap_fault(
STATIC int
xfs_filemap_pmd_fault(
struct vm_area_struct *vma,
- unsigned long addr,
- pmd_t *pmd,
- unsigned int flags)
+ struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
struct xfs_inode *ip = XFS_I(inode);
@@ -1445,16 +1443,16 @@ xfs_filemap_pmd_fault(
trace_xfs_filemap_pmd_fault(ip);
- if (flags & FAULT_FLAG_WRITE) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
}
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
+ ret = dax_iomap_pmd_fault(vma, vmf, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- if (flags & FAULT_FLAG_WRITE)
+ if (vmf->flags & FAULT_FLAG_WRITE)
sb_end_pagefault(inode->i_sb);
return ret;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 24ad711..a829fee 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -71,16 +71,15 @@ static inline unsigned int dax_radix_order(void *entry)
return PMD_SHIFT - PAGE_SHIFT;
return 0;
}
-int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ struct iomap_ops *ops);
#else
static inline unsigned int dax_radix_order(void *entry)
{
return 0;
}
static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags,
- struct iomap_ops *ops)
+ struct vm_fault *vmf, struct iomap_ops *ops)
{
return VM_FAULT_FALLBACK;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aaf4c96..1c96128 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -347,8 +347,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
int (*mremap)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
- int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
- pmd_t *, unsigned int flags);
+ int (*pmd_fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
void (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index c3b0aae..a98665b 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -8,9 +8,8 @@
DECLARE_EVENT_CLASS(dax_pmd_fault_class,
TP_PROTO(struct inode *inode, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags, pgoff_t pgoff,
- pgoff_t max_pgoff, int result),
- TP_ARGS(inode, vma, address, flags, pgoff, max_pgoff, result),
+ struct vm_fault *vmf, pgoff_t max_pgoff, int result),
+ TP_ARGS(inode, vma, vmf, max_pgoff, result),
TP_STRUCT__entry(
__field(unsigned long, ino)
__field(unsigned long, vm_start)
@@ -29,9 +28,9 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
__entry->vm_start = vma->vm_start;
__entry->vm_end = vma->vm_end;
__entry->vm_flags = vma->vm_flags;
- __entry->address = address;
- __entry->flags = flags;
- __entry->pgoff = pgoff;
+ __entry->address = vmf->address;
+ __entry->flags = vmf->flags;
+ __entry->pgoff = vmf->pgoff;
__entry->max_pgoff = max_pgoff;
__entry->result = result;
),
@@ -54,9 +53,9 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
#define DEFINE_PMD_FAULT_EVENT(name) \
DEFINE_EVENT(dax_pmd_fault_class, name, \
TP_PROTO(struct inode *inode, struct vm_area_struct *vma, \
- unsigned long address, unsigned int flags, pgoff_t pgoff, \
+ struct vm_fault *vmf, \
pgoff_t max_pgoff, int result), \
- TP_ARGS(inode, vma, address, flags, pgoff, max_pgoff, result))
+ TP_ARGS(inode, vma, vmf, max_pgoff, result))
DEFINE_PMD_FAULT_EVENT(dax_pmd_fault);
DEFINE_PMD_FAULT_EVENT(dax_pmd_fault_done);
diff --git a/mm/memory.c b/mm/memory.c
index 03691d6..e648c62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3475,8 +3475,7 @@ static int create_huge_pmd(struct vm_fault *vmf)
if (vma_is_anonymous(vma))
return do_huge_pmd_anonymous_page(vmf);
if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, vmf->address, vmf->pmd,
- vmf->flags);
+ return vma->vm_ops->pmd_fault(vma, vmf);
return VM_FAULT_FALLBACK;
}
@@ -3485,8 +3484,7 @@ static int wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
if (vma_is_anonymous(vmf->vma))
return do_huge_pmd_wp_page(vmf, orig_pmd);
if (vmf->vma->vm_ops->pmd_fault)
- return vmf->vma->vm_ops->pmd_fault(vmf->vma, vmf->address,
- vmf->pmd, vmf->flags);
+ return vmf->vma->vm_ops->pmd_fault(vmf->vma, vmf);
/* COW handled on pte level: split pmd */
VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
Re: linux-next 0112 tree breaks fs DAX
by Xiong Zhou
On Fri, Jan 13, 2017 at 06:16:41PM +0800, Xiong Zhou wrote:
> Hi,
>
> These cases "hang" when testing with -o dax mount option:
> xfstests generic/030 generic/34{0,4,5,6} generic/198
> (maybe more)
>
> The test program holetest or aiodio keeps running for a
> long time. It's killable, but it seems to never return.
>
> With both ext4 and xfs as fs.
>
> Without -o dax mount option, cases pass in seconds.
>
> 0111 tree passed the tests.
>
> sh-4.2# git log --oneline next-20170111..next-20170112 fs/dax.c
> 0c9a7909dd13 mm, dax: change pmd_fault() to take only vmf parameter
> f8dbc198d4ea mm, dax: make pmd_fault() and friends be the same as fault()
The 0112 tree with the above 2 commits reverted passes the tests.
[LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page
by Dan Williams
Back when we were first attempting to support DMA for DAX mappings of
persistent memory the plan was to forgo 'struct page' completely and
develop a pfn-to-scatterlist capability for the dma-mapping-api. That
effort died in this thread:
https://lkml.org/lkml/2015/8/14/3
...where we learned that the dependencies on struct page for dma
mapping are deeper than a PFN_PHYS() conversion for some
architectures. That was the moment we pivoted to ZONE_DEVICE and
arranged for a 'struct page' to be available for any persistent memory
range that needs to be the target of DMA. ZONE_DEVICE enables any
device-driver that can target "System RAM" to also be able to target
persistent memory through a DAX mapping.
Since that time the "page-less" DAX path has continued to mature [1]
without growing new dependencies on struct page, but at the same time
continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
Peer-to-peer DMA appears to be evolving from a niche embedded use case
to something general purpose platforms will need to comprehend. The
"map_peer_resource" [2] approach looks to be headed to the same
destination as the pfn-to-scatterlist effort. It's difficult to avoid
'struct page' for describing DMA operations without custom driver
code.
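A short sketch of why struct page is hard to avoid: the standard scatterlist
helpers that feed the dma-mapping-api take a struct page, so every pfn in a
mapped range has to resolve through pfn_to_page() (the wrapper function here
is illustrative):

#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * Describe one physical range for DMA.  sg_set_page() is the standard
 * entry point, and it requires a valid struct page -- i.e. pfn_to_page()
 * must work for every pfn in the range being mapped.
 */
static void sg_from_pfn(struct scatterlist *sg, unsigned long pfn,
                        unsigned int len, unsigned int offset)
{
        sg_init_table(sg, 1);
        sg_set_page(sg, pfn_to_page(pfn), len, offset);
}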
With that background, a statement and a question to discuss at LSF/MM:
General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
requires pfn_to_page() support across the entire physical address
range mapped.
Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
collides with platform alignment assumptions [3], and if there's a
wider effort to rework memory hotplug [4] it seems DMA support should
be part of the discussion.
---
This topic focuses on the mechanism to enable pfn_to_page() for an
arbitrary physical address range, and the proposed peer-to-peer DMA
topic [5] touches on the userspace presentation of this mechanism. It
might be good to combine these topics if there's interest? In any
event, I'm interested in both, as well as Michal's concern about memory
hotplug in general.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-November/007672.html
[2]: http://www.spinics.net/lists/linux-pci/msg44560.html
[3]: https://lkml.org/lkml/2016/12/1/740
[4]: http://www.spinics.net/lists/linux-mm/msg119369.html
[5]: http://marc.info/?l=linux-mm&m=148156541804940&w=2