On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
>
> The tracking of dirty pages is done via the radix tree in struct
> address_space. This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries. We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Some comments below.
> ---
> fs/dax.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> include/linux/dax.h | 2 +
> mm/filemap.c | 6 ++
> 3 files changed, 196 insertions(+), 6 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b84a46..0db21ea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -24,6 +24,7 @@
> #include <linux/memcontrol.h>
> #include <linux/mm.h>
> #include <linux/mutex.h>
> +#include <linux/pagevec.h>
> #include <linux/pmem.h>
> #include <linux/sched.h>
> #include <linux/uio.h>
> @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> return 0;
> }
>
> +#define NO_SECTOR -1
> +
> +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> +                sector_t sector, bool pmd_entry, bool dirty)
> +{
> +        struct radix_tree_root *page_tree = &mapping->page_tree;
> +        int type, error = 0;
> +        void *entry;
> +
> +        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +        spin_lock_irq(&mapping->tree_lock);
> +        entry = radix_tree_lookup(page_tree, index);
> +
> +        if (entry) {
> +                type = RADIX_DAX_TYPE(entry);
> +                if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> +                                        type != RADIX_DAX_PMD)) {
> +                        error = -EIO;
> +                        goto unlock;
> +                }
> +
> +                if (!pmd_entry || type == RADIX_DAX_PMD)
> +                        goto dirty;
> +                radix_tree_delete(&mapping->page_tree, index);
> +                mapping->nrexceptional--;
In theory, you can delete a DIRTY / TOWRITE PTE entry here and insert a
clean PMD entry in its place. That will cause fsync() to miss some flushes.
So you should make sure you transfer all the tags to the new entry.
Ah, great catch, I'll address it in v9 which I'll send out tomorrow.
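Roughly what I have in mind (untested sketch; it assumes the RADIX_DAX_ENTRY()
helper this series uses to build new radix tree entries) is to read the tags
off the old PTE entry before deleting it, and re-apply them once the new PMD
entry has been inserted:

        /*
         * Untested sketch: preserve any DIRTY/TOWRITE tags when replacing
         * a PTE entry with a PMD entry, so a concurrent fsync() doesn't
         * lose track of the old dirty entry.  tree_lock is still held.
         */
        bool dirty_tag = radix_tree_tag_get(page_tree, index,
                                        PAGECACHE_TAG_DIRTY);
        bool towrite_tag = radix_tree_tag_get(page_tree, index,
                                        PAGECACHE_TAG_TOWRITE);

        radix_tree_delete(page_tree, index);
        mapping->nrexceptional--;

        error = radix_tree_insert(page_tree, index,
                        RADIX_DAX_ENTRY(sector, pmd_entry));
        if (error)
                goto unlock;
        mapping->nrexceptional++;

        if (dirty_tag)
                radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
        if (towrite_tag)
                radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_TOWRITE);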
> +static int dax_writeback_one(struct block_device *bdev,
> +                struct address_space *mapping, pgoff_t index, void *entry)
> +{
> +        struct radix_tree_root *page_tree = &mapping->page_tree;
> +        int type = RADIX_DAX_TYPE(entry);
> +        struct radix_tree_node *node;
> +        struct blk_dax_ctl dax;
> +        void **slot;
> +        int ret = 0;
> +
> +        spin_lock_irq(&mapping->tree_lock);
> +        /*
> +         * Regular page slots are stabilized by the page lock even
> +         * without the tree itself locked. These unlocked entries
> +         * need verification under the tree lock.
> +         */
> +        if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +                goto unlock;
> +        if (*slot != entry)
> +                goto unlock;
> +
> +        /* another fsync thread may have already written back this entry */
> +        if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +                goto unlock;
> +
> +        radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +
> +        if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> +                ret = -EIO;
> +                goto unlock;
> +        }
> +
> +        dax.sector = RADIX_DAX_SECTOR(entry);
> +        dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> +        spin_unlock_irq(&mapping->tree_lock);
This seems to be somewhat racy as well - if there are two fsyncs running
against the same inode, one wins the race and clears the TOWRITE tag, and
the second then bails out and may finish before the skipped page gets
flushed. So we should clear the TOWRITE tag only after the range is
flushed. This can result in some amount of duplicate flushing, but I don't
think the race will happen frequently enough in practice for that to be
performance relevant.
Yep, this makes sense. I'll also fix that in v9.
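Roughly, I'm thinking of dropping the radix_tree_tag_clear() from the locked
section above and only clearing the tag once the flush has actually
completed, along these lines (untested sketch of the tail of
dax_writeback_one()):

        wb_cache_pmem(dax.addr, dax.size);

        /*
         * Only clear TOWRITE now that the data has actually been flushed.
         * A second fsync() racing with us may redo the flush, but it can
         * no longer return before the data is durable on media.
         */
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
        spin_unlock_irq(&mapping->tree_lock);
 unmap:
        dax_unmap_atomic(bdev, &dax);
        return ret;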
And secondly: You must write-protect all mappings of the flushed range so
that you get a fault when the sector gets written to again. We spoke about
this in the past already but somehow it got lost and I forgot about it as
well. You need something like rmap_walk_file()...
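I.e. something that walks all VMAs mapping the flushed range via the
mapping's i_mmap interval tree and write-protects + cleans the PTEs. A
completely untested sketch of the idea (dax_mapping_mkclean() is just a
made-up name, and follow_pte() is currently static in mm/memory.c, so you'd
need to export it or open-code an equivalent):

static void dax_mapping_mkclean(struct address_space *mapping, pgoff_t index)
{
        struct vm_area_struct *vma;

        i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
                unsigned long address;
                spinlock_t *ptl;
                pte_t *ptep;
                pte_t pte;

                if (!(vma->vm_flags & VM_SHARED))
                        continue;

                address = vma->vm_start +
                        ((index - vma->vm_pgoff) << PAGE_SHIFT);
                if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
                        continue;

                if (pte_dirty(*ptep) || pte_write(*ptep)) {
                        /* write-protect and clean so the next write faults */
                        flush_cache_page(vma, address, pte_pfn(*ptep));
                        pte = ptep_clear_flush(vma, address, ptep);
                        pte = pte_wrprotect(pte);
                        pte = pte_mkclean(pte);
                        set_pte_at(vma->vm_mm, address, ptep, pte);
                }
                pte_unmap_unlock(ptep, ptl);
        }
        i_mmap_unlock_read(mapping);
}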
The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:
https://lkml.org/lkml/2015/11/13/759
I removed all the code that cleaned PTE entries and radix tree entries for
v3. The reason was a race between the cleaning of the PTEs and the cleaning
of the radix tree entries that I couldn't figure out how to solve.
The race goes like this:
Thread 1 (write)                        Thread 2 (fsync)
================                        ================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
                                        dax_writeback_mapping_range()
                                        dax_writeback_one()
                                        radix_tag_clear(DIRTY)
                                        pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()
After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry. This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.
The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.
I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock. I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.
The current code, which leaves the radix tree entry in place, still gives us
correct behavior - it's just less efficient because the set of dirty entries
we have to flush keeps growing.
> +        /*
> +         * We cannot hold tree_lock while calling dax_map_atomic() because it
> +         * eventually calls cond_resched().
> +         */
> +        ret = dax_map_atomic(bdev, &dax);
> +        if (ret < 0)
> +                return ret;
> +
> +        if (WARN_ON_ONCE(ret < dax.size)) {
> +                ret = -EIO;
> +                goto unmap;
> +        }
> +
> +        wb_cache_pmem(dax.addr, dax.size);
> + unmap:
> +        dax_unmap_atomic(bdev, &dax);
> +        return ret;
> +
> + unlock:
> +        spin_unlock_irq(&mapping->tree_lock);
> +        return ret;
> +}
...
> @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> * dax_pfn_mkwrite - handle first write to DAX page
> * @vma: The virtual memory area where the fault occurred
> * @vmf: The description of the fault
> - *
> */
> int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> -        struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +        struct file *file = vma->vm_file;
> 
> -        sb_start_pagefault(sb);
> -        file_update_time(vma->vm_file);
> -        sb_end_pagefault(sb);
> +        dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
Why is NO_SECTOR argument correct here?
Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty". This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write. If the fault was for a write, the radix tree entry will also
be made dirty.
For reads the radix tree entry will be inserted but left clean. When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
tree entry & set the dirty tag.
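In other words, on the pfn_mkwrite() path the body of dax_radix_entry()
effectively reduces to something like this (sketch only - the type checks
shown in the patch above are omitted):

        /* sketch: NO_SECTOR just dirties the entry inserted at fault time */
        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

        spin_lock_irq(&mapping->tree_lock);
        entry = radix_tree_lookup(&mapping->page_tree, index);
        if (entry)      /* inserted by the earlier read or write fault */
                radix_tree_tag_set(&mapping->page_tree, index,
                                PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);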
This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().