On Sun, Oct 1, 2017 at 12:57 AM, Christoph Hellwig <hch(a)lst.de> wrote:
> While this looks like a really nice cleanup of the code and removes
> nasty race conditions I'd like to understand the tradeoffs.
> This now requires every dax device that is used with a file system
> to have a struct page backing, which not only means we'd break
> existing setups, but is also a sharp turn from previous policy.
> Unless I misremember it was you Intel guys that heavily pushed for
> the page-less version, so I'd like to understand why you've changed

Sure, here's a quick recap of the story so far of how we got here:
* In support of page-less I/O operations envisioned by Matthew I
introduced pfn_t as a proposal for converting the block layer and
other sub-systems to use pfns instead of pages. You helped out on
that patch set with some work on the DMA api.
* The DMA api conversion effort came to a halt when it came time to
touch sparc paths and DaveM said: "Generally speaking, I think
that all actual physical memory the kernel operates on should have a
struct page backing it."
* ZONE_DEVICE was created to solve the DMA problem, and developing /
testing it turned up plenty of proof for Dave's assertion (no fork,
no ptrace, etc). We should have made the switch to require struct page
at that point, but I was persuaded by the argument that changing the
dax policy may break existing assumptions, and that there were larger
issues to go solve at the time.
What changed recently were the discussions around what the dax mount
option means and the assertion that we can, in general, make some
policy changes on our way to removing the "experimental" designation
from filesystem-dax. It is clear that the page-less dax path remains
experimental given all the ways it fails in several kernel paths, and
there have been no patches for several months to revive the effort.
Meanwhile the page-less path continues to generate maintenance
overhead. The recent gymnastics (new ->post_mmap file_operation) to
make sure ->vm_flags are safely manipulated when dynamically changing
the dax mode of a file were the final straw for me to pull the trigger
on this series.
In terms of what breaks by changing this policy it should be noted
that we automatically create pages for "legacy" pmem devices, and the
default for "ndctl create-namespace" is to allocate pages. I have yet
to see a bug report where someone was surprised by fork failing or
direct-I/O causing a SIGBUS. So I think the defaults are working, and it
is unlikely that there are environments dependent on page-less
That said, I now recall that dax also replaced xip for some setups. I
think we have a couple options here: let embedded configurations
override the page requirement since they can reasonably claim not to
care about the several broken general purpose paths that need pages,
or perhaps follow in the footsteps of what Nicolas is doing for cramfs
where he calls dax "overkill" for his use case.