Good morning, Dave,
Dave Chinner <david(a)fromorbit.com> writes:
> On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> > Jeff Moyer <jmoyer(a)redhat.com> writes:
> >
> > >> The big issue we have right now is that we haven't made the DAX/pmem
> > >> infrastructure work correctly and reliably for general use. Hence
> > >> adding new APIs to work around cases where we haven't yet provided
> > >> correct behaviour, let alone optimised for performance, is, quite
> > >> frankly, a clear case of premature optimisation.
> > >
> > > Again, I see the two things as separate issues. You need both.
> > > Implementing MAP_SYNC doesn't mean we don't have to solve the bigger
> > > issue of making existing applications work safely.
> >
> > I want to add one more thing to this discussion, just for the sake of
> > clarity. When I talk about existing applications and pmem, I mean
> > applications that already know how to detect and recover from torn
> > sectors. Any application that assumes hardware does not tear sectors
> > should be run on a file system layered on top of the btt.
> Which turns off DAX, and hence makes this a moot discussion because
You're missing the point. You can't take applications that don't know
how to deal with torn sectors and put them on a block device that does
not provide power fail write atomicity of a single sector. That said,
there are two classes of applications that /can/ make use of file
systems layered on top of /dev/pmem devices:
1) applications that know how to deal with torn sectors
2) these new-fangled applications written for persistent memory
Thus, it's not a moot point. There are existing applications that can
make use of the msync/fsync code we've been discussing. And then there
are these other applications that want to take care of the persistence
all on their own.
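For concreteness, here is a rough sketch in C of what I mean by the two
classes. The path, sizes and the x86 flush intrinsics are purely
illustrative, not taken from any real application; a pmem-aware program
would more likely use CLWB/CLFLUSHOPT or a library such as libpmem than
open-code the flushes:

/*
 * Illustrative sketch only: file path, sizes and flush intrinsics are
 * assumptions, not part of this thread.
 */
#include <emmintrin.h>   /* _mm_clflush(), _mm_sfence() -- x86 only */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN 4096

int main(void)
{
    int fd = open("/mnt/dax/data", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, LEN) < 0) {
        perror("open/ftruncate");
        return 1;
    }
    char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Class 1: an existing application stores its data, then relies on
     * the msync/fsync path we've been discussing to make it durable. */
    strcpy(p, "record A");
    if (msync(p, LEN, MS_SYNC) < 0)
        perror("msync");

    /* Class 2: a pmem-aware application takes care of persistence on
     * its own by flushing the cachelines it dirtied and fencing.
     * (Whether this alone is enough -- i.e. whether the filesystem has
     * any metadata left to write -- is exactly what the MAP_SYNC
     * discussion is about.) */
    strcpy(p + 64, "record B");
    for (size_t off = 64; off < 64 + sizeof("record B"); off += 64)
        _mm_clflush(p + off);
    _mm_sfence();

    munmap(p, LEN);
    close(fd);
    return 0;
}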
> Keep in mind that existing storage technologies tear filesystem data
> writes, too, because user data writes are filesystem block sized and
> not atomic at the device level (i.e. typical is 512 byte sector, 4k
> filesystem block size, so there are 7 points in a single write where
> a tear can occur on a crash).
You are conflating torn pages (pages being a generic term for anything
greater than a sector) and torn sectors. That point aside, you can do
O_DIRECT I/O at sector granularity, even on a file system that has a
block size larger than the device logical block size. Thus,
applications can control the blast radius of a write.
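As a sketch (the path is made up, and it assumes a device with 512-byte
logical sectors), that looks something like:

/* Write exactly one 512-byte sector with O_DIRECT, so at most one
 * sector is exposed to a tear on power failure.  Path and sector size
 * are illustrative assumptions. */
#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512

int main(void)
{
    void *buf;
    int fd = open("/data/journal", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* O_DIRECT requires buffer, offset and length to be aligned to the
     * device's logical block size. */
    if (posix_memalign(&buf, SECTOR, SECTOR)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, SECTOR);
    strcpy(buf, "one sector's worth of update");
    if (pwrite(fd, buf, SECTOR, 0) != SECTOR)
        perror("pwrite");
    free(buf);
    close(fd);
    return 0;
}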
> IOWs existing storage already has the capability of tearing user
> data on crash and has been doing so for at least the last 30 years.
And yet applications assume that this doesn't happen. Have a look at
this:
https://www.sqlite.org/psow.html
> Hence I really don't see any fundamental difference here with
> pmem+DAX - the only difference is that the tear granularity is
> smaller (CPU cacheline rather than sector).
Like it or not, applications have been assuming that they get power fail
write atomicity of a single sector, and they have (mostly) been right.
With persistent memory, I am certain there will be torn writes. We've
already seen it in testing. This is why I don't see file systems on a
pmem device as general purpose.
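To be concrete about what "knows how to deal with torn sectors" means,
here is an illustrative sketch; the record layout and checksum choice
are mine, not something any particular application does:

/* Torn-write detection: each record carries a checksum over its
 * payload.  If a crash tears the update (at sector or cacheline
 * granularity), the checksum fails to verify on the next read and the
 * application falls back to its recovery path (e.g. log replay).
 * Layout and polynomial are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

struct record {
    uint32_t csum;          /* checksum over payload[] */
    uint8_t  payload[508];  /* fills out one 512-byte "sector" */
};

/* Bitwise CRC-32 (reflected, poly 0xEDB88320); any strong checksum works. */
static uint32_t crc32_calc(const uint8_t *p, size_t len)
{
    uint32_t crc = ~0u;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

void record_write(struct record *r, const void *data, size_t len)
{
    if (len > sizeof(r->payload))
        len = sizeof(r->payload);
    memset(r->payload, 0, sizeof(r->payload));
    memcpy(r->payload, data, len);
    r->csum = crc32_calc(r->payload, sizeof(r->payload));
    /* ...followed by whatever persistence mechanism applies:
     * msync()/fsync() today, cacheline flushes + fences on pmem. */
}

/* Returns 1 if the record is intact, 0 if it was torn by a crash. */
int record_verify(const struct record *r)
{
    return crc32_calc(r->payload, sizeof(r->payload)) == r->csum;
}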
Irrespective of what storage systems do today, I think it's good
practice to not leave landmines for applications that will use
persistent memory. Let's be very clear on what is expected to work and
what isn't. I hope I've made my stance clear.
Cheers,
Jeff