[ adding Thanu ]
On Thu, Feb 25, 2016 at 2:27 PM, Dave Chinner <david(a)fromorbit.com> wrote:
On Thu, Feb 25, 2016 at 03:57:14PM -0500, Jeff Moyer wrote:
> Good morning, Dave,
> Dave Chinner <david(a)fromorbit.com> writes:
> > On Thu, Feb 25, 2016 at 02:11:49PM -0500, Jeff Moyer wrote:
> >> Jeff Moyer <jmoyer(a)redhat.com> writes:
> >> >> The big issue we have right now is that we haven't made the
> >> >> infrastructure work correctly and reliably for general use. Hence
> >> >> adding new APIs to work around cases where we haven't yet
> >> >> implemented correct behaviour, let alone optimised for
> >> >> performance, is, quite frankly, a clear case of premature
> >> >> optimisation.
> >> >
> >> > Again, I see the two things as separate issues. You need both.
> >> > Implementing MAP_SYNC doesn't mean we don't have to solve the
> >> > issue of making existing applications work safely.
> >> I want to add one more thing to this discussion, just for the sake of
> >> clarity. When I talk about existing applications and pmem, I mean
> >> applications that already know how to detect and recover from torn
> >> sectors. Any application that assumes hardware does not tear sectors
> >> should be run on a file system layered on top of the btt.
> > Which turns off DAX, and hence makes this a moot discussion because
> You're missing the point. You can't take applications that don't know
> how to deal with torn sectors and put them on a block device that does
> not provide power fail write atomicity of a single sector.
Very few applications actually care about atomic sector writes.
Databases are probably the only class of application that really do
care about both single sector and multi-sector atomic write
behaviour, and many of them can be configured to assume single
sector writes can be torn.
Torn user data writes have always been possible, and so pmem does
not introduce any new semantics that applications have to handle.
> > Keep in mind that existing storage technologies tear fileystem data
> > writes, too, because user data writes are filesystem block sized and
> > not atomic at the device level (i.e. typical is 512 byte sector, 4k
> > filesystem block size, so there are 7 points in a single write where
> > a tear can occur on a crash).
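[Editorial aside: the arithmetic in the quoted paragraph can be checked in a couple of lines. A trivial illustration, not from the thread; the constants are the typical values Dave cites.]

```python
# Typical geometry from the thread: 512-byte device sectors,
# 4 KiB filesystem blocks. A single filesystem-block write spans
# 8 sectors, so a crash can leave it partially complete at any of
# the 7 sector boundaries inside the block.
SECTOR_SIZE = 512
FS_BLOCK_SIZE = 4096

sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE   # 8 sectors
tear_points = sectors_per_block - 1                # 7 internal boundaries

print(sectors_per_block, tear_points)  # → 8 7
```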
> You are conflating torn pages (pages being a generic term for anything
> greater than a sector) and torn sectors.
No, I'm not. I'm pointing out that applications that really care
about data integrity already have the capability to recover from
torn sectors in the event of a crash. pmem+DAX does not introduce
any new way of corrupting user data for these applications.
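[Editorial aside: one common shape of that recovery capability is a per-record checksum, so that after a crash a torn record fails verification and the application falls back to its log. A minimal sketch under that assumption; all names here are hypothetical, not taken from any real application.]

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and a CRC32 checksum."""
    crc = zlib.crc32(payload)
    return struct.pack("<II", len(payload), crc) + payload

def decode_record(buf: bytes):
    """Return the payload, or None if the record was torn/corrupted."""
    length, crc = struct.unpack_from("<II", buf)
    payload = buf[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None  # torn write detected: caller recovers from its log
    return payload

rec = encode_record(b"hello")
assert decode_record(rec) == b"hello"
# Simulate a torn write: the tail of the record never reached stable storage.
assert decode_record(rec[:-2]) is None
```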
> > IOWs existing storage already has the capability of tearing user
> > data on crash and has been doing so for at least the last 30 years.
> And yet applications assume that this doesn't happen. Have a look at
"All versions of SQLite up to and including version 3.7.9 assume
that the filesystem does not provide powersafe overwrite. [...]
Hence it seems reasonable to assume powersafe overwrite for modern
disks. [...] Caution is advised though. As Roger Binns noted on the
SQLite developers mailing list: "'poorly written' should be the main
assumption about drive firmware."
IOWs, SQLite used to always assume that single sector overwrites can
be torn, and now that it is optional it recommends that users should
assume this is the way their storage behaves in order to be safe. In
this config, it uses the write ahead log even for single sector
writes, and hence can recover from torn sector writes without having
to detect that the write was torn.
"SQLite never assumes that database page writes are atomic,
regardless of the PSOW setting. And hence SQLite is always able
to automatically recover from torn pages induced by a crash."
This is because multi-sector writes are always staged through the
write ahead log and hence are cleanly recoverable after a crash
without having to detect whether a torn write occurred or not.
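[Editorial aside: the staging scheme described above can be sketched as a tiny write-ahead log. The update is durable in the log before the in-place write begins, and replay stops at the first incomplete or checksum-failing record, so a torn in-place write never needs to be detected. A hedged sketch only; function names are hypothetical and this is not SQLite's actual implementation.]

```python
import os
import struct
import zlib

def log_append(log, offset: int, data: bytes) -> None:
    """Append one checksummed record and make it durable before any overwrite."""
    rec = struct.pack("<QI", offset, len(data)) + data
    log.write(struct.pack("<I", zlib.crc32(rec)) + rec)
    log.flush()
    os.fsync(log.fileno())  # record is on stable storage before the in-place write

def log_replay(path: str, apply) -> None:
    """Replay complete, checksum-valid records; stop at the first torn one."""
    with open(path, "rb") as log:
        while True:
            hdr = log.read(16)  # crc (4) + offset (8) + length (4)
            if len(hdr) < 16:
                break  # truncated tail: incomplete record, stop replay
            crc, offset, length = struct.unpack("<IQI", hdr)
            data = log.read(length)
            rec = hdr[4:] + data
            if len(data) < length or zlib.crc32(rec) != crc:
                break  # torn log record: ignore it and stop replay
            apply(offset, data)
```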
IOWs, you've just pointed to an application that demonstrates
pmem-safe behaviour - just configure the database files with
"file:somefile.db?psow=0" and it will assume that individual sector
writes can be torn, and it will always recover.
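[Editorial aside: the psow=0 setting above is passed as a SQLite URI filename parameter at open time. A minimal Python sketch of that configuration; "somefile.db" is the placeholder name from the thread.]

```python
import sqlite3

# Open the database with powersafe overwrite disabled, so SQLite
# assumes single-sector overwrites can be torn and journals accordingly.
conn = sqlite3.connect("file:somefile.db?psow=0", uri=True)
conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES ('key', 'value')")
conn.commit()
conn.close()
```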
Hence I'm not sure exactly what point you are trying to make with
I met Thanu today at USENIX FAST '16 and his research has found
other applications that assume sector atomicity. Also, here's a
thread he pointed to about the sector atomicity dependencies of LMDB.
BTT is needed because existing software assumes sectors are not torn
and may not yet have settings like "psow=0" to work around that
assumption. Jeff's right: we would be mistaken not to recommend BTT
by default. In that respect, applications running on top of raw pmem,
sans BTT, are already making an "I know what I am doing" decision in