On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
<>
>
> - It checks badblocks and discovers its files have lost data
Lots of hand-waving here. How does the application map a bad
"sector" to a file without scanning the entire filesystem to find
the owner of the bad sector?
Yes this was hand-wavey, but we talked about this a bit at LSF..
The idea is that a per-block-device badblocks list is available at
/sys/block/<pmemX>/badblocks. The application (or a suitable yet-to-be-
written library function) does a fiemap to figure out the sectors its
files are using, and correlates the two lists.
We can also look into providing an easier-to-use interface from the
kernel, in the form of an fiemap flag to report only the bad sectors, or
a SEEK_BAD flag..
The application doesn't have to scan the entire filesystem, but
presumably it knows what files it 'owns', and does a fiemap for those.
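As a rough sketch of the correlation step described above (not an existing library), the following assumes the badblocks file holds one "start_sector length" pair per line in 512-byte sectors, and that the file's extents have already been obtained from fiemap as (logical, physical, length) byte triples. The function names and the dev_offset parameter are hypothetical, introduced only for illustration:

```python
SECTOR = 512  # assumption: badblocks entries are counted in 512-byte sectors


def parse_badblocks(text):
    """Parse badblocks text: one 'start_sector length' pair per line."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        start, count = int(fields[0]), int(fields[1])
        ranges.append((start * SECTOR, count * SECTOR))
    return ranges


def bad_file_ranges(extents, bad_ranges, dev_offset=0):
    """Return sorted (file_offset, length) byte ranges that hit a bad range.

    extents: (logical, physical, length) byte triples, as fiemap reports them.
    bad_ranges: (start, length) byte pairs from parse_badblocks().
    dev_offset: hypothetical byte offset of the filesystem within the
                device whose badblocks list was read (0 if there is none).
    """
    hits = []
    for logical, physical, length in extents:
        ext_start = physical + dev_offset
        ext_end = ext_start + length
        for bad_start, bad_len in bad_ranges:
            lo = max(ext_start, bad_start)
            hi = min(ext_end, bad_start + bad_len)
            if lo < hi:  # the extent and the bad range overlap
                hits.append((logical + (lo - ext_start), hi - lo))
    return sorted(hits)
```

An application would feed this the contents of the badblocks file and the extents of each file it owns; an empty result means none of that file's blocks are currently listed as bad.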
>
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector
Where does the application find the data that was lost to be able to
rewrite it?
The data that was lost is gone -- this assumes the application has some
ability to recover using a journal/log or other redundancy - yes, at the
application layer. If it doesn't have this sort of capability, the only
option is to restore files from a backup/mirror.
>
> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
> * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors
There's an implicit assumption that applications will keep redundant
copies of their data at the /application layer/ and be able to
automatically repair it? And then there's the implicit assumption
that it will unlink and free the entire file before writing a new
copy, and that then assumes the filesystem will zero blocks if
they get reused to clear errors on that LBA sector mapping before
they are accessible again to userspace..
It seems to me that there are a number of assumptions being made
across multiple layers here. Maybe I've missed something - can you
point me to the design/architecture description so I can see how
the "app does data recovery itself" dance is supposed to work?
There isn't a document other than the flow in my head :) - but maybe I
could write one up..
I wasn't thinking the application itself maintains and restores from a
backup copy of the file.. The application hits either a SIGBUS or EIO
depending on how it accesses the data, and crashes or raises some alarm.
The recovery is then done out-of-band, by a sysadmin or such (i.e.
delete the file, replace with a known good copy, restart application).
To summarize, the two cases we want to handle are:
1. Application has inbuilt recovery:
- hits badblock
- figures out it is able to recover the data
- handles SIGBUS or EIO
- does a (sector aligned) write() to restore the data
2. Application doesn't have any inbuilt recovery mechanism
- hits badblock
- gets SIGBUS (or EIO) and crashes
- Sysadmin restores file from backup
Case 1 is handled by either a fallback to direct_IO from dax_do_io, or
always _actually_ doing direct_IO when we're opened with O_DIRECT in
spite of dax (what Dan suggested). Currently, if we're mounted with dax,
all IO, O_DIRECT or otherwise, will go through dax_do_io.
Case 2 is handled by patch 4 of the series:
dax: use sb_issue_zerout instead of calling dax_clear_sectors
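For case 1, the "(sector aligned) write()" step could look roughly like the sketch below. It assumes a 512-byte sector and a hypothetical recover() callback that returns known-good bytes from the application's own journal or replica; neither is defined by this series. Whether such a write actually clears the error depends on the fallback path discussed above.

```python
import os

SECTOR = 512  # assumed sector size; a real device may report a different one


def sector_span(offset, length, sector=SECTOR):
    """Widen a byte range so it starts and ends on sector boundaries."""
    start = (offset // sector) * sector
    end = -(-(offset + length) // sector) * sector  # round up
    return start, end - start


def rewrite_sectors(fd, offset, length, recover):
    """Overwrite every sector touching [offset, offset + length) via write().

    recover(start, size) is a hypothetical callback returning `size` bytes
    of known-good data for the file range beginning at byte `start`.
    """
    start, size = sector_span(offset, length)
    data = recover(start, size)
    if len(data) != size:
        raise ValueError("recovery source returned a short buffer")
    written = 0
    while written < size:  # pwrite may write fewer bytes than requested
        written += os.pwrite(fd, data[written:], start + written)
    os.fsync(fd)
    return start, size
```

The range is widened to whole sectors because, as noted earlier in the thread, an application doing this level of recovery is expected to write the entire bad sector rather than a fragment of it.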
>
> Cheers,
>
> Dave.