Re: [PATCH 5/5] block: enable dax for raw block devices
by Dan Williams
On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david(a)fromorbit.com> wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
>> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david(a)fromorbit.com> wrote:
>> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
>> >> filesystems since there's nothing which writeprotects pages that are
>> >> writeably mapped. In normal path, page writeback does this but that doesn't
>> >> happen for DAX. I remember we once talked about this but it got lost.
>> >> We need something like walk all filesystem inodes during fs freeze and
>> >> writeprotect all pages that are mapped. But that's going to be slow...
>> >
>> > fsync() has the same problem - we have no record of the pages that
>> > need to be committed and then write protected when fsync() is called
>> > after write()...
>>
>> I know Ross is still working on that implementation. However, I had a
>> thought on the flight to ksummit that maybe we shouldn't worry about
>> tracking dirty state on a per-page basis. For small / frequent
>> synchronizations an application really should be using the nvml
>> library [1] to issue cache flushes and pcommit from userspace on a
>> per-cacheline basis. That leaves unmodified apps that want to be
>> correct in the presence of dax mappings. Two things we can do to
>> mitigate that case:
>>
>> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
>> flag. Applications shouldn't silently become incorrect simply because
>> the fs is mounted with -o dax. If an app doesn't understand DAX
>> mappings it should get page-cache semantics. This also protects apps
>> that are not expecting DAX semantics on raw block device mappings.
>
> Which is the complete opposite of what we are trying to acehive with
> DAX. i.e. that existing applications "just work" with DAX without
> modification. So this is a non-starter.
The list of things DAX breaks is getting shorter, but certainly the
page-less paths will not be without surprises for quite a while yet...
> Also, DAX access isn't a property of mmap - it's a property
> of the inode. We cannot do DAX access via mmap while mixing page
> cache based access through file descriptor based interfaces. This
> I why I'm adding an inode attribute (on disk) to enable per-file DAX
> capabilities - either everything is via the DAX paths, or nothing
> is.
>
Per-inode control sounds very useful, I'll look at a similar mechanism
for the raw block case.
However, still not quite convinced page-cache control is an inode-only
property, especially when direct-i/o is not an inode-property. That
said, I agree the complexity of handling mixed mappings of the same
file is prohibitive.
>> 2/ Even if we get a new flag that lets the kernel know the app
>> understands DAX mappings, we shouldn't leave fsync broken. Can we
>> instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to it's required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this is
> needed for journal based filesystems to work correctly, as they are
> assuming that their journal writes are being treated correctly as
> REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.
>
> However, to avoid having to flush the entire block device range on
> fsync we need a much more complex solution that tracks the dirty
> ranges of the file and hence what needs committing when fsync is
> run....
>
>> Disruptive, yes, but if an app cares about efficient persistent memory
>> synchronization fsync is already the wrong api.
>
> I don't really care about efficiency right now - correctness comes
> first. Fundamentally, the app should not care whether it is writing to
> persistent memory or spinning rust - the filesystem needs to
> provide the application with exactly the same integrity guarantees
> regardless of the underlying storage.
>
Sounds good, get blkdev_issue_flush() functional first and then worry
about building a more efficient solution on top.
5 years, 2 months
烽浩舒
by 舒烽浩
h3px8nzbfwcxrnxxku2z6gr3dk7kys
5 years, 2 months
[PATCH v2 UPDATE 3/3] ACPI/APEI/EINJ: Allow memory error injection to NVDIMM
by Toshi Kani
In the case of memory error injection, einj_error_inject() checks
if a target address is regular RAM. Update this check to add a call
to region_intersects_pmem() to verify if a target address range is
NVDIMM. This allows injecting a memory error to both RAM and NVDIMM
for testing.
Also, the current RAM check, page_is_ram(), is replaced with
region_intersects_ram() so that it can verify a target address
range with the requested size.
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
---
Add a blank line before if-statement. (Borislav Petkov)
---
drivers/acpi/apei/einj.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 0431883..db21efe 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -519,7 +519,7 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
u64 param3, u64 param4)
{
int rc;
- unsigned long pfn;
+ u64 base_addr, size;
/* If user manually set "flags", make sure it is legal */
if (flags && (flags &
@@ -545,10 +545,15 @@ static int einj_error_inject(u32 type, u32 flags, u64 param1, u64 param2,
/*
* Disallow crazy address masks that give BIOS leeway to pick
* injection address almost anywhere. Insist on page or
- * better granularity and that target address is normal RAM.
+ * better granularity and that target address is normal RAM or
+ * NVDIMM.
*/
- pfn = PFN_DOWN(param1 & param2);
- if (!page_is_ram(pfn) || ((param2 & PAGE_MASK) != PAGE_MASK))
+ base_addr = param1 & param2;
+ size = (~param2) + 1;
+
+ if (((region_intersects_ram(base_addr, size) != REGION_INTERSECTS) &&
+ (region_intersects_pmem(base_addr, size) != REGION_INTERSECTS)) ||
+ ((param2 & PAGE_MASK) != PAGE_MASK))
return -EINVAL;
inject:
5 years, 2 months
[PATCH v2 0/3] Allow EINJ to inject memory error to NVDIMM
by Toshi Kani
This patch-set extends the EINJ driver to allow injecting a memory
error to NVDIMM. It first extends iomem resource interface to support
checking a NVDIMM region.
Patch 1/3 changes region_intersects() to accept non-RAM regions, and
adds region_intersects_ram().
Patch 2/3 adds region_intersects_pmem() to check a NVDIMM region.
Patch 3/3 changes the EINJ driver to allow injecting a memory error
to NVDIMM.
---
v2:
- Change the EINJ driver to call region_intersects_ram() for checking
RAM with a specified size. (Dan Williams)
- Add export to region_intersects_ram().
---
Toshi Kani (3):
1/3 resource: Add @flags to region_intersects()
2/3 resource: Add region_intersects_pmem()
3/3 ACPI/APEI/EINJ: Allow memory error injection to NVDIMM
---
drivers/acpi/apei/einj.c | 12 ++++++++----
include/linux/mm.h | 5 ++++-
kernel/memremap.c | 5 ++---
kernel/resource.c | 35 ++++++++++++++++++++++++++++-------
4 files changed, 42 insertions(+), 15 deletions(-)
5 years, 2 months