[PATCH v2 0/4] libnvdimm: Cross-arch compatible namespace alignment
by Dan Williams
Changes since v1 [1]:
- Fix build errors with the PowerPC override for memremap_compat_align()
- Move the memremap_compat_align() override definition to
arch/powerpc/mm/ioremap.c (Aneesh)
[1]: http://lore.kernel.org/r/158041475480.3889308.655103391935006598.stgit@dw...
---
Explicit review requests, but any other feedback is of course
appreciated:
Patch 1 needs an ack from ppc arch maintainers, and I'd like a tested-by
from Aneesh that this still works to solve the ppc issue. Jeff, does
this look good to you?
---
Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than by padding at
the namespace level, which has been a broken approach to date.
Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels.
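For reference, a sketch of the shape this takes, assuming the generic
default lands in mm/memremap.c and the PowerPC override in
arch/powerpc/mm/ioremap.c (per the changelog above); SUBSECTION_SIZE is
the existing 2MiB sub-section unit from include/linux/mmzone.h, and the
16MiB requirement follows from the hash MMU linear-map page size:

/* mm/memremap.c: conservative default, overridable per architecture */
#ifndef CONFIG_ARCH_HAS_MEMREMAP_COMPAT_ALIGN
unsigned long memremap_compat_align(void)
{
	return SUBSECTION_SIZE;
}
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif

/* arch/powerpc/mm/ioremap.c: honor the linear-map geometry */
unsigned long memremap_compat_align(void)
{
	unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;

	if (radix_enabled())
		return SUBSECTION_SIZE;
	return max(SUBSECTION_SIZE, 1UL << shift);
}
EXPORT_SYMBOL_GPL(memremap_compat_align);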
The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.
Thanks to Aneesh for early feedback and testing on this improved
alignment handling.
---
Dan Williams (4):
mm/memremap_pages: Introduce memremap_compat_align()
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/region: Introduce an 'align' attribute
arch/powerpc/Kconfig | 1
arch/powerpc/mm/ioremap.c | 12 +++
arch/powerpc/platforms/pseries/papr_scm.c | 2
drivers/acpi/nfit/core.c | 4 +
drivers/nvdimm/dimm.c | 2
drivers/nvdimm/dimm_devs.c | 95 +++++++++++++++++----
drivers/nvdimm/namespace_devs.c | 21 ++++-
drivers/nvdimm/nd.h | 3 -
drivers/nvdimm/pfn_devs.c | 2
drivers/nvdimm/region_devs.c | 132 ++++++++++++++++++++++++++---
include/linux/libnvdimm.h | 2
include/linux/memremap.h | 8 ++
include/linux/mmzone.h | 1
lib/Kconfig | 3 +
mm/memremap.c | 13 +++
15 files changed, 260 insertions(+), 41 deletions(-)
--
base-commit: 543506a2936aaced94bcc8731aae5d29d0442e90
[GIT PULL] dax fixes for v5.6-rc2
by Dan Williams
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/dax-fixes-5.6-rc1
...to receive a fix for an xfstest failure and an update that removes
an fsdax dependency on block devices. The update is small enough that I
held it back to merge with the fix post -rc1 and let it all appear in a
-next release. No reported issues in -next.
---
The following changes since commit d1eef1c619749b2a57e514a3fa67d9a516ffa919:
Linux 5.5-rc2 (2019-12-15 15:16:08 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/dax-fixes-5.6-rc1
for you to fetch changes up to 96222d53842dfe54869ec4e1b9d4856daf9105a2:
dax: pass NOWAIT flag to iomap_apply (2020-02-05 20:34:32 -0800)
----------------------------------------------------------------
dax fixes 5.6-rc1
- Fix RWF_NOWAIT writes to properly return -EAGAIN
- Clean up an unused helper
- Update dax_writeback_mapping_range to not need a block_device argument
----------------------------------------------------------------
Jeff Moyer (1):
dax: pass NOWAIT flag to iomap_apply
Vivek Goyal (2):
dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
dax: Get rid of fs_dax_get_by_host() helper
drivers/dax/super.c | 2 +-
fs/dax.c | 11 ++++-------
fs/ext2/inode.c | 5 +++--
fs/ext4/inode.c | 2 +-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 14 ++------------
6 files changed, 12 insertions(+), 24 deletions(-)
[PATCH v3 00/19][RFC] virtio-fs: Enable DAX support
by Vivek Goyal
Hi,
This patch series enables DAX support for the virtio-fs filesystem. The
patches are based on the 5.3-rc5 kernel and depend on the first patch
series posted for virtio-fs support, with subject "virtio-fs: shared
file system for virtual machines".
https://www.redhat.com/archives/virtio-fs/2019-August/msg00281.html
Enabling DAX seems to improve performance a great deal for most
operations. I have reported performance numbers in the first patch
series, so I am not repeating them here.
Any comments or feedback is welcome.
Thanks
Vivek
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (4):
dax: remove block device dependencies
fuse, dax: add fuse_conn->dax_dev field
virtio_fs, dax: Set up virtio_fs dax_device
fuse, dax: add DAX mmap support
Vivek Goyal (12):
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Keep a list of free dax memory ranges
fuse: implement FUSE_INIT map_alignment field
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Maintain a list of busy elements
dax: Create a range version of dax_layout_busy_page()
fuse: Add logic to free up a memory range
fuse: Release file in process context
fuse: Take inode lock for dax inode truncation
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 108 +++
fs/dax.c | 89 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/fuse/cuse.c | 3 +-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 1206 +++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 99 ++-
fs/fuse/inode.c | 138 +++-
fs/fuse/virtio_fs.c | 134 +++-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 12 +-
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 47 +-
include/uapi/linux/virtio_fs.h | 3 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 11 +-
19 files changed, 1868 insertions(+), 53 deletions(-)
--
2.20.1
Re: [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_PFNMAP VMAs
by Joao Martins
On 2/7/20 9:08 PM, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2020 at 07:03:12PM +0000, Joao Martins wrote:
>> From: Nikita Leshenko <nikita.leshchenko(a)oracle.com>
>>
>> Unconditionally interpreting vm_pgoff as a PFN is incorrect.
>>
>> VMAs created by /dev/mem do this, but in general VM_PFNMAP just means
>> that the VMA doesn't have an associated struct page and is being managed
>> directly by something other than the core mmu.
>>
>> Use follow_pfn like KVM does to find the PFN.
>>
>> Signed-off-by: Nikita Leshenko <nikita.leshchenko(a)oracle.com>
>> drivers/vfio/vfio_iommu_type1.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ada8e6cdb88..1e43581f95ea 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -362,9 +362,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> vma = find_vma_intersection(mm, vaddr, vaddr + 1);
>>
>> if (vma && vma->vm_flags & VM_PFNMAP) {
>> - *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> - if (is_invalid_reserved_pfn(*pfn))
>> - ret = 0;
>> + ret = follow_pfn(vma, vaddr, pfn);
>> + if (!ret && !is_invalid_reserved_pfn(*pfn))
>> + ret = -EOPNOTSUPP;
>> }
>
> FWIW this existing code is a huge hack and a security problem.
>
> I'm not sure how you could be successfully using this path on actual
> memory without hitting bad bugs?
>
ATM I think this codepath is largely hit for MMIO (GPU passthrough, or
mdev). In the context of this patch, guest memory would be treated
similarly, meaning the device-dax backing memory wouldn't have a 'struct
page' (as introduced in this series).
> Fundamentally VFIO can't retain a reference to a page from within a VMA
> without some kind of refcount/locking/etc to allow the thing that put
> the page there to know it is still being used (ie programmed in a
> IOMMU) by VFIO.
>
> Otherwise it creates use-after-free style security problems on the
> page.
>
I take it you're referring to the past problems with long-term page pinning +
fsdax? Or did you have something else in mind, perhaps related to your LSFMM
topic? Here the memory can't be used by the kernel (and there's no struct
page) except by device-dax managing/tearing down/driving the pfn region
(which is static, and the underlying PFNs won't change throughout the
device's lifetime), and by vfio pinning/unpinning the pfns (which are
refcounted across multiple map/unmaps).
> This code needs to be deleted, not extended :(
To some extent it isn't really an extension: the patch was just removing the
assumption that @vm_pgoff is the 'start pfn' on PFNMAP vmas.
get_vaddr_frames() does something similar.
Joao
[RFC PATCH 0/5][V2] dax,pmem: Provide a dax operation to zero range of memory
by Vivek Goyal
Hi,
This is V2 of the patches. I posted V1 here:
https://lore.kernel.org/linux-fsdevel/20200123165249.GA7664@redhat.com/
Changes since V1.
- Took care of feedback from Christoph.
- Made ->zero_page_range() mandatory operation.
- Provided a generic helper to zero a range for non-pmem drivers (see
the sketch after this list).
- Merged __dax_zero_page_range() and iomap_dax_zero()
- Made changes to dm drivers.
- Limited range zeroing to within a single page.
- Tested patches with real hardware.
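A minimal sketch of such a generic fallback, built from the existing
dax_direct_access()/dax_flush() primitives (the helper name is
illustrative, and the caller is assumed to hold dax_read_lock(), as the
fs/dax.c fallback in the v1 posting further down this page does):

/* Zero 'size' bytes at 'offset' within page 'pgoff' of the dax device */
static int generic_dax_zero_page_range(struct dax_device *dax_dev,
		pgoff_t pgoff, unsigned int offset, size_t size)
{
	void *kaddr;
	long rc;

	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
	if (rc < 0)
		return rc;
	memset(kaddr + offset, 0, size);
	dax_flush(dax_dev, kaddr + offset, size);
	return 0;
}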
Description
-----------
This is an RFC patch series to provide a dax operation to zero a range of
memory. It will also clear poison in the process.
Motivation for this patch comes from Christoph's feedback that he would
rather have a dax way to zero a range instead of relying on having to
call blkdev_issue_zeroout() in __dax_zero_page_range().
https://lkml.org/lkml/2019/8/26/361
My motivation for this change is virtiofs DAX support. There we use DAX
but we don't have a block device, so any dax code which assumes there is
always an associated block device is a problem. This is more of a cleanup
of one of the places where dax has this dependency on a block device:
adding a dax operation for zeroing a range avoids having to call
blkdev_issue_zeroout() in the dax path.
Thanks
Vivek
Vivek Goyal (5):
dax, pmem: Add a dax operation zero_page_range
s390,dax: Add dax zero_page_range operation to dcssblk driver
dm,dax: Add dax zero_page_range operation
dax,iomap: Start using dax native zero_page_range()
dax,iomap: Add helper dax_iomap_zero() to zero a range
drivers/dax/super.c | 20 ++++++++++++
drivers/md/dm-linear.c | 18 +++++++++++
drivers/md/dm-log-writes.c | 17 ++++++++++
drivers/md/dm-stripe.c | 23 ++++++++++++++
drivers/md/dm.c | 30 ++++++++++++++++++
drivers/nvdimm/pmem.c | 50 +++++++++++++++++++++++++++++
drivers/s390/block/dcssblk.c | 7 ++++
fs/dax.c | 60 ++++++++++++++---------------------
fs/iomap/buffered-io.c | 9 +-----
include/linux/dax.h | 17 ++++++----
include/linux/device-mapper.h | 3 ++
11 files changed, 204 insertions(+), 50 deletions(-)
--
2.18.1
[patch] dax: pass NOWAIT flag to iomap_apply
by Jeff Moyer
fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
dax". The reason is that the initial pwrite to an empty file with the
RWF_NOWAIT flag set does not return -EAGAIN. It turns out that
dax_iomap_rw doesn't pass that flag through to iomap_apply.
With this patch applied, generic/471 passes for me.
Signed-off-by: Jeff Moyer <jmoyer(a)redhat.com>
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..0b0d8819cb1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1207,6 +1207,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
lockdep_assert_held(&inode->i_rwsem);
}
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ flags |= IOMAP_NOWAIT;
+
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, dax_iomap_actor);
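For context, generic/471 exercises this path via pwritev2(2). A minimal
reproducer along these lines (the dax mount point is hypothetical)
should see -EAGAIN on the first RWF_NOWAIT write to an empty file, since
that write has to allocate blocks:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/mnt/dax/testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
	ssize_t ret;

	if (fd < 0)
		return 1;
	memset(buf, 0xab, sizeof(buf));
	/* first write to a hole must allocate, so RWF_NOWAIT must fail fast */
	ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (ret < 0 && errno == EAGAIN)
		printf("got expected EAGAIN\n");
	else
		printf("unexpected: ret=%zd errno=%d\n", ret, errno);
	close(fd);
	return 0;
}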
[PATCH 0/5] libnvdimm: Cross-arch compatible namespace alignment
by Dan Williams
Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than by padding at
the namespace level, which has been a broken approach to date.
Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels.
The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.
Thanks to Aneesh for early feedback and testing on this improved
alignment handling.
---
Dan Williams (5):
mm/memremap_pages: Kill unused __devm_memremap_pages()
mm/memremap_pages: Introduce memremap_compat_align()
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/region: Introduce an 'align' attribute
arch/powerpc/include/asm/io.h | 10 ++
arch/powerpc/platforms/pseries/papr_scm.c | 2
drivers/acpi/nfit/core.c | 4 +
drivers/nvdimm/dimm.c | 2
drivers/nvdimm/dimm_devs.c | 95 +++++++++++++++++----
drivers/nvdimm/namespace_devs.c | 21 ++++-
drivers/nvdimm/nd.h | 3 -
drivers/nvdimm/pfn_devs.c | 2
drivers/nvdimm/region_devs.c | 132 ++++++++++++++++++++++++++---
include/linux/io.h | 23 +++++
include/linux/libnvdimm.h | 2
include/linux/mmzone.h | 1
12 files changed, 255 insertions(+), 42 deletions(-)
[bug report] libnvdimm, nvdimm: dimm driver and base libnvdimm
device-driver infrastructure
by Dan Carpenter
Hello Dan Williams,
The patch 4d88a97aa9e8: "libnvdimm, nvdimm: dimm driver and base
libnvdimm device-driver infrastructure" from May 31, 2015, leads to
the following static checker warning:
drivers/nvdimm/bus.c:511 nd_async_device_register()
error: dereferencing freed memory 'dev'
drivers/nvdimm/bus.c
502 static void nd_async_device_register(void *d, async_cookie_t cookie)
503 {
504 struct device *dev = d;
505
506 if (device_add(dev) != 0) {
507 dev_err(dev, "%s: failed\n", __func__);
508 put_device(dev);
^^^^^^^^^^^^^^^
509 }
510 put_device(dev);
^^^^^^^^^^^^^^
511 if (dev->parent)
512 put_device(dev->parent);
513 }
We call get_device() from __nd_device_register(), I guess. It seems
buggy to call put_device() twice on error.
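One way to at least avoid dereferencing freed memory (a sketch only, not
necessarily the right fix; the error-path reference counting still needs
auditing against the get_device() in __nd_device_register()) is to cache
the parent pointer before the final put:

static void nd_async_device_register(void *d, async_cookie_t cookie)
{
	struct device *dev = d;
	struct device *parent = dev->parent;

	if (device_add(dev) != 0) {
		dev_err(dev, "%s: failed\n", __func__);
		put_device(dev);	/* drop the device_add() failure ref */
	}
	put_device(dev);	/* may free dev; don't touch it after this */
	if (parent)
		put_device(parent);
}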
regards,
dan carpenter
[RFC] dax,pmem: Provide a dax operation to zero range of memory
by Vivek Goyal
Hi,
This is an RFC patch to provide a dax operation to zero a range of memory.
It will also clear poison in the process. This is primarily a
compile-tested patch. I don't have real hardware to test the poison logic.
I am posting this to figure out if this is the right direction or not.
Motivation for this patch comes from Christoph's feedback that he would
rather have a dax way to zero a range instead of relying on having to
call blkdev_issue_zeroout() in __dax_zero_page_range().
https://lkml.org/lkml/2019/8/26/361
My motivation for this change is virtiofs DAX support. There we use DAX
but we don't have a block device, so any dax code which assumes there is
always an associated block device is a problem. This is more of a cleanup
of one of the places where dax has this dependency on a block device:
adding a dax operation for zeroing a range avoids having to call
blkdev_issue_zeroout() in the dax path.
I have yet to take care of stacked block drivers (dm/md).
The current poison-clearing logic is written primarily with the assumption
that I/O is sector-aligned. With this new method that assumption is
broken, as one can pass an arbitrary range of memory to zero. I have fixed
a few places in the existing logic to be able to handle an arbitrary
start/end. I am not sure whether there are other dependencies which might
need fixing or which might prohibit us from providing this method.
Any feedback or comment is welcome.
Thanks
Vivek
---
drivers/dax/super.c | 13 +++++++++
drivers/nvdimm/pmem.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------
fs/dax.c | 39 ++++++++---------------------
include/linux/dax.h | 3 ++
4 files changed, 85 insertions(+), 37 deletions(-)
Index: rhvgoyal-linux/drivers/nvdimm/pmem.c
===================================================================
--- rhvgoyal-linux.orig/drivers/nvdimm/pmem.c 2020-01-23 11:32:11.075139183 -0500
+++ rhvgoyal-linux/drivers/nvdimm/pmem.c 2020-01-23 11:32:28.660139183 -0500
@@ -52,8 +52,8 @@ static void hwpoison_clear(struct pmem_d
if (is_vmalloc_addr(pmem->virt_addr))
return;
- pfn_start = PHYS_PFN(phys);
- pfn_end = pfn_start + PHYS_PFN(len);
+ pfn_start = PFN_UP(phys);
+ pfn_end = PFN_DOWN(phys + len);
for (pfn = pfn_start; pfn < pfn_end; pfn++) {
struct page *page = pfn_to_page(pfn);
@@ -71,22 +71,24 @@ static blk_status_t pmem_clear_poison(st
phys_addr_t offset, unsigned int len)
{
struct device *dev = to_dev(pmem);
- sector_t sector;
+ sector_t sector_start, sector_end;
long cleared;
blk_status_t rc = BLK_STS_OK;
+ int nr_sectors;
- sector = (offset - pmem->data_offset) / 512;
+ sector_start = ALIGN((offset - pmem->data_offset), 512) / 512;
+ sector_end = ALIGN_DOWN((offset - pmem->data_offset + len), 512)/512;
+ nr_sectors = sector_end - sector_start;
cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
if (cleared < len)
rc = BLK_STS_IOERR;
- if (cleared > 0 && cleared / 512) {
+ if (cleared > 0 && nr_sectors > 0) {
hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
- cleared /= 512;
- dev_dbg(dev, "%#llx clear %ld sector%s\n",
- (unsigned long long) sector, cleared,
- cleared > 1 ? "s" : "");
- badblocks_clear(&pmem->bb, sector, cleared);
+ dev_dbg(dev, "%#llx clear %d sector%s\n",
+ (unsigned long long) sector_start, nr_sectors,
+ nr_sectors > 1 ? "s" : "");
+ badblocks_clear(&pmem->bb, sector_start, nr_sectors);
if (pmem->bb_state)
sysfs_notify_dirent(pmem->bb_state);
}
@@ -268,6 +270,50 @@ static const struct block_device_operati
.revalidate_disk = nvdimm_revalidate_disk,
};
+static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+ unsigned int offset, loff_t len)
+{
+ int rc = 0;
+ phys_addr_t phys_pos = pgoff * PAGE_SIZE + offset;
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+ struct page *page = ZERO_PAGE(0);
+
+ do {
+ unsigned bytes, nr_sectors = 0;
+ sector_t sector_start, sector_end;
+ bool bad_pmem = false;
+ phys_addr_t pmem_off = phys_pos + pmem->data_offset;
+ void *pmem_addr = pmem->virt_addr + pmem_off;
+ unsigned int page_offset;
+
+ page_offset = offset_in_page(phys_pos);
+ bytes = min_t(loff_t, PAGE_SIZE - page_offset, len);
+
+ sector_start = ALIGN(phys_pos, 512)/512;
+ sector_end = ALIGN_DOWN(phys_pos + bytes, 512)/512;
+ if (sector_end > sector_start)
+ nr_sectors = sector_end - sector_start;
+
+ if (nr_sectors &&
+ unlikely(is_bad_pmem(&pmem->bb, sector_start,
+ nr_sectors * 512)))
+ bad_pmem = true;
+
+ write_pmem(pmem_addr, page, 0, bytes);
+ if (unlikely(bad_pmem)) {
+ rc = pmem_clear_poison(pmem, pmem_off, bytes);
+ write_pmem(pmem_addr, page, 0, bytes);
+ }
+ if (rc > 0)
+ return -EIO;
+
+ phys_pos += bytes;
+ len -= bytes;
+ } while (len > 0);
+
+ return 0;
+}
+
static long pmem_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
{
@@ -299,6 +345,7 @@ static const struct dax_operations pmem_
.dax_supported = generic_fsdax_supported,
.copy_from_iter = pmem_copy_from_iter,
.copy_to_iter = pmem_copy_to_iter,
+ .zero_page_range = pmem_dax_zero_page_range,
};
static const struct attribute_group *pmem_attribute_groups[] = {
Index: rhvgoyal-linux/include/linux/dax.h
===================================================================
--- rhvgoyal-linux.orig/include/linux/dax.h 2020-01-23 11:25:23.814139183 -0500
+++ rhvgoyal-linux/include/linux/dax.h 2020-01-23 11:32:17.799139183 -0500
@@ -34,6 +34,8 @@ struct dax_operations {
/* copy_to_iter: required operation for fs-dax direct-i/o */
size_t (*copy_to_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
+ /* zero_page_range: optional operation for fs-dax direct-i/o */
+ int (*zero_page_range)(struct dax_device *, pgoff_t, unsigned, loff_t);
};
extern struct attribute_group dax_attribute_group;
@@ -209,6 +211,7 @@ size_t dax_copy_from_iter(struct dax_dev
size_t bytes, struct iov_iter *i);
size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
+int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, unsigned offset, loff_t len);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
Index: rhvgoyal-linux/fs/dax.c
===================================================================
--- rhvgoyal-linux.orig/fs/dax.c 2020-01-23 11:25:23.814139183 -0500
+++ rhvgoyal-linux/fs/dax.c 2020-01-23 11:32:17.801139183 -0500
@@ -1044,38 +1044,23 @@ static vm_fault_t dax_load_hole(struct x
return ret;
}
-static bool dax_range_is_aligned(struct block_device *bdev,
- unsigned int offset, unsigned int length)
-{
- unsigned short sector_size = bdev_logical_block_size(bdev);
-
- if (!IS_ALIGNED(offset, sector_size))
- return false;
- if (!IS_ALIGNED(length, sector_size))
- return false;
-
- return true;
-}
-
int __dax_zero_page_range(struct block_device *bdev,
struct dax_device *dax_dev, sector_t sector,
unsigned int offset, unsigned int size)
{
- if (dax_range_is_aligned(bdev, offset, size)) {
- sector_t start_sector = sector + (offset >> 9);
+ pgoff_t pgoff;
+ long rc, id;
- return blkdev_issue_zeroout(bdev, start_sector,
- size >> 9, GFP_NOFS, 0);
- } else {
- pgoff_t pgoff;
- long rc, id;
+ rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
+ if (rc)
+ return rc;
+
+ id = dax_read_lock();
+ rc = dax_zero_page_range(dax_dev, pgoff, offset, size);
+ if (rc == -EOPNOTSUPP) {
void *kaddr;
- rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
- if (rc)
- return rc;
-
- id = dax_read_lock();
+ /* If driver does not implement zero page range, fallback */
rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
if (rc < 0) {
dax_read_unlock(id);
@@ -1083,9 +1068,9 @@ int __dax_zero_page_range(struct block_d
}
memset(kaddr + offset, 0, size);
dax_flush(dax_dev, kaddr + offset, size);
- dax_read_unlock(id);
}
- return 0;
+ dax_read_unlock(id);
+ return rc;
}
EXPORT_SYMBOL_GPL(__dax_zero_page_range);
Index: rhvgoyal-linux/drivers/dax/super.c
===================================================================
--- rhvgoyal-linux.orig/drivers/dax/super.c 2020-01-23 11:25:23.814139183 -0500
+++ rhvgoyal-linux/drivers/dax/super.c 2020-01-23 11:32:17.802139183 -0500
@@ -344,6 +344,19 @@ size_t dax_copy_to_iter(struct dax_devic
}
EXPORT_SYMBOL_GPL(dax_copy_to_iter);
+int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+ unsigned offset, loff_t len)
+{
+ if (!dax_alive(dax_dev))
+ return -ENXIO;
+
+ if (!dax_dev->ops->zero_page_range)
+ return -EOPNOTSUPP;
+
+ return dax_dev->ops->zero_page_range(dax_dev, pgoff, offset, len);
+}
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
+
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
[PATCH RFC 00/10] device-dax: Support devices without PFN metadata
by Joao Martins
Hey,
Presented herewith is a small series which allows device-dax, without
struct page, to be used to back KVM guest memory. It's an RFC, and there
are still some items we're looking at (see TODO below); but we're
wondering if folks would be OK carving some time out of their busy
schedules to provide direction-wise feedback on this work.
In virtualized environments (especially those with no kernel-backed PV
interfaces, and just SR-IOV), memory is largely assigned to guests: either
persistent with NVDIMMs or volatile for regular RAM. The kernel
(hypervisor) tracks it with a 'struct page' (64 bytes) for each 4K page.
Overall we're spending 16GB for each 1TB of host memory tracked (2^28
pages x 64 bytes) that the kernel won't need, memory which could instead
be used to create other guests. One of the motivations of this series is
to get that 'struct page' memory back when the memory is meant to solely
be used by userspace. This is also useful for the case of memory backing
guests' virtual NVDIMMs. The other neat side effect is that the hypervisor
has no virtual mapping of the guest, and hence code gadgets (if found) are
limited in their effectiveness.
It is expected that a smaller (rather than the total) amount of host
memory is defined for the kernel (with mem=X or memmap=X!Y). For a KVM
userspace VMM (e.g. QEMU), the main thing that is needed is a device which
can be opened, mmap'd with a certain alignment (4K, 2M, 1G), and closed.
That made us look at device-dax, which does just that, and so the work
comprised here improves what's there and the interfaces it uses.
The series is divided as follows:
* Patches 1, 3: Preparatory work for patch 7, adding support for
vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL
* Patches 2, 4: Preparatory work for patch 7, adding support for
follow_pfn() to work with 2M/1G huge pages, which is
what KVM uses for VM_PFNMAP.
* Patches 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
which encompasses mainly dealing with the lack of devmap,
and creating a VM_PFNMAP vma.
* Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
At the very end of the cover letter (after scissors mark),
there's a patch for ndctl to be able to create namespaces
with '--mode devdax --map none'.
* Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
a PFN.
* Patch 10: The actual end-consumer example for the RAM case. The patch just adds a
label storage area which consequently allows namespaces to be
created. We picked PMEM legacy for starters.
Thoughts, comments appreciated.
Joao
P.S. As an example to try this out:
1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
leaving the kernel with 16G.
2) create a devdax namespace with 1G hugepages:
$ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"none",
"size":"32.00 GiB (34.36 GB)",
"uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
"daxregion":{
"id":0,
"size":"32.00 GiB (34.36 GB)",
"align":1073741824,
"devices":[
{
"chardev":"dax0.0",
"size":"32.00 GiB (34.36 GB)",
"target_node":0,
"mode":"devdax"
}
]
},
"align":1073741824
}
3) Add this to your qemu params:
-m 32G
-object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
-numa node,memdev=mem
TODO:
* Discontiguous regions/namespaces: The work above is limited to the max
contiguous extent, coming from nvdimm dpa allocation heuristics -- which I
take is because of what the specs allow for persistent namespaces. But for
the volatile RAM case we would need handling of discontiguous extents
(hence a region would represent more than one resource) to be less bound
to how guests are placed on the system. I played around with
multi-resource for device-dax, but I'm wondering about UABI: 1) whether
nvdimm DPA allocation heuristics should be relaxed for the RAM case (under
certain nvdimm region bits); or 2) whether device-dax should have its own
separate UABI to be used by daxctl (which would also be useful for hmem
devices?).
* MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
device-dax, which would allow collect_procs() to find the processes solely
based on the PFN. But for discontiguous namespaces I'm not sure this would
work; perhaps by looking at the dax-region pfn range for each DAX vma.
* NUMA: For now we excluded setting the target_node while these two
patches are being worked on [1][2].
[1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgi...
[2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit...
Joao Martins (9):
mm: Add pmd support for _PAGE_SPECIAL
mm: Handle pmd entries in follow_pfn()
mm: Add pud support for _PAGE_SPECIAL
mm: Handle pud entries in follow_pfn()
device-dax: Do not enforce MADV_DONTFORK on mmap()
device-dax: Introduce pfn_flags helper
device-dax: Add support for PFN_SPECIAL flags
dax/pmem: Add device-dax support for PFN_MODE_NONE
nvdimm/e820: add multiple namespaces support
Nikita Leshenko (1):
vfio/type1: Use follow_pfn for VM_PFNMAP VMAs
arch/x86/include/asm/pgtable.h | 34 ++++-
drivers/dax/bus.c | 3 +-
drivers/dax/device.c | 78 ++++++++----
drivers/dax/pmem/core.c | 36 +++++-
drivers/nvdimm/e820.c | 212 ++++++++++++++++++++++++++++----
drivers/vfio/vfio_iommu_type1.c | 6 +-
mm/gup.c | 6 +
mm/huge_memory.c | 15 ++-
mm/memory.c | 67 ++++++++--
9 files changed, 382 insertions(+), 75 deletions(-)
8<----------------
Subject: [PATCH] ndctl: add 'devdax' support for NDCTL_PFN_LOC_NONE
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 7fb00078646b..2568943eb207 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -206,6 +206,8 @@ static int set_defaults(enum device_action mode)
/* pass */;
else if (strcmp(param.map, "dev") == 0)
/* pass */;
+ else if (strcmp(param.map, "none") == 0)
+ /* pass */;
else {
error("invalid map location '%s'\n", param.map);
rc = -EINVAL;
@@ -755,9 +757,17 @@ static int validate_namespace_options(struct ndctl_region *region,
if (param.map) {
if (!strcmp(param.map, "mem"))
p->loc = NDCTL_PFN_LOC_RAM;
+ else if (!strcmp(param.map, "none"))
+ p->loc = NDCTL_PFN_LOC_NONE;
else
p->loc = NDCTL_PFN_LOC_PMEM;
+ if (p->loc == NDCTL_PFN_LOC_NONE
+ && p->mode != NDCTL_NS_MODE_DAX) {
+ debug("--map=none only valid for devdax mode namespace\n");
+ return -EINVAL;
+ }
+
if (ndns && p->mode != NDCTL_NS_MODE_MEMORY
&& p->mode != NDCTL_NS_MODE_DAX) {
debug("%s: --map= only valid for fsdax mode namespace\n",
--
2.17.1