[PATCH] mm: make GUP handle pfn mapping unless FOLL_GET is requested
by Kirill A. Shutemov
With DAX, pfn mappings are becoming more common. This patch adjusts the GUP code
to cover pfn mappings for the cases where we don't need a struct page to
proceed.
To make this possible, change the follow_page() code to return an -EEXIST
error code if a proper page table entry exists but there is no corresponding
struct page. __get_user_pages() then ignores the error code and moves on to
the next page frame.
The immediate effect of the change is that MAP_POPULATE and mlock() now work
on DAX mappings.
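As an illustration of the newly working behaviour (not part of the patch; the
file path and the 2 MiB length below are placeholders for any sufficiently
large file on a DAX-mounted filesystem):
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
int main(void)
{
	/* Placeholder path: any file on a DAX mount, at least 2 MiB long. */
	const size_t len = 2UL << 20;
	int fd = open("/mnt/dax/testfile", O_RDWR);
	void *addr;
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* MAP_POPULATE pre-faults the whole range at mmap() time. */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Before this change, mlock() on a shared DAX mapping failed. */
	if (mlock(addr, len))
		perror("mlock");
	else
		munlock(addr, len);
	munmap(addr, len);
	close(fd);
	return 0;
}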
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Toshi Kani <toshi.kani(a)hp.com>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
---
mm/gup.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 48 insertions(+), 10 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 222d57e335f9..03645f400748 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -33,6 +33,30 @@ static struct page *no_page_table(struct vm_area_struct *vma,
return NULL;
}
+static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, unsigned int flags)
+{
+ /* No page to get reference */
+ if (flags & FOLL_GET)
+ return -EFAULT;
+
+ if (flags & FOLL_TOUCH) {
+ pte_t entry = *pte;
+
+ if (flags & FOLL_WRITE)
+ entry = pte_mkdirty(entry);
+ entry = pte_mkyoung(entry);
+
+ if (!pte_same(*pte, entry)) {
+ set_pte_at(vma->vm_mm, address, pte, entry);
+ update_mmu_cache(vma, address, pte);
+ }
+ }
+
+ /* Proper page table entry exists, but no corresponding struct page */
+ return -EEXIST;
+}
+
static struct page *follow_page_pte(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, unsigned int flags)
{
@@ -74,10 +98,21 @@ retry:
page = vm_normal_page(vma, address, pte);
if (unlikely(!page)) {
- if ((flags & FOLL_DUMP) ||
- !is_zero_pfn(pte_pfn(pte)))
- goto bad_page;
- page = pte_page(pte);
+ if (flags & FOLL_DUMP) {
+ /* Avoid special (like zero) pages in core dumps */
+ page = ERR_PTR(-EFAULT);
+ goto out;
+ }
+
+ if (is_zero_pfn(pte_pfn(pte))) {
+ page = pte_page(pte);
+ } else {
+ int ret;
+
+ ret = follow_pfn_pte(vma, address, ptep, flags);
+ page = ERR_PTR(ret);
+ goto out;
+ }
}
if (flags & FOLL_GET)
@@ -115,12 +150,9 @@ retry:
unlock_page(page);
}
}
+out:
pte_unmap_unlock(ptep, ptl);
return page;
-bad_page:
- pte_unmap_unlock(ptep, ptl);
- return ERR_PTR(-EFAULT);
-
no_page:
pte_unmap_unlock(ptep, ptl);
if (!pte_none(pte))
@@ -490,9 +522,15 @@ retry:
goto next_page;
}
BUG();
- }
- if (IS_ERR(page))
+ } else if (PTR_ERR(page) == -EEXIST) {
+ /*
+ * Proper page table entry exists, but no corresponding
+ * struct page.
+ */
+ goto next_page;
+ } else if (IS_ERR(page)) {
return i ? i : PTR_ERR(page);
+ }
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
--
2.1.4
[PATCH 00/15] libnvdimm: ->rw_bytes(), BLK-mode, unit tests, and misc features
by Dan Williams
This patchset takes the position that a new block_device_operations op
is needed for nvdimm devices. Jens, see "[PATCH 01/15] block: introduce
an ->rw_bytes() block device operation", which gates the rest of the series
moving forward.
Aside from adding a compile-time check to tools/testing/nvdimm/Kbuild
for validating all libnvdimm objects are built as modules, patches 2 to
6 are otherwise unchanged from the v6 libnvdimm posting [1]. The
remaining patches are feature additions and other cleanups that were
being held back while the base patchset was polished.
Patch 5 has an updated changelog speaking to the potential maintenance
burden of carrying tools/testing/nvdimm/ in-tree. The benefits still
outweigh the risks in my opinion.
It should be noted that "[PATCH 14/15] libnvdimm: support read-only btt
backing devices" was developed in direct repsonse to working through the
implementation of unit tests for "[PATCH 15/15] libnvdimm, nfit: handle
acpi_nfit_memory_map flags" and its new "read-only by default" policy.
See the updates to the libndctl unit tests posted on the
linux-nvdimm(a)01.org mailing list.
[PATCH 01/15] block: introduce an ->rw_bytes() block device operation
[PATCH 02/15] libnvdimm: infrastructure for btt devices
[PATCH 03/15] nd_btt: atomic sector updates
[PATCH 04/15] libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
[PATCH 05/15] tools/testing/nvdimm: libnvdimm unit test infrastructure
[PATCH 06/15] libnvdimm: Non-Volatile Devices
[PATCH 07/15] fs/block_dev.c: skip rw_page if bdev has integrity
[PATCH 08/15] libnvdimm, btt: add support for blk integrity
[PATCH 09/15] libnvdimm, blk: add support for blk integrity
[PATCH 10/15] libnvdimm: fix up max_hw_sectors
[PATCH 11/15] libnvdimm: pmem, blk, and btt make_request cleanups
[PATCH 12/15] libnvdimm: enable iostat
[PATCH 13/15] libnvdimm: flag libnvdimm block devices as non-rotational
[PATCH 14/15] libnvdimm: support read-only btt backing devices
[PATCH 15/15] libnvdimm, nfit: handle acpi_nfit_memory_map flags
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-June/001166.html
---
Dan Williams (10):
block: introduce an ->rw_bytes() block device operation
libnvdimm: infrastructure for btt devices
tools/testing/nvdimm: libnvdimm unit test infrastructure
libnvdimm: Non-Volatile Devices
libnvdimm: fix up max_hw_sectors
libnvdimm: pmem, blk, and btt make_request cleanups
libnvdimm: enable iostat
libnvdimm: flag libnvdimm block devices as non-rotational
libnvdimm: support read-only btt backing devices
libnvdimm, nfit: handle acpi_nfit_memory_map flags
Ross Zwisler (1):
libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
Vishal Verma (4):
nd_btt: atomic sector updates
fs/block_dev.c: skip rw_page if bdev has integrity
libnvdimm, btt: add support for blk integrity
libnvdimm, blk: add support for blk integrity
Documentation/nvdimm/btt.txt | 273 ++++++
Documentation/nvdimm/nvdimm.txt | 805 +++++++++++++++++
MAINTAINERS | 39 +
drivers/acpi/nfit.c | 491 ++++++++++
drivers/acpi/nfit.h | 58 +
drivers/nvdimm/Kconfig | 54 +
drivers/nvdimm/Makefile | 7
drivers/nvdimm/blk.c | 368 ++++++++
drivers/nvdimm/btt.c | 1569 +++++++++++++++++++++++++++++++++
drivers/nvdimm/btt.h | 185 ++++
drivers/nvdimm/btt_devs.c | 473 ++++++++++
drivers/nvdimm/bus.c | 176 ++++
drivers/nvdimm/core.c | 99 ++
drivers/nvdimm/dimm_devs.c | 9
drivers/nvdimm/namespace_devs.c | 63 +
drivers/nvdimm/nd-core.h | 48 +
drivers/nvdimm/nd.h | 61 +
drivers/nvdimm/pmem.c | 58 +
drivers/nvdimm/region.c | 97 ++
drivers/nvdimm/region_devs.c | 106 ++
fs/block_dev.c | 4
include/linux/blkdev.h | 44 +
include/linux/libnvdimm.h | 30 +
include/uapi/linux/ndctl.h | 2
tools/testing/nvdimm/Kbuild | 40 +
tools/testing/nvdimm/Makefile | 7
tools/testing/nvdimm/config_check.c | 15
tools/testing/nvdimm/test/Kbuild | 8
tools/testing/nvdimm/test/iomap.c | 151 +++
tools/testing/nvdimm/test/nfit.c | 1115 +++++++++++++++++++++++
tools/testing/nvdimm/test/nfit_test.h | 29 +
31 files changed, 6422 insertions(+), 62 deletions(-)
create mode 100644 Documentation/nvdimm/btt.txt
create mode 100644 Documentation/nvdimm/nvdimm.txt
create mode 100644 drivers/nvdimm/blk.c
create mode 100644 drivers/nvdimm/btt.c
create mode 100644 drivers/nvdimm/btt.h
create mode 100644 drivers/nvdimm/btt_devs.c
create mode 100644 tools/testing/nvdimm/Kbuild
create mode 100644 tools/testing/nvdimm/Makefile
create mode 100644 tools/testing/nvdimm/config_check.c
create mode 100644 tools/testing/nvdimm/test/Kbuild
create mode 100644 tools/testing/nvdimm/test/iomap.c
create mode 100644 tools/testing/nvdimm/test/nfit.c
create mode 100644 tools/testing/nvdimm/test/nfit_test.h
[PATCH] mm: Fix MAP_POPULATE and mlock() for DAX
by Toshi Kani
DAX has the following issues in a shared or read-only private
mmap'd file.
- mmap(MAP_POPULATE) does not pre-fault
- mlock() fails with -ENOMEM
DAX uses VM_MIXEDMAP for mmap'd files, which do not have struct
pages associated with their ranges. Both MAP_POPULATE and mlock()
call __mm_populate(), which in turn calls __get_user_pages().
Because __get_user_pages() requires a valid page returned from
follow_page_mask(), MAP_POPULATE and mlock(), i.e. FOLL_POPULATE,
fail on the first page.
Change __get_user_pages() to proceed with FOLL_POPULATE when the
translation is set but its page does not exist (-EFAULT) and
@pages is not requested. With that, MAP_POPULATE and mlock()
set up translations for the requested range and complete successfully.
MAP_POPULATE still provides a major performance improvement to
DAX as it will avoid page faults during initial access to the
pages.
mlock() continues to set VM_LOCKED on the vma and populate the range.
Since there is no struct page, the range is pinned without marking
the pages mlocked.
Note that MAP_POPULATE and mlock() already work for a writable
private mmap'd file on DAX, since populate_vma_page_range() breaks
COW, which allocates pages that do have struct page backing.
Signed-off-by: Toshi Kani <toshi.kani(a)hp.com>
---
mm/gup.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index 6297f6b..16d536f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -490,8 +490,20 @@ retry:
}
BUG();
}
- if (IS_ERR(page))
+ if (IS_ERR(page)) {
+ /*
+ * No page may be associated with VM_MIXEDMAP. Proceed
+ * FOLL_POPULATE when the translation is set but its
+ * page does not exist (-EFAULT), and @pages is not
+ * requested by the caller.
+ */
+ if ((PTR_ERR(page) == -EFAULT) && (!pages) &&
+ (gup_flags & FOLL_POPULATE) &&
+ (vma->vm_flags & VM_MIXEDMAP))
+ goto next_page;
+
return i ? i : PTR_ERR(page);
+ }
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
[PATCH 0/3] mm, x86: Fix ioremap RAM check interfaces
by Toshi Kani
ioremap() checks if a target range is RAM and fails the request
if true. There are multiple issues in the iormap RAM check
interfaces.
1. region_is_ram() does not work at all.
2. The RAM checks, region_is_ram() and __ioremap_caller() via
walk_system_ram_range(), are redundant.
3. walk_system_ram_range() requires the RAM ranges in the resource
table to be page-aligned. This restriction has allowed multiple
ioremap calls to setup_data, which is not page-aligned.
This patchset solves issues 1 and 2. Issue 3 is not addressed in
this patchset, but it is taken into account: ioremap continues
to allow such callers until issue 3 is addressed.
---
Toshi Kani (3):
1/3 mm, x86: Fix warning in ioremap RAM check
2/3 mm, x86: Remove region_is_ram() call from ioremap
3/3 mm: Fix bugs in region_is_ram()
---
arch/x86/mm/ioremap.c | 23 ++++++-----------------
kernel/resource.c | 6 +++---
2 files changed, 9 insertions(+), 20 deletions(-)
[PATCH 1/7] drivers/block/pmem: Add a driver for persistent memory
by David Nyström
From: Ross Zwisler <ross.zwisler(a)linux.intel.com>
This is a combination of 4 commits.
drivers/block/pmem: Add a driver for persistent memory
Commit-ID: 9e853f2313e5eb163cb1ea461b23c2332cf6438a
Gitweb: http://git.kernel.org/tip/9e853f2313e5eb163cb1ea461b23c2332cf6438a
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
AuthorDate: Wed, 1 Apr 2015 09:12:19 +0200
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitDate: Wed, 1 Apr 2015 17:03:56 +0200
PMEM is a new driver that presents a reserved range of memory as
a block device. This is useful for developing with NV-DIMMs,
and can be used with volatile memory as a development platform.
This patch contains the initial driver from Ross Zwisler, with
various changes: converted it to use a platform_device for
discovery, fixed partition support and merged various patches
from Boaz Harrosh.
Tested-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Acked-by: Dan Williams <dan.j.williams(a)intel.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Boaz Harrosh <boaz(a)plexistor.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Keith Busch <keith.busch(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Matthew Wilcox <willy(a)linux.intel.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-nvdimm(a)ml01.01.org
Link: http://lkml.kernel.org/r/1427872339-6688-3-git-send-email-hch@lst.de
[ Minor cleanups. ]
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
pmem: Add prints at pmem_probe/remove
Add small prints at creation/removal of pmem devices, so we can
see in the dmesg log when users loaded/unloaded the pmem driver
and what devices were created.
The prints will look like this:
Printed by e820 on load:
[ +0.000000] user: [mem 0x0000000100000000-0x000000015fffffff] persistent (type 12)
[ +0.000000] user: [mem 0x0000000160000000-0x00000001dfffffff] persistent (type 12)
...
Printed by modprobe pmem:
[ +0.003065] pmem pmem.0.auto: probe [0x0000000100000000:0x60000000]
[ +0.001816] pmem pmem.1.auto: probe [0x0000000160000000:0x80000000]
...
Printed by modprobe -r pmem:
[ +16.299145] pmem pmem.1.auto: remove
[ +0.011155] pmem pmem.0.auto: remove
Signed-off-by: Boaz Harrosh <boaz(a)plexistor.com>
pmem: Split out pmem_mapmem from pmem_alloc
I need this as preparation for supporting different
mapping schemes later.
Signed-off-by: Boaz Harrosh <boaz(a)plexistor.com>
pmem: Support map= module param
Introduce a new map= module param for the pmem driver.
The map= param is an alternative way to create pmem
devices. If map= is left empty (the default), the
platform devices will be loaded just as before.
But if map= is not empty, the platform devices
will not be considered and only the ranges specified
by map= will be created.
map= param is of the form:
map=mapS[,mapS...]
where mapS=nn[KMG]$ss[KMG],
or mapS=nn[KMG]@ss[KMG],
nn=size, ss=offset
This is just like the kernel command-line map and memmap parameters,
so anything you specified in grub can be copied here.
The "@" form is exactly the same as the "$" form, except that at
a bash prompt the "$" must be escaped as \$, so the '@' character
is also supported for convenience.
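For example (hypothetical sizes and offsets; they must match memory
ranges actually reserved on your system, e.g. via the boot-time
memmap parameter), either of the following creates two 2G pmem devices:
    modprobe pmem map=2G@16G,2G@18G
    modprobe pmem map=2G\$16G,2G\$18G
The second line is the '$' form escaped for a bash prompt; both are
equivalent.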
For each specified mapS a device will be created.
On unload of the driver, all successfully created devices
will be removed.
NOTE: If at least one mapS is created successfully then
modprobe will return success and the driver will stay
loaded. However, loading stops at the first error;
any error messages will be printed to dmesg.
Signed-off-by: Boaz Harrosh <boaz(a)plexistor.com>
---
MAINTAINERS | 6 +
drivers/block/Kconfig | 11 ++
drivers/block/Makefile | 1 +
drivers/block/pmem.c | 377 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 395 insertions(+)
create mode 100644 drivers/block/pmem.c
diff --git a/MAINTAINERS b/MAINTAINERS
index d3b1571..d5bf0da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6711,6 +6711,12 @@ S: Maintained
F: Documentation/blockdev/ramdisk.txt
F: drivers/block/brd.c
+PERSISTENT MEMORY DRIVER
+M: Ross Zwisler <ross.zwisler(a)linux.intel.com>
+L: linux-nvdimm(a)lists.01.org
+S: Supported
+F: drivers/block/pmem.c
+
RANDOM NUMBER DRIVER
M: Theodore Ts'o" <tytso(a)mit.edu>
S: Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b81ddfe..860e8d1 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -387,6 +387,17 @@ config BLK_DEV_XIP
will prevent RAM block device backing store memory from being
allocated from highmem (only a problem for highmem systems).
+config BLK_DEV_PMEM
+ tristate "Persistent memory block device support"
+ help
+ Saying Y here will allow you to use a contiguous range of reserved
+ memory as one or more persistent block devices.
+
+ To compile this driver as a module, choose M here: the module will be
+ called 'pmem'.
+
+ If unsure, say N.
+
config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index ca07399..6256f6e 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM) += ps3vram.o
obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o
obj-$(CONFIG_BLK_DEV_RAM) += brd.o
+obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o
obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
obj-$(CONFIG_BLK_CPQ_DA) += cpqarray.o
obj-$(CONFIG_BLK_CPQ_CISS_DA) += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..f756e5b
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,377 @@
+/*
+ * Persistent Memory Driver
+ *
+ * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig <hch(a)lst.de>.
+ * Copyright (c) 2015, Boaz Harrosh <boaz(a)plexistor.com>.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <asm/cacheflush.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+
+#define PMEM_MINORS 16
+
+struct pmem_device {
+ struct list_head pmem_list;
+ struct request_queue *pmem_queue;
+ struct gendisk *pmem_disk;
+
+ /* One contiguous memory region per device */
+ phys_addr_t phys_addr;
+ void *virt_addr;
+ size_t size;
+};
+
+static int pmem_major;
+static atomic_t pmem_index;
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
+ unsigned int len, unsigned int off, int rw,
+ sector_t sector)
+{
+ void *mem = kmap_atomic(page);
+ size_t pmem_off = sector << 9;
+
+ if (rw == READ) {
+ memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+ flush_dcache_page(page);
+ } else {
+ flush_dcache_page(page);
+ memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+ }
+
+ kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct block_device *bdev = bio->bi_bdev;
+ struct pmem_device *pmem = bdev->bd_disk->private_data;
+ int rw;
+ struct bio_vec bvec;
+ sector_t sector;
+ struct bvec_iter iter;
+ int err = 0;
+
+ if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+ err = -EIO;
+ goto out;
+ }
+
+ BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+ rw = bio_data_dir(bio);
+ sector = bio->bi_iter.bi_sector;
+ bio_for_each_segment(bvec, bio, iter) {
+ pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+ rw, sector);
+ sector += bvec.bv_len >> 9;
+ }
+
+out:
+ bio_endio(bio, err);
+}
+
+static int pmem_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+ pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+ page_endio(page, rw & WRITE, 0);
+
+ return 0;
+}
+
+static long pmem_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
+{
+ struct pmem_device *pmem = bdev->bd_disk->private_data;
+ size_t offset = sector << 9;
+
+ if (!pmem)
+ return -ENODEV;
+
+ *kaddr = pmem->virt_addr + offset;
+ *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+
+ return pmem->size - offset;
+}
+
+static const struct block_device_operations pmem_fops = {
+ .owner = THIS_MODULE,
+ .rw_page = pmem_rw_page,
+ .direct_access = pmem_direct_access,
+};
+
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+static int pmem_mapmem(struct pmem_device *pmem, struct device *dev)
+{
+ if (!request_mem_region(pmem->phys_addr, pmem->size, "pmem")) {
+ dev_warn(dev, "could not reserve region [0x%llx:0x%zx]\n",
+ pmem->phys_addr, pmem->size);
+ return -EINVAL;
+ }
+
+ /*
+ * Map the memory as non-cachable, as we can't write back the contents
+ * of the CPU caches in case of a crash.
+ */
+ pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+ if (!pmem->virt_addr) {
+ dev_warn(dev, "could not ioremap_nocache [0x%llx:0x%zx]\n",
+ pmem->phys_addr, pmem->size);
+ release_mem_region(pmem->phys_addr, pmem->size);
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+static void pmem_unmapmem(struct pmem_device *pmem)
+{
+ if (unlikely(!pmem->virt_addr))
+ return;
+
+ iounmap(pmem->virt_addr);
+ release_mem_region(pmem->phys_addr, pmem->size);
+ pmem->virt_addr = NULL;
+}
+
+static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
+{
+ struct pmem_device *pmem;
+ struct gendisk *disk;
+ int idx, err;
+
+ err = -ENOMEM;
+ pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+ if (!pmem)
+ goto out;
+
+ pmem->phys_addr = res->start;
+ pmem->size = resource_size(res);
+
+ err = pmem_mapmem(pmem, dev);
+ if (err)
+ goto out_free_dev;
+
+ pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+ if (!pmem->pmem_queue)
+ goto out_unmap;
+
+ blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+ blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+ blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+ disk = alloc_disk(PMEM_MINORS);
+ if (!disk)
+ goto out_free_queue;
+
+ idx = atomic_inc_return(&pmem_index) - 1;
+
+ disk->major = pmem_major;
+ disk->first_minor = PMEM_MINORS * idx;
+ disk->fops = &pmem_fops;
+ disk->private_data = pmem;
+ disk->queue = pmem->pmem_queue;
+ disk->flags = GENHD_FL_EXT_DEVT;
+ sprintf(disk->disk_name, "pmem%d", idx);
+ disk->driverfs_dev = dev;
+ set_capacity(disk, pmem->size >> 9);
+ pmem->pmem_disk = disk;
+
+ add_disk(disk);
+
+ return pmem;
+
+out_free_queue:
+ blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+ pmem_unmapmem(pmem);
+out_free_dev:
+ kfree(pmem);
+out:
+ return ERR_PTR(err);
+}
+
+static void pmem_free(struct pmem_device *pmem)
+{
+ del_gendisk(pmem->pmem_disk);
+ put_disk(pmem->pmem_disk);
+ blk_cleanup_queue(pmem->pmem_queue);
+ pmem_unmapmem(pmem);
+ kfree(pmem);
+}
+
+static int pmem_probe(struct platform_device *pdev)
+{
+ struct pmem_device *pmem;
+ struct resource *res;
+
+ if (WARN_ON(pdev->num_resources > 1))
+ return -ENXIO;
+
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ if (!res)
+ return -ENXIO;
+
+ pmem = pmem_alloc(&pdev->dev, res);
+ if (IS_ERR(pmem))
+ return PTR_ERR(pmem);
+
+ platform_set_drvdata(pdev, pmem);
+ dev_info(&pdev->dev, "probe [%pa:0x%zx]\n",
+ &pmem->phys_addr, pmem->size);
+
+ return 0;
+}
+
+static int pmem_remove(struct platform_device *pdev)
+{
+ struct pmem_device *pmem = platform_get_drvdata(pdev);
+
+ dev_info(&pdev->dev, "remove\n");
+ pmem_free(pmem);
+ return 0;
+}
+
+static struct platform_driver pmem_driver = {
+ .probe = pmem_probe,
+ .remove = pmem_remove,
+ .driver = {
+ .owner = THIS_MODULE,
+ .name = "pmem",
+ },
+};
+
+static char *map;
+module_param(map, charp, S_IRUGO);
+MODULE_PARM_DESC(map,
+ "pmem device mapping: map=mapS[,mapS...] where:\n"
+ "mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
+
+static LIST_HEAD(pmem_devices);
+
+static int __init
+pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size)
+{
+ char *p = map;
+
+ *size = (size_t)memparse(p, &p);
+ if ((p == map) || ((*p != '$') && (*p != '@')))
+ return -EINVAL;
+
+ if (!*(++p))
+ return -EINVAL;
+
+ *start = (phys_addr_t)memparse(p, &p);
+
+ return *p == '\0' ? 0 : -EINVAL;
+}
+
+static int __init _load_from_map(void)
+{
+ struct pmem_device *pmem;
+ char *p, *pmem_map, *map_dup;
+ int err = -ENODEV;
+
+ map_dup = pmem_map = kstrdup(map, GFP_KERNEL);
+ if (unlikely(!pmem_map)) {
+ pr_debug("pmem_init strdup(%s) failed\n", map);
+ return -ENOMEM;
+ }
+
+ while ((p = strsep(&pmem_map, ",")) != NULL) {
+ struct resource res = {.start = 0};
+ size_t disk_size;
+
+ if (!*p)
+ continue;
+ err = pmem_parse_map_one(p, &res.start, &disk_size);
+ if (err)
+ goto out;
+ /*TODO: check alignments */
+
+ res.end = res.start + disk_size - 1;
+ pmem = pmem_alloc(NULL, &res);
+ if (IS_ERR(pmem)) {
+ err = PTR_ERR(pmem);
+ goto out;
+ }
+ list_add_tail(&pmem->pmem_list, &pmem_devices);
+ }
+
+out:
+ /* If we have at least one device we stay loaded and rmmod can
+ * clean those that were loaded.
+ */
+ if (!list_empty(&pmem_devices))
+ err = 0;
+
+ pr_info("pmem: init map=%s successful(%d) => %d\n",
+ map, atomic_read(&pmem_index), err);
+ kfree(map_dup);
+ return err;
+}
+
+void _unload_from_map(void)
+{
+ struct pmem_device *pmem, *next;
+
+ list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) {
+ list_del(&pmem->pmem_list);
+ pmem_free(pmem);
+ }
+
+ pr_info("pmem: exit\n");
+}
+
+static int __init pmem_init(void)
+{
+ int error;
+
+ pmem_major = register_blkdev(0, "pmem");
+ if (pmem_major < 0)
+ return pmem_major;
+
+ if (map && *map)
+ return _load_from_map();
+
+ error = platform_driver_register(&pmem_driver);
+ if (error)
+ unregister_blkdev(pmem_major, "pmem");
+ return error;
+}
+module_init(pmem_init);
+
+static void pmem_exit(void)
+{
+ if (list_empty(&pmem_devices))
+ platform_driver_unregister(&pmem_driver);
+ else
+ _unload_from_map();
+
+ unregister_blkdev(pmem_major, "pmem");
+}
+module_exit(pmem_exit);
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler(a)linux.intel.com>");
+MODULE_LICENSE("GPL v2");
--
2.1.1
Persistent memory interface
by Mikulas Patocka
Hi
I looked at the new persistent memory block device driver
(drivers/block/pmem.c and arch/x86/kernel/pmem.c) and it seems that the
interface between them is incorrect.
If I want to use persistent memory in another driver, for a different
purpose, how can I make sure that drivers/block/pmem.c doesn't attach
to this piece of memory and export it? It does not seem possible.
drivers/block/pmem.c attaches to everything, without regard to the fact
that there may be other users of persistent memory.
I think a correct solution would be to add a partition table at the
beginning of the persistent memory area; this partition table would
describe which parts belong to which programs, so that different programs
could use persistent memory and not step over each other's data. Is there
an ongoing effort to standardize such a partition table?
BTW, some journaling filesystems assume that a 512-byte sector is written
atomically. drivers/block/pmem.c breaks this requirement; persistent
memory only guarantees 8-byte atomic writes.
Mikulas