[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
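To illustrate the byte-array point above: a two-byte ID reads differently depending on whether it is taken byte by byte or loaded as a native integer on a little-endian host. The sketch below is a user-space illustration only. The helper names `id_from_bytes()` and `id_as_le_integer()` and the sample bytes are assumptions made up for this example, not part of the NFIT driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Combine a two-byte ID field, stored as an array of bytes, into a
 * number whose hex digits appear in the same order as the bytes.
 * Hypothetical helper for illustration; not a kernel function.
 */
uint16_t id_from_bytes(const uint8_t bytes[2])
{
    assert(bytes != NULL);
    return (uint16_t)((bytes[0] << 8) | bytes[1]);
}

/*
 * What a little-endian host sees if it loads the same two bytes as a
 * native 16-bit integer: the byte order comes out swapped.
 */
uint16_t id_as_le_integer(const uint8_t bytes[2])
{
    assert(bytes != NULL);
    return (uint16_t)((bytes[1] << 8) | bytes[0]);
}
```

For the bytes {0x80, 0x2c}, the first helper yields 0x802c and the second 0x2c80, so code that treats the field as a native integer on x86 has to convert it before printing.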
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
2 years, 8 months
Re: KVM "fake DAX" flushing interface - discussion
by Dan Williams
On Wed, Jul 26, 2017 at 2:27 PM, Rik van Riel <riel(a)redhat.com> wrote:
> On Wed, 2017-07-26 at 09:47 -0400, Pankaj Gupta wrote:
>> >
>> Just want to summarize here(high level):
>>
>> This will require implementing new 'virtio-pmem' device which
>> presents
>> a DAX address range(like pmem) to guest with read/write(direct
>> access)
>> & device flush functionality. Also, qemu should implement
>> corresponding
>> support for flush using virtio.
>>
> Alternatively, the existing pmem code, with
> a flush-only block device on the side, which
> is somehow associated with the pmem device.
>
> I wonder which alternative leads to the least
> code duplication, and the least maintenance
> hassle going forward.
I'd much prefer to have another driver. I.e. a driver that refactors
out some common pmem details into a shared object and can attach to
ND_DEVICE_NAMESPACE_{IO,PMEM}. A control device on the side seems like
a recipe for confusion.
With a $new_driver in hand you can just do:
modprobe $new_driver
echo $namespace > /sys/bus/nd/drivers/nd_pmem/unbind
echo $namespace > /sys/bus/nd/drivers/$new_driver/new_id
echo $namespace > /sys/bus/nd/drivers/$new_driver/bind
...and the guest can arrange for $new_driver to be the default, so you
don't need to do those steps each boot of the VM, by doing:
echo "blacklist nd_pmem" > /etc/modprobe.d/virt-dax-flush.conf
echo "alias nd:t4* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
echo "alias nd:t5* $new_driver" >> /etc/modprobe.d/virt-dax-flush.conf
3 years, 3 months
Enabling peer to peer device transactions for PCIe devices
by Deucher, Alexander
This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata slow, may be lost, corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
3 years, 4 months
File copy to FAT FS on NVDIMM hits BUG_ON at fs/buffer.c:3305!
by Kani, Toshimitsu
Hi,
Copying files to vfat FS on an NVDIMM device hits
BUG_ON(!PageLocked(page)) in try_to_free_buffers(). It happens on
4.13-rc1, and happens on older kernels as well.
A simple reproducer is shown below. It is 100% reproducible on my
setup (8GB of regular memory and 16GB of NVDIMM). It usually hits in
the 3rd or 4th file copy and does not repeat with the while-loop.
Interestingly, it hits only when an NVDIMM device is set as raw or
memory mode. It does not hit with sector mode.
==
DEV=pmem0
set -x
mkfs.vfat /dev/$DEV
mount /dev/$DEV /mnt/$DEV
dd if=/dev/zero of=/mnt/$DEV/1Gfile bs=1M count=1024
while true; do
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-1
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-2
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-3
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-4
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-5
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-6
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-7
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-8
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-9
cp /mnt/$DEV/1Gfile /mnt/$DEV/file-10
done
==
kernel BUG at fs/buffer.c:3305!
invalid opcode: 0000 [#1] SMP
:
Workqueue: writeback wb_workfn (flush-259:0)
task: ffff8d02595b8000 task.stack: ffffa22242400000
RIP: 0010:try_to_free_buffers+0xd2/0xe0
RSP: 0018:ffffa22242403830 EFLAGS: 00010246
RAX: 00afffc000001028 RBX: 0000000000000008 RCX: ffff8d012dcf19c0
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffc468e3b52b80
RBP: ffffa22242403858 R08: 0000000000000000 R09: 000000000002067c
R10: ffff8d027ffe6000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8d022fccdbe0 R14: ffffc468e3b52b80 R15: ffffa22242403ad0
FS: 0000000000000000(0000) GS:ffff8d027fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9d2bb80b70 CR3: 000000084fe09000 CR4: 00000000007406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
clean_buffers+0x5d/0x70
__mpage_writepage+0x567/0x760
? page_mkclean+0x6a/0xb0
write_cache_pages+0x205/0x580
? clean_buffers+0x70/0x70
? fat_add_cluster+0x80/0x80 [fat]
mpage_writepages+0x7c/0x100
? fat_add_cluster+0x80/0x80 [fat]
? __set_page_dirty+0x9b/0xc0
? fprop_fraction_percpu+0x2f/0x80
fat_writepages+0x15/0x20 [fat]
? fat_writepages+0x15/0x20 [fat]
do_writepages+0x25/0x80
__writeback_single_inode+0x45/0x350
writeback_sb_inodes+0x25e/0x610
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x29b/0x340
wb_workfn+0x195/0x3d0
? wb_workfn+0x195/0x3d0
process_one_work+0x193/0x3d0
worker_thread+0x4e/0x3d0
kthread+0x114/0x150
? process_one_work+0x3d0/0x3d0
? kthread_park+0x60/0x60
? kthread_park+0x60/0x60
ret_from_fork+0x25/0x30
:
RIP: try_to_free_buffers+0xd2/0xe0 RSP: ffffa22242403830
Thanks,
-Toshi
3 years, 5 months
[PATCH] nvdimm: Remove minimum size requirement
by Matthew Wilcox
From: Matthew Wilcox <mawilcox(a)microsoft.com>
There was no need to have a minimum size of 4MB for NV-DIMMs; it was
just a sanity check. Keep a check that it's at least one page in size
because we really can't add less than a page to the memory map. Promote
the print statement from 'debug' level to 'warning', since there was no
information for my colleague who stumbled over this problem while
attempting to add a 2MB chunk of memory.
Reported-by: Cheng-mean Liu <soccerl(a)microsoft.com>
Signed-off-by: Matthew Wilcox <mawilcox(a)microsoft.com>
---
drivers/nvdimm/namespace_devs.c | 6 +++---
include/uapi/linux/ndctl.h | 4 ----
2 files changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 5f1c6756e57c..95169308078a 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1689,9 +1689,9 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
}
size = nvdimm_namespace_capacity(ndns);
- if (size < ND_MIN_NAMESPACE_SIZE) {
- dev_dbg(&ndns->dev, "%pa, too small must be at least %#x\n",
- &size, ND_MIN_NAMESPACE_SIZE);
+ if (size < PAGE_SIZE) {
+ dev_warn(&ndns->dev, "%pa, too small must be at least %ld\n",
+ &size, PAGE_SIZE);
return ERR_PTR(-ENODEV);
}
diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 6d3c54264d8e..3ad1623bb585 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -299,10 +299,6 @@ enum nd_driver_flags {
ND_DRIVER_DAX_PMEM = 1 << ND_DEVICE_DAX_PMEM,
};
-enum {
- ND_MIN_NAMESPACE_SIZE = 0x00400000,
-};
-
enum ars_masks {
ARS_STATUS_MASK = 0x0000FFFF,
ARS_EXT_STATUS_SHIFT = 16,
--
2.11.0
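As a rough user-space illustration of the change (not the kernel code itself), the check reduces to a comparison against the page size instead of the old 4MB constant. The function names below are hypothetical, and the page size is passed in as a parameter rather than taken from the kernel's PAGE_SIZE. The value 0x00400000 is the ND_MIN_NAMESPACE_SIZE removed by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value of the removed ND_MIN_NAMESPACE_SIZE constant (4MB). */
#define OLD_MIN_NAMESPACE_SIZE 0x00400000ULL

/* Old behaviour: anything under 4MB was rejected. */
bool old_namespace_size_ok(uint64_t size)
{
    return size >= OLD_MIN_NAMESPACE_SIZE;
}

/* New behaviour: only sizes smaller than one page are rejected. */
bool new_namespace_size_ok(uint64_t size, uint64_t page_size)
{
    return size >= page_size;
}
```

With a 4096-byte page, the 2MB chunk mentioned in the commit message fails the old check and passes the new one.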
3 years, 6 months
[RFC PATCH 0/7] dax, ext4: Synchronous page faults
by Jan Kara
Hello,
after the last discussions about whether / how to make flushing of DAX mappings
possible from userspace so that they can be flushed at finer than page
granularity and also avoid the overhead of a syscall, I've decided to take
a stab at implementing the "synchronous page faults" idea for ext4 so that
we can see whether it is reasonably possible to implement and what
such an implementation would look like. This patch set is the result.
So the functionality these patches implement: We have an inode flag (currently
I abuse S_SYNC inode flag for this and IMHO it kind of makes sense but if
people hate that I'm certainly open to using new flag in the final
implementation) that marks inode as requiring synchronous page faults.
The guarantee provided by this flag on inode is: While a block is writeably
mapped into page tables, it is guaranteed to be visible in the file at that
offset also after a crash.
How I implement this is that ->iomap_begin() indicates by a flag that inode
block mapping metadata is unstable and may need flushing (use the same test as
whether fdatasync() has metadata to write). If yes, DAX maps page table entries
read-only and returns special flag VM_FAULT_RO to the filesystem fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls DAX code to write-enable the page table entries.
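The ordering described above can be modelled in a few lines of user-space C. This is only a toy model under stated assumptions: `struct toy_file`, `toy_fdatasync()` and `toy_write_fault()` are hypothetical names, and nothing here is actual kernel or DAX code. It only shows that the entry becomes writable after the metadata flush, never before.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pte_state { PTE_READ_ONLY, PTE_WRITABLE };

/* Toy stand-in for an inode plus one page table entry. */
struct toy_file {
    bool sync_faults;     /* plays the role of the S_SYNC-style flag */
    bool metadata_dirty;  /* block mapping not yet on stable storage */
    int fsync_calls;      /* number of simulated fdatasync() calls */
    enum pte_state pte;   /* state of the simulated page table entry */
};

/* Simulated fdatasync(): makes the block mapping durable. */
void toy_fdatasync(struct toy_file *f)
{
    assert(f != NULL);
    f->metadata_dirty = false;
    f->fsync_calls++;
}

/* Simulated write fault following the read-only, flush, write-enable order. */
enum pte_state toy_write_fault(struct toy_file *f)
{
    assert(f != NULL);
    if (f->sync_faults && f->metadata_dirty) {
        /* Map read-only first (the VM_FAULT_RO case in the text). */
        f->pte = PTE_READ_ONLY;
        /* The filesystem handler flushes the unstable metadata. */
        toy_fdatasync(f);
    }
    /* Only now is the entry write-enabled. */
    f->pte = PTE_WRITABLE;
    return f->pte;
}
```

In this model a second fault on the same file skips the flush, since the metadata is already stable, and a file without the flag never triggers it.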
From my (fairly limited) knowledge of XFS it seems XFS should be able to do the
same and it should be even possible for filesystem to implement safe remapping
of a file offset to a different block (i.e. break reflink, do defrag, or
similar stuff) like:
1) Block page faults
2) fdatasync() remapped range (there can be outstanding data modifications
not yet flushed)
3) unmap_mapping_range()
4) Now remap blocks
5) Unblock page faults
Basically we do the same on events like punch hole so there is not much new
there.
There are a couple of open questions with this implementation:
1) Is it worth the hassle?
2) Is S_SYNC good flag to use or should we use a new inode flag?
3) VM_FAULT_RO and especially passing of resulting 'pfn' from
dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
things to make this cleaner.
Anyway, here are the patches, comments are welcome.
Honza
3 years, 6 months
don't control-c during ndctl create-namespace?
by Linda Knippers
Hi Dan,
I've got 4 NVDIMMs in an interleave set in a configuration that supports labels.
I'm running a 4.12 kernel with the latest ndctl.
I had three namespaces configured and all seemed well. When I configured the
fourth one, I made a mistake in the name so I hit control-c. I wasn't sure what
state I was in but according to what I could see with ndctl, it had created the
namespace but not enabled it, so I enabled it manually with ndctl and that
seemed ok.
Then I tried to use ndctl create-namespace to change the name, which failed
because the namespace was enabled so I disabled it and tried again. At some
point, not really sure where, I got this kernel warning:
# [ 5224.196085] nd namespace4.3: failed to track label: 4
(details in the attached file)
At this point I rebooted the system. When it came back up, nmem0 was disabled.
I dumped the labels (also attached) and I see that nmem0 has some extra labels
that correspond to the namespace that I was struggling with.
I think my troubles started with the control-c. It doesn't look like ndctl traps
signals when creating namespaces so perhaps we can get into an inconsistent
state.
It also seems like that kernel warning is a bit more important than a
WARN_ONCE would imply. I think that was the beginning of the end of my
configuration. It might have been better to just panic.
I was trying to figure out if I could fix my configuration without
losing the good namespaces but I don't see a way. The check-labels option
isn't very helpful because I think it only looks at the info blocks,
which are fine, even though the labels on nmem0 are not. The destroy-namespace
option doesn't help because it only works with a good namespace.
I'm going to wipe my nvdimms and start over. I suspect the problem is
reproducible but it could depend on the timing of the control-c, unless
the root cause was actually trying to rename a namespace. Maybe I'll try
that again but not today.
-- ljk
3 years, 6 months
Standardization of ACPI NVDIMM DSMs
by Rebecca Cran
I'm pretty new to ACPI work so it's possible I'm misunderstanding
something. I've recently started working on NVDIMMs, and have
noticed that both HPE and Intel have DSM "Example" interfaces that are
referenced/used in Linux. I've been wondering if there's a reason
the content from both couldn't be combined and added to the ACPI
specification with sufficient vendor-specific fields to support the
cases where they need to differ?
--
Rebecca Cran
3 years, 6 months
[PATCH 0/6] arm64 pmem support
by Robin Murphy
Hi all,
With the latest updates to the pmem API, the arch code contribution
becomes very straightforward to wire up - I think there's about as
much code here to just cope with the existence of our new instruction
as there is to actually make use of it. I don't have access to any
NVDIMMs nor suitable hardware to put them in, so this is written purely
to spec - the extent of testing has been the feature detection on a
v8.2 Fast Model vs. v8.0 systems.
Patch #1 could go in as a fix ahead of the rest; it just needs to come
before patch #5 to prevent that blowing up the build.
Robin.
Robin Murphy (6):
arm64: mm: Fix set_memory_valid() declaration
arm64: Convert __inval_cache_range() to area-based
arm64: Expose DC CVAP to userspace
arm64: Handle trapped DC CVAP
arm64: Implement pmem API support
arm64: uaccess: Implement *_flushcache variants
Documentation/arm64/cpu-feature-registers.txt | 2 ++
arch/arm64/Kconfig | 12 +++++++
arch/arm64/include/asm/assembler.h | 6 ++++
arch/arm64/include/asm/cacheflush.h | 4 ++-
arch/arm64/include/asm/cpucaps.h | 3 +-
arch/arm64/include/asm/esr.h | 3 +-
arch/arm64/include/asm/string.h | 4 +++
arch/arm64/include/asm/sysreg.h | 1 +
arch/arm64/include/asm/uaccess.h | 12 +++++++
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/kernel/cpufeature.c | 13 ++++++++
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/head.S | 18 +++++-----
arch/arm64/kernel/traps.c | 3 ++
arch/arm64/lib/Makefile | 2 ++
arch/arm64/lib/uaccess_flushcache.c | 47 +++++++++++++++++++++++++++
arch/arm64/mm/cache.S | 37 ++++++++++++++++-----
arch/arm64/mm/pageattr.c | 18 ++++++++++
18 files changed, 166 insertions(+), 21 deletions(-)
create mode 100644 arch/arm64/lib/uaccess_flushcache.c
--
2.12.2.dirty
3 years, 6 months
[PATCH] ndctl: daxctl: Adding io option for daxctl
by Dave Jiang
The daxctl io option allows I/O to be performed between a block device or
file and a device-dax device, in either direction. It also provides a way
to zero a device-dax device.
e.g. daxctl io --input=/home/myfile --output=/dev/dax1.0
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
Documentation/Makefile.am | 3
Documentation/daxctl-io.txt | 71 +++++
daxctl/Makefile.am | 5
daxctl/daxctl.c | 2
daxctl/io.c | 567 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 646 insertions(+), 2 deletions(-)
create mode 100644 Documentation/daxctl-io.txt
create mode 100644 daxctl/io.c
diff --git a/Documentation/Makefile.am b/Documentation/Makefile.am
index c7e0758..8efdbc2 100644
--- a/Documentation/Makefile.am
+++ b/Documentation/Makefile.am
@@ -26,7 +26,8 @@ man1_MANS = \
ndctl-destroy-namespace.1 \
ndctl-check-namespace.1 \
ndctl-list.1 \
- daxctl-list.1
+ daxctl-list.1 \
+ daxctl-io.1
CLEANFILES = $(man1_MANS)
diff --git a/Documentation/daxctl-io.txt b/Documentation/daxctl-io.txt
new file mode 100644
index 0000000..c3ddd15
--- /dev/null
+++ b/Documentation/daxctl-io.txt
@@ -0,0 +1,71 @@
+daxctl-io(1)
+===========
+
+NAME
+----
+daxctl-io - Perform I/O on Device-DAX devices or zero a Device-DAX device.
+
+SYNOPSIS
+--------
+[verse]
+'daxctl io' [<options>]
+
+There must be a Device-DAX device involved whether as the input or the output
+device. Read from a Device-DAX device and write to a file, a block device,
+another Device-DAX device, or stdout (if no output is provided). Write
+to a Device-DAX device from a file, a block device, or stdin, or another
+Device-DAX device.
+
+No length specified will default to input file/device length. If input is
+a special char file then length will be the output file/device length.
+
+No input will default to stdin. No output will default to stdout.
+
+For a Device-DAX device, attempts to clear badblocks within range of writes
+will be performed.
+
+EXAMPLE
+-------
+[verse]
+# daxctl io --zero /dev/dax1.0
+
+# daxctl io --input=/dev/dax1.0 --output=/home/myfile --len=2097152 --seek=4096
+
+# cat /dev/zero | daxctl io --output=/dev/dax1.0
+
+# daxctl io --input=/dev/zero --output=/dev/dax1.0 --skip=4096
+
+OPTIONS
+-------
+-i::
+--input=::
+ Input device or file to read from.
+
+-o::
+--output=::
+ Output device or file to write to.
+
+-z::
+--zero::
+ Zero the output device for 'len' bytes, or the entire device if no
+ length is provided. The output device must be a Device-DAX device.
+
+-l::
+--len::
+ The length in bytes to perform the I/O.
+
+-s::
+--seek::
+ The number of bytes to skip over on the output before performing a
+ write.
+
+-k::
+--skip::
+ The number of bytes to skip over on the input before performing a read.
+
+COPYRIGHT
+---------
+Copyright (c) 2017, Intel Corporation. License GPLv2: GNU GPL
+version 2 <http://gnu.org/licenses/gpl.html>. This is free software:
+you are free to change and redistribute it. There is NO WARRANTY, to
+the extent permitted by law.
diff --git a/daxctl/Makefile.am b/daxctl/Makefile.am
index fe467d0..1ba1f07 100644
--- a/daxctl/Makefile.am
+++ b/daxctl/Makefile.am
@@ -5,10 +5,13 @@ bin_PROGRAMS = daxctl
daxctl_SOURCES =\
daxctl.c \
list.c \
+ io.c \
../util/json.c
daxctl_LDADD =\
lib/libdaxctl.la \
+ ../ndctl/lib/libndctl.la \
../libutil.a \
$(UUID_LIBS) \
- $(JSON_LIBS)
+ $(JSON_LIBS) \
+ -lpmem
diff --git a/daxctl/daxctl.c b/daxctl/daxctl.c
index 91a4600..db2e495 100644
--- a/daxctl/daxctl.c
+++ b/daxctl/daxctl.c
@@ -67,11 +67,13 @@ static int cmd_help(int argc, const char **argv, void *ctx)
}
int cmd_list(int argc, const char **argv, void *ctx);
+int cmd_io(int argc, const char **argv, void *ctx);
static struct cmd_struct commands[] = {
{ "version", cmd_version },
{ "list", cmd_list },
{ "help", cmd_help },
+ { "io", cmd_io },
};
int main(int argc, const char **argv)
diff --git a/daxctl/io.c b/daxctl/io.c
new file mode 100644
index 0000000..92e2878
--- /dev/null
+++ b/daxctl/io.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright(c) 2015-2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
+#include <sys/param.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <limits.h>
+#include <libgen.h>
+#include <libpmem.h>
+#include <util/json.h>
+#include <util/filter.h>
+#include <json-c/json.h>
+#include <daxctl/libdaxctl.h>
+#include <ccan/short_types/short_types.h>
+#include <util/parse-options.h>
+#include <ccan/array_size/array_size.h>
+#include <ndctl/ndctl.h>
+
+enum io_direction {
+ IO_READ = 0,
+ IO_WRITE,
+};
+
+struct io_dev {
+ int fd;
+ int major;
+ int minor;
+ void *mmap;
+ const char *parm_path;
+ char *real_path;
+ uint64_t offset;
+ enum io_direction direction;
+ bool is_dax;
+ bool is_char;
+ bool is_new;
+ bool need_trunc;
+ struct ndctl_ctx *ndctx;
+ struct ndctl_region *region;
+ struct ndctl_dax *dax;
+ uint64_t size;
+};
+
+static struct {
+ struct io_dev dev[2];
+ bool zero;
+ uint64_t len;
+ struct ndctl_cmd *ars_cap;
+ struct ndctl_cmd *clear_err;
+} io = {
+ .dev[0].fd = -1,
+ .dev[1].fd = -1,
+};
+
+#define fail(fmt, ...) \
+do { \
+ fprintf(stderr, "daxctl-%s:%s:%d: " fmt, \
+ VERSION, __func__, __LINE__, ##__VA_ARGS__); \
+} while (0)
+
+static bool is_stdinout(struct io_dev *io_dev)
+{
+ return (io_dev->fd == STDIN_FILENO ||
+ io_dev->fd == STDOUT_FILENO) ? true : false;
+}
+
+static int setup_device(struct io_dev *io_dev, struct ndctl_ctx *ctx,
+ size_t size)
+{
+ int flags, rc;
+
+ if (is_stdinout(io_dev))
+ return 0;
+
+ if (io_dev->is_new)
+ flags = O_CREAT|O_WRONLY|O_TRUNC;
+ else if (io_dev->need_trunc)
+ flags = O_RDWR | O_TRUNC;
+ else
+ flags = O_RDWR;
+
+ io_dev->fd = open(io_dev->parm_path, flags, S_IRUSR|S_IWUSR);
+ if (io_dev->fd == -1) {
+ rc = -errno;
+ perror("open");
+ return rc;
+ }
+
+ if (!io_dev->is_dax)
+ return 0;
+
+ flags = (io_dev->direction == IO_READ) ? PROT_READ : PROT_WRITE;
+ io_dev->mmap = mmap(NULL, size, flags, MAP_SHARED, io_dev->fd, 0);
+ if (io_dev->mmap == MAP_FAILED) {
+ rc = -errno;
+ perror("mmap");
+ return rc;
+ }
+
+ return 0;
+}
+
+static int match_device(struct io_dev *io_dev, struct daxctl_region *dregion)
+{
+ struct daxctl_dev *dev;
+
+ daxctl_dev_foreach(dregion, dev) {
+ if (io_dev->major == daxctl_dev_get_major(dev) &&
+ io_dev->minor == daxctl_dev_get_minor(dev)) {
+ io_dev->is_dax = true;
+ io_dev->size = daxctl_dev_get_size(dev);
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
+static int find_dax_device(struct io_dev *io_dev, struct ndctl_ctx *ndctx,
+ enum io_direction dir)
+{
+ struct ndctl_bus *bus;
+ struct ndctl_region *region;
+ struct ndctl_dax *dax;
+ struct daxctl_region *dregion;
+ struct stat st;
+ int rc;
+ char cdev_path[256];
+ char link_path[256];
+ char *dev_name;
+
+ if (is_stdinout(io_dev)) {
+ io_dev->size = ULONG_MAX;
+ return 0;
+ }
+
+ rc = stat(io_dev->parm_path, &st);
+ if (rc == -1) {
+ rc = -errno;
+ if (rc == -ENOENT && dir == IO_WRITE) {
+ io_dev->is_new = true;
+ io_dev->size = ULONG_MAX;
+ return 0;
+ }
+ perror("stat");
+ return rc;
+ }
+
+ if (S_ISREG(st.st_mode)) {
+ if (dir == IO_WRITE) {
+ io_dev->need_trunc = true;
+ io_dev->size = ULONG_MAX;
+ } else
+ io_dev->size = st.st_size;
+ return 0;
+ } else if (S_ISBLK(st.st_mode)) {
+ io_dev->size = st.st_size;
+ return 0;
+ } else if (S_ISCHR(st.st_mode)) {
+ io_dev->size = ULONG_MAX;
+ io_dev->is_char = true;
+ io_dev->major = major(st.st_rdev);
+ io_dev->minor = minor(st.st_rdev);
+ } else
+ return -ENODEV;
+
+ rc = snprintf(cdev_path, 255, "/sys/dev/char/%u:%u", io_dev->major,
+ io_dev->minor);
+ if (rc < 0) {
+ fail("snprintf\n");
+ return -ENXIO;
+ }
+
+ rc = readlink(cdev_path, link_path, 255);
+ if (rc == -1) {
+ rc = -errno;
+ perror("readlink");
+ return rc;
+ }
+ link_path[rc] = '\0';
+ dev_name = basename(link_path);
+
+ ndctl_bus_foreach(ndctx, bus)
+ ndctl_region_foreach(bus, region)
+ ndctl_dax_foreach(region, dax) {
+ if (strncmp(dev_name,
+ ndctl_dax_get_devname(dax),
+ 256))
+ continue;
+
+ dregion = ndctl_dax_get_daxctl_region(dax);
+ if(match_device(io_dev, dregion)) {
+ io_dev->region = region;
+ io_dev->dax = dax;
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static int send_clear_error(struct ndctl_bus *bus, uint64_t start, uint64_t size)
+{
+ uint64_t cleared;
+ int rc;
+
+ io.clear_err = ndctl_bus_cmd_new_clear_error(start, size, io.ars_cap);
+ if (!io.clear_err) {
+ fail("bus: %s failed to create cmd\n",
+ ndctl_bus_get_provider(bus));
+ return -ENXIO;
+ }
+
+ rc = ndctl_cmd_submit(io.clear_err);
+ if (rc) {
+ fail("bus: %s failed to submit cmd: %d\n",
+ ndctl_bus_get_provider(bus), rc);
+ ndctl_cmd_unref(io.clear_err);
+ return rc;
+ }
+
+ cleared = ndctl_cmd_clear_error_get_cleared(io.clear_err);
+ if (cleared != size) {
+ fail("bus: %s expected to clear: %ld actual: %ld\n",
+ ndctl_bus_get_provider(bus),
+ size, cleared);
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+static int get_ars_cap(struct ndctl_bus *bus, uint64_t start, uint64_t size)
+{
+ int rc;
+
+ io.ars_cap = ndctl_bus_cmd_new_ars_cap(bus, start, size);
+ if (!io.ars_cap) {
+ fail("bus: %s failed to create cmd\n",
+ ndctl_bus_get_provider(bus));
+ return -ENOTTY;
+ }
+
+ rc = ndctl_cmd_submit(io.ars_cap);
+ if (rc) {
+ fail("bus: %s failed to submit cmd: %d\n",
+ ndctl_bus_get_provider(bus), rc);
+ ndctl_cmd_unref(io.ars_cap);
+ return rc;
+ }
+
+ if (ndctl_cmd_ars_cap_get_size(io.ars_cap) <
+ sizeof(struct nd_cmd_ars_status)) {
+ fail("bus: %s expected size >= %zd got: %d\n",
+ ndctl_bus_get_provider(bus),
+ sizeof(struct nd_cmd_ars_status),
+ ndctl_cmd_ars_cap_get_size(io.ars_cap));
+ ndctl_cmd_unref(io.ars_cap);
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+int clear_errors(struct ndctl_bus *bus, uint64_t start, uint64_t len)
+{
+ int rc;
+
+ rc = get_ars_cap(bus, start, len);
+ if (rc) {
+ fail("get_ars_cap failed\n");
+ return rc;
+ }
+
+ rc = send_clear_error(bus, start, len);
+ if (rc) {
+ fail("send_clear_error failed\n");
+ return rc;
+ }
+
+ return 0;
+}
+
+static int clear_badblocks(struct io_dev *dev, uint64_t len)
+{
+ unsigned long long dax_begin, dax_size, dax_end;
+ unsigned long long region_begin, offset;
+ unsigned long long size, io_begin, io_end, io_len;
+ struct badblock *bb;
+ int rc;
+
+ dax_begin = ndctl_dax_get_resource(dev->dax);
+ if (dax_begin == ULLONG_MAX)
+ return -ERANGE;
+
+ dax_size = ndctl_dax_get_size(dev->dax);
+ if (dax_size == ULLONG_MAX)
+ return -ERANGE;
+
+ dax_end = dax_begin + dax_size - 1;
+
+ region_begin = ndctl_region_get_resource(dev->region);
+ if (region_begin == ULLONG_MAX)
+ return -ERANGE;
+
+ ndctl_region_badblock_foreach(dev->region, bb) {
+ unsigned long long bb_begin, bb_end, begin, end;
+
+ bb_begin = region_begin + (bb->offset << 9);
+ bb_end = bb_begin + (bb->len << 9) - 1;
+
+ if (bb_end <= dax_begin || bb_begin >= dax_end)
+ continue;
+
+ if (bb_begin < dax_begin)
+ begin = dax_begin;
+ else
+ begin = bb_begin;
+
+ if (bb_end > dax_end)
+ end = dax_end;
+ else
+ end = bb_end;
+
+ offset = begin - dax_begin;
+ size = end - begin + 1;
+
+ /*
+ * Skip if the I/O ends before the badblock range starts, or
+ * the I/O starts after the badblock range ends.
+ */
+ if (dev->offset + len - 1 < offset || dev->offset >= offset + size)
+ continue;
+
+ io_begin = (dev->offset < offset) ? offset : dev->offset;
+ if ((dev->offset + len) < (offset + size))
+ io_end = dev->offset + len;
+ else
+ io_end = offset + size;
+
+ io_len = io_end - io_begin;
+ io_begin += dax_begin;
+ rc = clear_errors(ndctl_region_get_bus(dev->region),
+ io_begin, io_len);
+ if (rc < 0)
+ return rc;
+ }
+
+ return 0;
+}
+
+static ssize_t __do_io(struct io_dev *dst_dev, struct io_dev *src_dev,
+ uint64_t len, bool zero)
+{
+ void *src, *dst;
+ ssize_t rc, count = 0;
+
+ if (zero && dst_dev->is_dax) {
+ dst = (uint8_t *)dst_dev->mmap + dst_dev->offset;
+ memset(dst, 0, len);
+ pmem_persist(dst, len);
+ rc = len;
+ } else if (dst_dev->is_dax && src_dev->is_dax) {
+ src = (uint8_t *)src_dev->mmap + src_dev->offset;
+ dst = (uint8_t *)dst_dev->mmap + dst_dev->offset;
+ pmem_memcpy_persist(dst, src, len);
+ rc = len;
+ } else if (src_dev->is_dax) {
+ src = (uint8_t *)src_dev->mmap + src_dev->offset;
+ if (dst_dev->offset) {
+ rc = lseek(dst_dev->fd, dst_dev->offset, SEEK_SET);
+ if (rc < 0) {
+ rc = -errno;
+ perror("lseek");
+ return rc;
+ }
+ }
+ do {
+ rc = write(dst_dev->fd, (uint8_t *)src + count,
+ len - count);
+ if (rc == -1) {
+ rc = -errno;
+ perror("write");
+ return rc;
+ }
+ count += rc;
+ } while (count != (ssize_t)len);
+ rc = count;
+ if (rc != (ssize_t)len)
+ printf("Requested size %lu larger than destination.\n", len);
+ } else if (dst_dev->is_dax) {
+ dst = (uint8_t *)dst_dev->mmap + dst_dev->offset;
+ if (src_dev->offset) {
+ rc = lseek(src_dev->fd, src_dev->offset, SEEK_SET);
+ if (rc < 0) {
+ rc = -errno;
+ perror("lseek");
+ return rc;
+ }
+ }
+ do {
+ rc = read(src_dev->fd, (uint8_t *)dst + count,
+ len - count);
+ if (rc == -1) {
+ rc = -errno;
+ perror("read");
+ return rc;
+ }
+ /* end of file */
+ if (rc == 0)
+ break;
+ count += rc;
+ } while (count != (ssize_t)len);
+ pmem_persist(dst, count);
+ rc = count;
+ if (rc != (ssize_t)len)
+ printf("Requested size %lu larger than source.\n", len);
+ } else
+ return -EINVAL;
+
+ return rc;
+}
+
+static int do_io(struct ndctl_ctx *ctx)
+{
+ int rc, i, dax_devs = 0;
+
+ /* if we are zeroing the device, we just need output */
+ i = io.zero ? 1 : 0;
+ for (; i < 2; i++) {
+ if (!io.dev[i].parm_path)
+ continue;
+ rc = find_dax_device(&io.dev[i], ctx, i);
+ if (rc < 0)
+ return rc;
+
+ if (rc == 1)
+ dax_devs++;
+ }
+
+ if (dax_devs == 0) {
+ fail("No DAX devices for input or output, fail\n");
+ return -ENODEV;
+ }
+
+ if (io.len == 0) {
+ if (is_stdinout(&io.dev[0]))
+ io.len = io.dev[1].size;
+ else
+ io.len = io.dev[0].size;
+ }
+
+ io.dev[1].direction = IO_WRITE;
+ i = io.zero ? 1 : 0;
+ for (; i < 2; i++) {
+ if (!io.dev[i].parm_path)
+ continue;
+ rc = setup_device(&io.dev[i], ctx, io.len);
+ if (rc < 0)
+ return rc;
+ }
+
+ if (io.dev[1].is_dax) {
+ rc = clear_badblocks(&io.dev[1], io.len);
+ if (rc < 0) {
+ fail("Failed to clear badblocks on %s\n",
+ io.dev[1].parm_path);
+ return rc;
+ }
+ }
+
+ rc = __do_io(&io.dev[1], &io.dev[0], io.len, io.zero);
+ if (rc < 0) {
+ fail("Failed to perform I/O\n");
+ return rc;
+ }
+
+ printf("Data copied %d bytes to device %s\n",
+ rc, io.dev[1].parm_path);
+
+ return 0;
+}
+
+static void cleanup(struct ndctl_ctx *ctx)
+{
+ int i;
+
+ for (i = 0; i < 2; i++) {
+ if (is_stdinout(&io.dev[i]))
+ continue;
+ close(io.dev[i].fd);
+ }
+}
+
+int cmd_io(int argc, const char **argv, void *ctx)
+{
+ const struct option options[] = {
+ OPT_STRING('i', "input", &io.dev[0].parm_path, "in device",
+ "input device/file"),
+ OPT_STRING('o', "output", &io.dev[1].parm_path, "out device",
+ "output device/file"),
+ OPT_BOOLEAN('z', "zero", &io.zero, "zeroing the device"),
+ OPT_U64('l', "len", &io.len, "total length to perform the I/O"),
+ OPT_U64('s', "seek", &io.dev[1].offset, "seek offset for output"),
+ OPT_U64('k', "skip", &io.dev[0].offset, "skip offset for input"),
+ };
+ const char * const u[] = {
+ "daxctl io [<options>]",
+ NULL
+ };
+ int i, rc;
+ struct ndctl_ctx *ndctx;
+
+ argc = parse_options(argc, argv, options, u, 0);
+ for (i = 0; i < argc; i++) {
+ fail("Unknown parameter \"%s\"\n", argv[i]);
+ return -EINVAL;
+ }
+
+ if (argc) {
+ usage_with_options(u, options);
+ return 0;
+ }
+
+ if (!io.dev[0].parm_path && !io.dev[1].parm_path) {
+ usage_with_options(u, options);
+ return 0;
+ }
+
+ if (!io.dev[0].parm_path) {
+ io.dev[0].fd = STDIN_FILENO;
+ io.dev[0].offset = 0;
+ }
+
+ if (!io.dev[1].parm_path) {
+ io.dev[1].fd = STDOUT_FILENO;
+ io.dev[1].offset = 0;
+ }
+
+ rc = ndctl_new(&ndctx);
+ if (rc)
+ return -ENOMEM;
+
+ rc = do_io(ndctx);
+ if (rc < 0)
+ goto out;
+
+ rc = 0;
+out:
+ cleanup(ndctx);
+ ndctl_unref(ndctx);
+ return rc;
+}
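The clamping done in clear_badblocks() is, in essence, an intersection of two byte ranges: the range covered by the I/O and the range covered by one badblock entry. A stand-alone sketch of that intersection follows. The names `struct byte_range`, `range_intersect()` and `sectors_to_bytes()` are hypothetical and not part of ndctl, and half-open ranges are used here to avoid the +1/-1 bookkeeping of inclusive end points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, start + len). Hypothetical helper type. */
struct byte_range {
    uint64_t start;
    uint64_t len;
};

/* Badblock offsets and lengths are in 512-byte sectors, hence the shift by 9. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors << 9;
}

/*
 * Intersect the I/O range with one badblock range. Returns true and fills
 * *out when the two overlap; returns false and leaves *out untouched
 * otherwise.
 */
bool range_intersect(struct byte_range io, struct byte_range bad,
                     struct byte_range *out)
{
    uint64_t io_end = io.start + io.len;     /* exclusive */
    uint64_t bad_end = bad.start + bad.len;  /* exclusive */
    uint64_t begin = io.start > bad.start ? io.start : bad.start;
    uint64_t end = io_end < bad_end ? io_end : bad_end;

    assert(out != NULL);
    if (begin >= end)
        return false;

    out->start = begin;
    out->len = end - begin;
    return true;
}
```

Only the overlapping part would then be handed to a clear-error request, so an I/O that touches half of a badblock range does not clear sectors it is not about to overwrite.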
3 years, 6 months