Re: Detecting NUMA per pmem
by Oren Berman
Hi Ross
Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.
We have tried to dump the SPA table and this is what we get:
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20160108-64
* Copyright (c) 2000 - 2016 Intel Corporation
*
* Disassembly of NFIT, Sun Oct 22 10:46:19 2017
*
* ACPI Data Table [NFIT]
*
* Format: [HexOffset DecimalOffset ByteLength] FieldName : FieldValue
*/
[000h 0000 4] Signature : "NFIT" [NVDIMM Firmware Interface Table]
[004h 0004 4] Table Length : 00000028
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : B2
[00Ah 0010 6] Oem ID : "SUPERM"
[010h 0016 8] Oem Table ID : "SMCI--MB"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : " "
[020h 0032 4] Asl Compiler Revision : 00000001
[024h 0036 4] Reserved : 00000000
Raw Table Data: Length 40 (0x28)
0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D // NFIT(.....SUPERM
0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00 // SMCI--MB........
0020: 01 00 00 00 00 00 00 00
As you can see, the memory region info is missing: the table is only 40
bytes long, i.e. just the header with no subtables.
This specific check was done on a Supermicro server.
We also performed a BIOS update, but the results were the same.
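As a cross-check that does not depend on iasl, the raw table can be walked directly. The following is a minimal Python sketch, assuming the field offsets shown in the iasl disassembly quoted later in this thread (subtable type and length at offsets 0 and 2, flags at 6, proximity domain at 12). The function name `parse_nfit` and the dictionary keys are illustrative only.

```python
import struct

NFIT_HEADER_LEN = 40   # 36-byte ACPI table header + 4 reserved bytes
SPA_TYPE = 0           # "System Physical Address Range" subtable
PROXIMITY_VALID = 0x2  # bit 1 of the SPA flags field


def parse_nfit(data):
    """Return (table_length, spa_ranges) for a raw NFIT blob."""
    signature, length = struct.unpack_from("<4sI", data, 0)
    if signature != b"NFIT":
        raise ValueError("not an NFIT table")
    end = min(length, len(data))
    spa_ranges = []
    offset = NFIT_HEADER_LEN
    while offset + 4 <= end:
        sub_type, sub_len = struct.unpack_from("<HH", data, offset)
        if sub_len < 4 or offset + sub_len > end:
            break  # malformed or truncated subtable; stop walking
        if sub_type == SPA_TYPE and sub_len >= 16:
            index, flags, _reserved, proximity = struct.unpack_from(
                "<HHII", data, offset + 4)
            spa_ranges.append({
                "range_index": index,
                "proximity_valid": bool(flags & PROXIMITY_VALID),
                "proximity_domain": proximity,
            })
        offset += sub_len
    return length, spa_ranges


# The 40 raw bytes from the dump above: header only, no subtables.
RAW = bytes.fromhex(
    "4E46495428000000" "01B253555045524D"
    "534D43492D2D4D42" "0100000001000000"
    "0100000000000000"
)
length, spa_ranges = parse_nfit(RAW)
print(length, spa_ranges)  # 40 []
```

On a live system the same function could be fed the contents of /sys/firmware/acpi/tables/NFIT (reading it normally requires root).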
As said before, the pmem devices are detected correctly, and we verified
that they correspond to different NUMA nodes using the PCM utility. However,
Linux still reports both pmem devices to be on the same NUMA node, node 0.
If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?
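For reference, the sysfs check mentioned above can be scripted. This is a minimal Python sketch, assuming the libnvdimm region devices expose a `numa_node` attribute under /sys/bus/nd/devices/regionX/ (the path is an assumption about the running kernel, and `region_numa_nodes` is an illustrative name).

```python
import glob
import os


def region_numa_nodes(sysfs_root="/sys/bus/nd/devices"):
    """Map each libnvdimm region name to the NUMA node the kernel reports."""
    nodes = {}
    pattern = os.path.join(sysfs_root, "region*", "numa_node")
    for path in sorted(glob.glob(pattern)):
        region = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            nodes[region] = int(f.read().strip())
    return nodes
```

Calling `region_numa_nodes()` on the affected machine would be expected to show node 0 for every region, matching what dev_to_node returns in the kernel. On a system without NVDIMM regions it returns an empty dict.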
I also ran dmidecode and the NVDIMMs are listed (we tested with
Netlist NVDIMMs). I can also see the bank locator showing P0 and P1, which I
think indicates the NUMA node. Here is an example:
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
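If the Bank Locator string really does encode the node, the mapping can be pulled out of the dmidecode text mechanically. The sketch below is a rough Python illustration under that assumption. The `Node<N>` naming is whatever this particular BIOS prints and is not a standardized format, and `bank_locator_nodes` is a made-up helper name.

```python
import re


def bank_locator_nodes(dmidecode_text):
    """Map each 'Locator' to the node number embedded in its 'Bank Locator'."""
    nodes = {}
    locator = None
    for line in dmidecode_text.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "Locator":
            locator = value
        elif key == "Bank Locator" and locator is not None:
            match = re.search(r"Node(\d+)", value)
            if match:
                nodes[locator] = int(match.group(1))
            locator = None
    return nodes


# Trimmed copy of the two records above.
SAMPLE = """\
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
"""
print(bank_locator_nodes(SAMPLE))  # {'P1-DIMMA3': 0, 'P2-DIMME3': 1}
```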
Did you encounter such a case? We would appreciate any insight you might
have.
BR
Oren Berman
On 20 October 2017 at 19:22, Ross Zwisler <ross.zwisler(a)linux.intel.com>
wrote:
> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> > Hi Ross
> > My name is Oren Berman and I am a senior developer at lightbitslabs.
> > We are working with NVDIMMs but we encountered a problem: the kernel
> > does not seem to detect the NUMA id per PMEM device.
> > It always reports NUMA 0 although we have NVDIMM devices on both nodes.
> > We checked that it always returns 0 from sysfs and also from retrieving
> > the device of pmem in the kernel and calling dev_to_node.
> > The result is always 0 for both pmem0 and pmem1.
> > In order to make sure that indeed both NUMA sockets are used we ran
> > Intel's pcm utility. We verified that writing to pmem0 increases socket 0
> > utilization and writing to pmem1 increases socket 1 utilization, so the hw
> > works properly.
> > Only the detection seems to be invalid.
> > Did you encounter such a problem?
> > We are using kernel version 4.9 - are you aware of any fix for this issue
> > or workaround that we can use?
> > Are we missing something?
> > Thanks for any help you can give us.
> > BR
> > Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
> # cd /tmp
> # cp /sys/firmware/acpi/tables/NFIT .
> # iasl NFIT
>
> Intel ACPI Component Architecture
> ASL+ Optimizing Compiler version 20160831-64
> Copyright (c) 2000 - 2016 Intel Corporation
>
> Binary file appears to be a valid ACPI table, disassembling
> Input file NFIT, Length 0xE0 (224) bytes
> ACPI: NFIT 0x0000000000000000 0000E0 (v01 BOCHS BXPCNFIT 00000001 BXPC 00000001)
> Acpi Data Table [NFIT] decoded
> Formatted output: NFIT.dsl - 5191 bytes
>
> This will give you an NFIT.dsl file which you can look at. Here is what my
> SPA table looks like for an emulated QEMU NVDIMM:
>
> [028h 0040 2] Subtable Type : 0000 [System Physical Address Range]
> [02Ah 0042 2] Length : 0038
>
> [02Ch 0044 2] Range Index : 0002
> [02Eh 0046 2] Flags (decoded below) : 0003
> Add/Online Operation Only : 1
> Proximity Domain Valid : 1
> [030h 0048 4] Reserved : 00000000
> [034h 0052 4] Proximity Domain : 00000000
> [038h 0056 16] Address Range GUID : 66F0D379-B4F3-4074-AC43-0D3318B78CDB
> [048h 0072 8] Address Range Base : 0000000240000000
> [050h 0080 8] Address Range Length : 0000000440000000
> [058h 0088 8] Memory Map Attribute : 0000000000008008
>
> So, the "Proximity Domain" field is 0, and this lets the system know which
> NUMA node to associate with this memory region.
>
> BTW, in the future it's best to CC our public list,
> linux-nvdimm(a)lists.01.org, as a) someone else might have the same
> question and b) someone else might know the answer.
>
> Thanks,
> - Ross
>
[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
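To illustrate the byte-order point above, here is a small Python sketch of how the same two stored bytes read differently as a native little-endian 16-bit integer versus as an array of bytes. The sample bytes are invented for the example and `id_field_as_shown` is a hypothetical helper, not part of the patch.

```python
import struct


def id_field_as_shown(raw):
    """Format a 2-byte ID kept as an array of bytes, first byte first."""
    return "%02x%02x" % (raw[0], raw[1])


raw_vendor = bytes([0x80, 0x2C])              # made-up 2-byte ID as stored in the table
as_le16 = struct.unpack("<H", raw_vendor)[0]  # value a little-endian u16 read yields
as_be16 = struct.unpack(">H", raw_vendor)[0]  # value when treated as a byte array
print(hex(as_le16), hex(as_be16), id_field_as_shown(raw_vendor))  # 0x2c80 0x802c 802c
```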
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
Re: [dm-devel] [PATCH] dm-writecache
by Christoph Hellwig
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/drivers/md/dm-writecache.c 2018-03-08 14:23:31.059999000 +0100
> @@ -0,0 +1,2417 @@
> +#include <linux/device-mapper.h>
Missing copyright statement, or one of those new-fashioned SPDX statements.
> +#define WRITEBACK_FUA true
no business having this around.
> +#ifndef bio_set_dev
> +#define bio_set_dev(bio, dev) ((bio)->bi_bdev = (dev))
> +#endif
> +#ifndef timer_setup
> +#define timer_setup(t, c, f) setup_timer(t, c, (unsigned long)(t))
> +#endif
no business in mainline.
> +/*
> + * On X86, non-temporal stores are more efficient than cache flushing.
> + * On ARM64, cache flushing is more efficient.
> + */
> +#if defined(CONFIG_X86_64)
> +#define NT_STORE(dest, src) \
> +do { \
> + typeof(src) val = (src); \
> + memcpy_flushcache(&(dest), &val, sizeof(src)); \
> +} while (0)
> +#define COMMIT_FLUSHED() wmb()
> +#else
> +#define NT_STORE(dest, src) WRITE_ONCE(dest, src)
> +#define FLUSH_RANGE dax_flush
> +#define COMMIT_FLUSHED() do { } while (0)
> +#endif
Please use proper APIs for this, this has no business in a driver.
And that's it for now. This is clearly not submission ready, and I
should get back to my backlog of other things.
[PATCH v2] acpi: nfit: document sysfs interface
by Aishwarya Pant
This is an attempt to document the nfit sysfs interface. The
descriptions have been collected from git commit logs and the ACPI
specification 6.2.
Signed-off-by: Aishwarya Pant <aishpant(a)gmail.com>
---
Changes in v2:
- Add descriptions for range_index and ecc_unit_size
- Edit various descriptions as suggested
Documentation/ABI/testing/sysfs-bus-nfit | 233 +++++++++++++++++++++++++++++++
1 file changed, 233 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-nfit
diff --git a/Documentation/ABI/testing/sysfs-bus-nfit b/Documentation/ABI/testing/sysfs-bus-nfit
new file mode 100644
index 000000000000..619eb8ca0f99
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-nfit
@@ -0,0 +1,233 @@
+For all of the nmem device attributes under nfit/*, see the 'NVDIMM Firmware
+Interface Table (NFIT)' section in the ACPI specification
+(http://www.uefi.org/specifications) for more details.
+
+What: /sys/bus/nd/devices/nmemX/nfit/serial
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Serial number of the NVDIMM (non-volatile dual in-line
+ memory module), assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/handle
+Date: Apr, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The address (given by the _ADR object) of the device on its
+ parent bus of the NVDIMM device containing the NVDIMM region.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/device
+Date: Apr, 2015
+KernelVersion: v4.1
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Device id for the NVDIMM, assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/rev_id
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Revision of the NVDIMM, assigned by the module vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/phys_id
+Date: Apr, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Handle (i.e., instance number) for the SMBIOS (system
+ management BIOS) Memory Device structure describing the NVDIMM
+ containing the NVDIMM region.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/flags
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The flags in the NFIT memory device sub-structure indicate
+ the state of the data on the nvdimm relative to its energy
+ source or last "flush to persistence".
+
+ The attribute is a translation of the 'NVDIMM State Flags' field
+ in section 5.2.25.3 'NVDIMM Region Mapping' Structure of the
+ ACPI specification 6.2.
+
+ The health states are "save_fail", "restore_fail", "flush_fail",
+ "not_armed", "smart_event", "map_fail" and "smart_notify".
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/format
+What: /sys/bus/nd/devices/nmemX/nfit/format1
+What: /sys/bus/nd/devices/nmemX/nfit/formats
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The interface codes indicate support for persistent memory
+ mapped directly into system physical address space and / or a
+ block aperture access mechanism to the NVDIMM media.
+ The 'formats' attribute displays the number of supported
+ interfaces.
+
+ This layout is compatible with existing libndctl binaries that
+ only expect one code per-dimm as they will ignore
+ nmemX/nfit/formats and nmemX/nfit/formatN.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/vendor
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Vendor id of the NVDIMM.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/dsm_mask
+Date: May, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The bitmask indicates the supported device specific control
+ functions relative to the NVDIMM command family supported by the
+ device
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/family
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Displays the NVDIMM family command sets. Values
+ 0, 1, 2 and 3 correspond to NVDIMM_FAMILY_INTEL,
+ NVDIMM_FAMILY_HPE1, NVDIMM_FAMILY_HPE2 and NVDIMM_FAMILY_MSFT
+ respectively.
+
+ See the specifications for these command families here:
+ http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
+ https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/
+ https://msdn.microsoft.com/library/windows/hardware/mt604741
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/id
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) ACPI specification 6.2, section 5.2.25.9, defines an
+ identifier for an NVDIMM, which the id attribute reflects.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_vendor
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system vendor id of the NVDIMM non-volatile memory
+ subsystem controller.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_rev_id
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system revision id of the NVDIMM non-volatile memory subsystem
+ controller, assigned by the non-volatile memory subsystem
+ controller vendor.
+
+
+What: /sys/bus/nd/devices/nmemX/nfit/subsystem_device
+Date: Apr, 2016
+KernelVersion: v4.7
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Sub-system device id for the NVDIMM non-volatile memory
+ subsystem controller, assigned by the non-volatile memory
+ subsystem controller vendor.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/revision
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) ACPI NFIT table revision number.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/scrub
+Date: Sep, 2016
+KernelVersion: v4.9
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RW) This shows the number of full Address Range Scrubs (ARS)
+ that have been completed since driver load time. Userspace can
+ wait on this using select/poll etc. A '+' at the end indicates
+ an ARS is in progress
+
+ Writing a value of 1 triggers an ARS scan.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/hw_error_scrub
+Date: Sep, 2016
+KernelVersion: v4.9
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RW) Provides a way to toggle the behavior between just adding
+ the address (cache line) where the MCE happened to the poison
+ list and doing a full scrub. The former (selective insertion of
+ the address) is done unconditionally.
+
+ This attribute can have the following values written to it:
+
+ '0': Switch to the default mode where an exception will only
+ insert the address of the memory error into the poison and
+ badblocks lists.
+ '1': Enable a full scrub to happen if an exception for a memory
+ error is received.
+
+
+What: /sys/bus/nd/devices/ndbusX/nfit/dsm_mask
+Date: Jun, 2017
+KernelVersion: v4.13
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) The bitmask indicates the supported bus specific control
+ functions. See the section named 'NVDIMM Root Device _DSMs' in
+ the ACPI specification.
+
+
+What: /sys/bus/nd/devices/regionX/nfit/range_index
+Date: Jun, 2015
+KernelVersion: v4.2
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) A unique number provided by the BIOS to identify an address
+ range. Used by NVDIMM Region Mapping Structure to uniquely refer
+ to this structure. Value of 0 is reserved and not used as an
+ index.
+
+
+What: /sys/bus/nd/devices/regionX/nfit/ecc_unit_size
+Date: Aug, 2017
+KernelVersion: v4.14
+Contact: linux-nvdimm(a)lists.01.org
+Description:
+ (RO) Size of a write request to a DIMM that will not incur a
+ read-modify-write cycle at the memory controller.
+
+ When the nfit driver initializes it runs an ARS (Address Range
+ Scrub) operation across every pmem range. Part of that process
+ involves determining the ARS capabilities of a given address
+ range. One of the capabilities that is reported is the 'Clear
+ Uncorrectable Error Range Length Unit Size' (see: ACPI 6.2
+ section 9.20.7.4 Function Index 1 - Query ARS Capabilities).
+ This property indicates the boundary at which the NVDIMM may
+ need to perform read-modify-write cycles to maintain ECC (Error
+ Correcting Code) blocks.
--
2.16.2
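As a usage illustration for two of the attributes documented in this patch, the following Python sketch decodes values read from ndbusX/nfit/scrub and nmemX/nfit/family. It relies only on the descriptions in the patch text (a trailing '+' means an ARS is in progress, and family values 0 to 3 map to the four named command families). The helper names are made up for the example.

```python
# Family numbers as listed in the nmemX/nfit/family description.
NFIT_FAMILIES = {
    0: "NVDIMM_FAMILY_INTEL",
    1: "NVDIMM_FAMILY_HPE1",
    2: "NVDIMM_FAMILY_HPE2",
    3: "NVDIMM_FAMILY_MSFT",
}


def parse_scrub(value):
    """Split an nfit 'scrub' reading into (completed_count, in_progress)."""
    text = value.strip()
    in_progress = text.endswith("+")
    return int(text.rstrip("+")), in_progress


def family_name(value):
    """Translate an nfit 'family' reading into its symbolic name."""
    return NFIT_FAMILIES.get(int(value.strip()), "unknown")


print(parse_scrub("3+\n"))  # (3, True)
print(parse_scrub("7\n"))   # (7, False)
print(family_name("1\n"))   # NVDIMM_FAMILY_HPE1
```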
[RFC v2 00/83] NOVA: a new file system for persistent memory
by Andiry Xu
From: Andiry Xu <jix024(a)cs.ucsd.edu>
This is the second version of the RFC patch series that implements
NOVA (NOn-Volatile memory Accelerated file system), a new file system built for PMEM.
NOVA's goal is to provide a high performance, production-ready
file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
and Intel's soon-to-be-released 3DXpoint DIMMs).
NOVA was developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024(a)cs.ucsd.edu>, Lu Zhang
<luzh(a)eng.ucsd.edu>, and Steven Swanson <swanson(a)eng.ucsd.edu>.
NOVA is stable enough to run complex applications, but there is substantial
work left to do. This RFC is intended to gather feedback to guide its
development toward eventual inclusion upstream.
The patches are based on Linux 4.16-rc4.
Changes from v1:
* Remove snapshot, metadata replication and data parity for future submission.
This significantly reduces complexity and LOC: 22129 -> 13834.
* Break down the code in a more reviewer-friendly way:
The patchset starts with a simple skeleton and adds more features gradually.
Each patch leaves the tree in a compilable and working state,
and is self-contained and small, so easier to review.
* Fix bugs so that NOVA passes xfstests: https://github.com/NVSL/xfstests
Overview
========
NOVA is primarily a log-structured file system, but rather than maintain a
single global log for the entire file system, it maintains separate logs for
each inode. NOVA breaks the logs into 4KB pages, which need not be
contiguous in memory. The logs only contain metadata.
File data pages reside outside the log, and log entries for write operations
point to data pages they modify. File modifications can be done either in place
or by copy-on-write (COW) to provide atomic file updates.
For file operations that involve multiple inodes, NOVA uses small, fixed-sized
redo logs to atomically append log entries to the logs of the inodes involved.
This structure keeps logs small and makes garbage collection very fast. It also
enables enormous parallelism during recovery from an unclean unmount, since
threads can scan logs in parallel.
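The layout described above can be pictured with a toy model. The Python sketch below is purely illustrative and does not reflect NOVA's actual data structures or on-media format: every class and method name is invented, persistence and ordering (flushes, fences) are ignored, and the volatile lookup table is an assumption of the toy. It only shows the idea of a per-inode, metadata-only log whose entries point at data pages stored elsewhere, with a copy-on-write update.

```python
PAGE_SIZE = 4096  # the toy uses 4KB data pages


class ToyInode:
    def __init__(self):
        self.log = []    # append-only metadata: (file_page_index, data_page_id)
        self.index = {}  # volatile lookup table: file_page_index -> data_page_id


class ToyFS:
    def __init__(self):
        self.data_pages = {}  # data_page_id -> bytes, stored outside any log
        self.next_id = 0

    def cow_write(self, inode, page_index, offset, payload):
        """Copy-on-write update of part of one file page."""
        if offset < 0 or offset + len(payload) > PAGE_SIZE:
            raise ValueError("write must fit inside one page")
        old_id = inode.index.get(page_index)
        old = self.data_pages.get(old_id, bytes(PAGE_SIZE))
        new_id = self.next_id
        self.next_id += 1
        # 1. Write a complete new data page; the old page stays untouched.
        self.data_pages[new_id] = old[:offset] + payload + old[offset + len(payload):]
        # 2. Appending the log entry is the step that makes the update visible.
        inode.log.append((page_index, new_id))
        inode.index[page_index] = new_id
        # 3. The superseded page can now be reclaimed.
        if old_id is not None:
            del self.data_pages[old_id]

    def rebuild_index(self, inode):
        """Recovery: replay one inode's log; later entries win."""
        inode.index = {}
        for page_index, page_id in inode.log:
            inode.index[page_index] = page_id

    def read(self, inode, page_index):
        return self.data_pages[inode.index[page_index]]


fs, inode = ToyFS(), ToyInode()
fs.cow_write(inode, 0, 0, b"abc")
fs.cow_write(inode, 0, 1, b"Z")
print(fs.read(inode, 0)[:3], len(inode.log), len(fs.data_pages))  # b'aZc' 2 1
```

Because each inode carries its own log, `rebuild_index` for different inodes touches disjoint state, which is the property that lets recovery threads scan logs in parallel.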
Documentation/filesystems/nova.txt contains some lower-level implementation and
usage information. A more thorough discussion of NOVA's goals and design is
available in two papers:
NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf
Jian Xu and Steven Swanson
Published in FAST 2016
NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah,
Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson
Published in SOSP 2017
This version contains features from the FAST paper. We leave NOVA-Fortis
features for future work.
Build and Run
=============
To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual.
NOVA runs on a pmem non-volatile memory region created by memmap kernel option.
For instance, adding 'memmap=16G!8G' to the kernel boot parameters will reserve
16GB memory starting from address 8GB, and the kernel will create a pmem0
block device under the /dev directory.
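To make the `memmap=<size>!<start>` arithmetic concrete, here is a small Python sketch that turns the option into a physical address range. It is a simplification that handles only decimal numbers with K/M/G suffixes (the kernel's own parser may accept more forms), and the function names are illustrative.

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert strings such as '16G' or '512M' to a byte count."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if not match:
        raise ValueError("unsupported size: %r" % text)
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def parse_pmem_memmap(option):
    """Parse 'memmap=<size>!<start>' into (start, size) in bytes."""
    prefix, _, spec = option.partition("=")
    if prefix != "memmap" or "!" not in spec:
        raise ValueError("not a pmem memmap option: %r" % option)
    size_text, start_text = spec.split("!", 1)
    return parse_size(start_text), parse_size(size_text)


start, size = parse_pmem_memmap("memmap=16G!8G")
print(hex(start), hex(start + size))  # 0x200000000 0x600000000
```

So the example option reserves the 16GB range from 8GB up to 24GB of physical address space.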
After the OS has booted, initialize a NOVA instance with the following commands:
# modprobe nova
# mount -t NOVA -o init /dev/pmem0 /mnt/nova
The above commands create a NOVA instance on /dev/pmem0 and mount it on
/mnt/nova. Currently NOVA does not have mkfs or fsck support.
Performance
===========
Compared to other DAX file systems such as ext4-DAX and xfs-DAX,
NOVA provides fine-grained, byte-granularity metadata operations,
and it performs better in metadata-intensive and write-intensive applications.
NOVA also excels in the append-fsync access pattern, i.e. write-ahead logging,
which is very common in DBMS and key-value stores.
The following test is performed on Intel i7-3770K with 16GB DRAM
and 8GB PMEM emulated with DRAM. The kernel is 4.16-rc4 64bit on Ubuntu 16.04.
Performance may vary on different platforms.
Filebench throughput (ops/s):
             xfs-DAX   ext4-DAX   NOVA
Fileserver   86971     177826     334166
Varmail      148032    288033     999794
Webserver    370245    370144     374130
Webproxy     315084    737544     927216
Webserver is read-intensive and all the file systems have similar performance.
SQLite test:
SQLite has four journaling modes:
Delete: delete the undo log file after transaction commit
Truncate: truncate the undo log file to zero after transaction commit
Persist: write a flag at the beginning of the log file after transaction commit
WAL: write-ahead logging
SQLite insert (transactions/s):
             xfs-DAX   ext4-DAX   NOVA
Delete       18525     23615      45289
Truncate     21930     26391      52046
Persist      58053     56106      50554
WAL          38622     62703      85395
NOVA performs poorly in Persist mode because it does copy-on-write for writes,
and writes a full 4KB page even for sub-page writes.
Redis: fsync the WAL file after every set.
Redis set throughput (trans/s):
             xfs-DAX   ext4-DAX   NOVA
             49771     88308      102560
RocksDB fillunique test (ops/s):
             xfs-DAX   ext4-DAX   NOVA
WAL sync     33563     62066      295655
WAL nosync   254533    288106     393713
Both ext4-DAX and xfs-DAX suffer from high fsync overhead.
More test results are available in the two NOVA papers.
NOVA uses per-inode logging, per-CPU inode table and journal to avoid lock contention.
We use the FxMark test suite (https://github.com/sslab-gatech/fxmark)
to test the filesystem scalability. The result is at
http://cseweb.ucsd.edu/~jix024/sc.pdf
Thanks,
Andiry
---
Andiry Xu (83):
Introduction and documentation of NOVA filesystem.
Add nova_def.h.
Add super.h.
NOVA inode definition.
Add NOVA filesystem definitions and useful helper routines.
Add inode get/read methods.
Initialize inode_info and rebuild inode information in nova_iget().
NOVA superblock operations.
Add Kconfig and Makefile
Add superblock integrity check.
Add timing and I/O statistics for performance analysis and profiling.
Add timing for mount and init.
Add remount_fs and show_options methods.
Add range node kmem cache.
Add free list data structure.
Initialize block map and free lists in nova_init().
Add statfs support.
Add freelist statistics printing.
Add pmem block free routines.
Pmem block allocation routines.
Add log structure.
Inode log pages allocation and reclaimation.
Save allocator to pmem in put_super.
Initialize and allocate inode table.
Support get normal inode address and inode table extentsion.
Add inode_map to track inuse inodes.
Save the inode inuse list to pmem upon umount
Add NOVA address space operations
Add write_inode and dirty_inode routines.
New NOVA inode allocation.
Add new vfs inode allocation.
Add log entry definitions.
Inode log and entry printing for debug purpose.
Journal: NOVA light weight journal definitions.
Journal: Lite journal helper routines.
Journal: Lite journal recovery.
Journal: Lite journal create and commit.
Journal: NOVA lite journal initialization.
Log operation: dentry append.
Log operation: file write entry append.
Log operation: setattr entry append
Log operation: link change append.
Log operation: in-place update log entry
Log operation: invalidate log entries
Log operation: file inode log lookup and assign
Dir: Add Directory radix tree insert/remove methods.
Dir: Add initial dentries when initializing a directory inode log.
Dir: Readdir operation.
Dir: Append create/remove dentry.
Inode: Add nova_evict_inode.
Rebuild: directory inode.
Rebuild: file inode.
Namei: lookup.
Namei: create and mknod.
Namei: mkdir
Namei: link and unlink.
Namei: rmdir
Namei: rename
Namei: setattr
Add special inode operations.
Super: Add nova_export_ops.
File: getattr and file inode operations
File operation: llseek.
File operation: open, fsync, flush.
File operation: read.
Super: Add file write item cache.
Dax: commit list of file write items to log.
File operation: copy-on-write write.
Super: Add module param inplace_data_updates.
File operation: Inplace write.
Symlink support.
File operation: fallocate.
Dax: Add iomap operations.
File operation: Mmap.
File operation: read/write iter.
Ioctl support.
GC: Fast garbage collection.
GC: Thorough garbage collection.
Normal recovery.
Failure recovery: bitmap operations.
Failure recovery: Inode pages recovery routines.
Failure recovery: Per-CPU recovery.
Sysfs support.
Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/nova.txt | 498 +++++++++++++
MAINTAINERS | 8 +
fs/Kconfig | 2 +
fs/Makefile | 1 +
fs/nova/Kconfig | 15 +
fs/nova/Makefile | 8 +
fs/nova/balloc.c | 730 ++++++++++++++++++
fs/nova/balloc.h | 96 +++
fs/nova/bbuild.c | 1437 ++++++++++++++++++++++++++++++++++++
fs/nova/bbuild.h | 28 +
fs/nova/dax.c | 970 ++++++++++++++++++++++++
fs/nova/dir.c | 520 +++++++++++++
fs/nova/file.c | 728 ++++++++++++++++++
fs/nova/gc.c | 459 ++++++++++++
fs/nova/inode.c | 1310 ++++++++++++++++++++++++++++++++
fs/nova/inode.h | 277 +++++++
fs/nova/ioctl.c | 184 +++++
fs/nova/journal.c | 412 +++++++++++
fs/nova/journal.h | 56 ++
fs/nova/log.c | 1111 ++++++++++++++++++++++++++++
fs/nova/log.h | 417 +++++++++++
fs/nova/namei.c | 848 +++++++++++++++++++++
fs/nova/nova.h | 566 ++++++++++++++
fs/nova/nova_def.h | 128 ++++
fs/nova/rebuild.c | 499 +++++++++++++
fs/nova/stats.c | 600 +++++++++++++++
fs/nova/stats.h | 178 +++++
fs/nova/super.c | 1063 ++++++++++++++++++++++++++
fs/nova/super.h | 171 +++++
fs/nova/symlink.c | 133 ++++
fs/nova/sysfs.c | 379 ++++++++++
32 files changed, 13834 insertions(+)
create mode 100644 Documentation/filesystems/nova.txt
create mode 100644 fs/nova/Kconfig
create mode 100644 fs/nova/Makefile
create mode 100644 fs/nova/balloc.c
create mode 100644 fs/nova/balloc.h
create mode 100644 fs/nova/bbuild.c
create mode 100644 fs/nova/bbuild.h
create mode 100644 fs/nova/dax.c
create mode 100644 fs/nova/dir.c
create mode 100644 fs/nova/file.c
create mode 100644 fs/nova/gc.c
create mode 100644 fs/nova/inode.c
create mode 100644 fs/nova/inode.h
create mode 100644 fs/nova/ioctl.c
create mode 100644 fs/nova/journal.c
create mode 100644 fs/nova/journal.h
create mode 100644 fs/nova/log.c
create mode 100644 fs/nova/log.h
create mode 100644 fs/nova/namei.c
create mode 100644 fs/nova/nova.h
create mode 100644 fs/nova/nova_def.h
create mode 100644 fs/nova/rebuild.c
create mode 100644 fs/nova/stats.c
create mode 100644 fs/nova/stats.h
create mode 100644 fs/nova/super.c
create mode 100644 fs/nova/super.h
create mode 100644 fs/nova/symlink.c
create mode 100644 fs/nova/sysfs.c
--
2.7.4
[PATCH v8 00/18] dax: fix dma vs truncate/hole-punch
by Dan Williams
Changes since v7 [1]:
* Introduce noop_direct_IO() and use it to clean up xfs_dax_aops,
ext4_dax_aops, and ext2_dax_aops (Jan, Christoph)
* Clarify dax_associate_entry() vs zero-page and empty entries with
for_each_mapped_pfn() and a comment (Jan)
* Collect reviewed-by's from Jan and Darrick
* Fix an ARCH=UML build failure that made me realize that the patch to
enable filesystems to trigger ->page_free() callbacks was incomplete
with respect to the device-mapper dax enabling.
The investigation of adding support for device-mapper and
DEV_PAGEMAP_OPS resulted in a wider rework that includes 1) picking up
the CONFIG_DAX_DRIVER patches that missed the 4.16 merge window, and 2)
refactoring the build implementation to allow FS_DAX_LIMITED in the s390
case with the dcssblk driver, and full blown FS_DAX + DEV_PAGEMAP_OPS
for everyone else with the pmem driver.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014921.html
---
Background:
get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated the system would
have likely hung from the uncompleted I/O.
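The retry loop described above can be pictured with a toy model. The Python sketch below is not kernel code and does not use the real dax_layout_busy_page() interface: all names are invented, locking is omitted, and "waiting for dma" is simulated by clearing a pin count. It only shows the control flow of invalidating mappings, looking for a pinned page, waiting, and repeating until nothing is busy.

```python
class ToyDaxFile:
    """Toy stand-in for a dax file: page number -> outstanding DMA pin count."""

    def __init__(self, pins):
        self.pins = dict(pins)
        self.mappings_valid = True

    def layout_busy_page(self):
        """Invalidate mappings, then return one still-pinned page or None."""
        # With mappings gone, any new pin attempt has to fault and block.
        self.mappings_valid = False
        for page, count in sorted(self.pins.items()):
            if count > 0:
                return page
        return None

    def wait_for_dma(self, page):
        """Pretend the transient DMA on `page` completes."""
        self.pins[page] = 0


def break_layouts(dax_file):
    """Loop until no page is busy; only then would truncate be safe."""
    waited = []
    while True:
        page = dax_file.layout_busy_page()
        if page is None:
            return waited
        waited.append(page)
        dax_file.wait_for_dma(page)


f = ToyDaxFile({0: 0, 1: 2, 2: 1})
print(break_layouts(f))  # [1, 2]
```

If a pin never went away (the non-transient case), `break_layouts` would spin forever, which mirrors why long-lived registrations such as RDMA were disabled for filesystem-dax.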
---
Dan Williams (18):
dax: store pfns in the radix
fs, dax: prepare for dax-specific address_space_operations
block, dax: remove dead code in blkdev_writepages()
xfs, dax: introduce xfs_dax_aops
ext4, dax: introduce ext4_dax_aops
ext2, dax: introduce ext2_dax_aops
fs, dax: use page->mapping to warn if truncate collides with a busy page
dax: introduce CONFIG_DAX_DRIVER
dax, dm: allow device-mapper to operate without dax support
dax, dm: introduce ->fs_{claim,release}() dax_device infrastructure
mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
memremap: split devm_memremap_pages() and memremap() infrastructure
mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
mm, fs, dax: handle layout changes to pinned dax mappings
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
xfs: prepare xfs_break_layouts() for another layout type
xfs, dax: introduce xfs_break_dax_layouts()
drivers/dax/Kconfig | 5 +
drivers/dax/super.c | 118 +++++++++++++++++++---
drivers/md/Kconfig | 1
drivers/md/dm-linear.c | 6 +
drivers/md/dm-log-writes.c | 95 +++++++++---------
drivers/md/dm-stripe.c | 6 +
drivers/md/dm.c | 66 +++++++++++-
drivers/nvdimm/Kconfig | 2
drivers/nvdimm/pmem.c | 3 -
drivers/s390/block/Kconfig | 2
fs/Kconfig | 1
fs/block_dev.c | 5 -
fs/dax.c | 238 ++++++++++++++++++++++++++++++++++----------
fs/ext2/ext2.h | 1
fs/ext2/inode.c | 46 +++++----
fs/ext2/namei.c | 18 ---
fs/ext2/super.c | 6 +
fs/ext4/inode.c | 42 ++++++--
fs/ext4/super.c | 6 +
fs/libfs.c | 39 +++++++
fs/xfs/xfs_aops.c | 34 +++---
fs/xfs/xfs_aops.h | 1
fs/xfs/xfs_file.c | 73 ++++++++++++-
fs/xfs/xfs_inode.h | 16 +++
fs/xfs/xfs_ioctl.c | 8 -
fs/xfs/xfs_iops.c | 21 +++-
fs/xfs/xfs_pnfs.c | 16 ++-
fs/xfs/xfs_pnfs.h | 6 +
fs/xfs/xfs_super.c | 20 ++--
include/linux/dax.h | 115 ++++++++++++++++++---
include/linux/fs.h | 4 +
include/linux/memremap.h | 25 +----
include/linux/mm.h | 71 ++++++++++---
kernel/Makefile | 3 -
kernel/iomem.c | 167 +++++++++++++++++++++++++++++++
kernel/memremap.c | 210 +++++----------------------------------
mm/Kconfig | 5 +
mm/gup.c | 5 +
mm/hmm.c | 13 --
mm/swap.c | 3 -
40 files changed, 1047 insertions(+), 475 deletions(-)
create mode 100644 kernel/iomem.c
2 years, 11 months
[PATCH v3 00/11] Copy Offload in NVMe Fabrics with P2P PCI Memory
by Logan Gunthorpe
Hi Everyone,
Here's v3 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc5.
Thanks,
Logan
Changes in v3:
* Many more fixes and minor cleanups that were spotted by Bjorn
* Additional explanation of the ACS change in both the commit message
and Kconfig doc. Also, the code that disables the ACS bits is surrounded
explicitly by an #ifdef
* Removed the flag we added to rdma_rw_ctx() in favour of using
is_pci_p2pdma_page(), as suggested by Sagi.
* Adjust pci_p2pmem_find() so that it prefers P2P providers that
are closest to (or the same as) the clients using them. In cases
of ties, the provider is randomly chosen.
* Modify the NVMe Target code so that the PCI device name of the provider
may be explicitly specified, bypassing the logic in pci_p2pmem_find().
(Note: it's still enforced that the provider must be behind the
same switch as the clients).
* As requested by Bjorn, added documentation for driver writers.
Changes in v2:
* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
as a bunch of cleanup and spelling fixes he pointed out in the last
series.
* To address Alex's ACS concerns, we change to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA.
* We also reject using devices that employ 'dma_virt_ops' which should
fairly simply handle Jason's concerns that this work might break with
the HFI, QIB and rxe drivers that use the virtual ops to implement
their own special DMA operations.
--
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). However, the trade-off is
currently a reduction in overall throughput (largely due to hardware
issues that should improve in the future).
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We chose to go this route based on feedback we
received at the last LSF). Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a white list of known good
root complexes.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.
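The v3 changelog above says that pci_p2pmem_find() prefers the provider closest to the clients and picks randomly among ties. The following is only a minimal userspace sketch of that selection rule, not the kernel code: struct provider, its distance field, pick_provider() and the small LCG used as a random source are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a P2P memory provider; not a kernel type. */
struct provider {
	const char *name;
	int distance;	/* assumed total distance to all clients; < 0 = unusable */
};

/* Tiny LCG so the sketch needs no global state; illustration only. */
static unsigned int rand_below(unsigned int *state, unsigned int n)
{
	*state = *state * 1103515245u + 12345u;
	return (*state >> 16) % n;
}

/*
 * Return the usable provider with the smallest distance, or NULL if none
 * is usable. Ties are broken by reservoir sampling: the k-th tied
 * candidate replaces the current choice with probability 1/k.
 */
static const struct provider *
pick_provider(const struct provider *p, size_t n, unsigned int *seed)
{
	const struct provider *best = NULL;
	int best_dist = INT_MAX;
	unsigned int ties = 0;

	assert(seed != NULL);
	for (size_t i = 0; i < n; i++) {
		if (p[i].distance < 0)
			continue;	/* not usable by every client */
		if (p[i].distance < best_dist) {
			best = &p[i];
			best_dist = p[i].distance;
			ties = 1;
		} else if (p[i].distance == best_dist) {
			ties++;
			if (rand_below(seed, ties) == 0)
				best = &p[i];
		}
	}
	return best;
}
```

Reservoir sampling keeps the choice among tied providers uniform without first counting how many ties there are, so the list is walked only once.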
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to use the PCI P2P mapping functions when
is_pci_p2pdma_page() reports P2P memory (earlier versions of the series
passed a flags argument for this).
Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
Logan Gunthorpe (11):
PCI/P2PDMA: Support peer-to-peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
PCI/P2PDMA: Add P2P DMA driver writer's documentation
block: Introduce PCI P2P flags for request and request queue
IB/core: Ensure we map P2P memory correctly in
rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 +
Documentation/PCI/index.rst | 14 +
Documentation/PCI/p2pdma.rst | 164 +++++++
Documentation/index.rst | 3 +-
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 13 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 118 +++--
drivers/nvme/target/configfs.c | 67 +++
drivers/nvme/target/core.c | 106 +++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 12 +
drivers/nvme/target/rdma.c | 32 +-
drivers/pci/Kconfig | 25 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 828 ++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 6 +
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 19 +
include/linux/pci-p2pdma.h | 119 +++++
include/linux/pci.h | 4 +
23 files changed, 1543 insertions(+), 52 deletions(-)
create mode 100644 Documentation/PCI/index.rst
create mode 100644 Documentation/PCI/p2pdma.rst
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
--
2.11.0
2 years, 12 months
[PATCH 0/18 v6] dax, ext4, xfs: Synchronous page faults
by Jan Kara
Hello,
here is the sixth version of my patches to implement synchronous page faults
for DAX mappings to make flushing of DAX mappings possible from userspace so
that they can be flushed on finer than page granularity and also avoid the
overhead of a syscall.
I think we are ready to get this merged - I've talked to Dan and he said he
could take the patches through his tree. It would just be nice to get final
ack from Christoph for the first patch implementing MAP_VALIDATE and someone
from XFS folks to check patch 17 (make xfs_filemap_pfn_mkwrite use
__xfs_filemap_fault()).
---
We use a new mmap flag MAP_SYNC to indicate that page faults for the mapping
should be synchronous. The guarantee provided by this flag is: While a block
is writeably mapped into page tables of this mapping, it is guaranteed to be
visible in the file at that offset also after a crash.
How I implement this is that ->iomap_begin() indicates by a flag that inode
block mapping metadata is unstable and may need flushing (use the same test as
whether fdatasync() has metadata to write). If yes, DAX fault handler refrains
from inserting / write-enabling the page table entry and returns special flag
VM_FAULT_NEEDDSYNC together with a PFN to map to the filesystem fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls DAX code to update the page table entry appropriately.
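From userspace, the flag described above is requested at mmap() time. The following is a minimal sketch, assuming the flag names and values found in the Linux uapi headers for x86 and asm-generic (MAP_SHARED_VALIDATE, MAP_SYNC); map_path_sync() is a hypothetical helper, and the fallback to a plain MAP_SHARED mapping is a choice made for this example, not something the series prescribes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback definitions for older libc headers (x86 / asm-generic values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Create or resize 'path' to 'len' bytes and map it read/write.
 * MAP_SYNC is tried first; if the kernel or filesystem rejects it (for
 * example on a non-DAX file), fall back to an ordinary shared mapping.
 * '*synced' is set to 1 only when the MAP_SYNC mapping succeeded.
 * Returns MAP_FAILED on error.
 */
static void *map_path_sync(const char *path, size_t len, int *synced)
{
	void *p = MAP_FAILED;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	assert(synced != NULL);
	*synced = 0;
	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, (off_t)len) == 0) {
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
		if (p != MAP_FAILED)
			*synced = 1;
		else
			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
	}
	close(fd);	/* the mapping remains valid after close() */
	return p;
}
```

When *synced is 0 the caller still has to use msync() or fsync() to make writes durable; only a successful MAP_SYNC mapping carries the guarantee quoted above.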
I did some basic performance testing on the patches over ramdisk - timed
latency of page faults when faulting 512 pages. I did several tests: with file
preallocated / with file empty, with background file copying going on / without
it, with / without MAP_SYNC (so that we get comparison). The results are
(numbers are in microseconds):
File preallocated, no background load, no MAP_SYNC:
min=9 avg=10 max=46
8 - 15 us: 508
16 - 31 us: 3
32 - 63 us: 1
File preallocated, no background load, MAP_SYNC:
min=9 avg=10 max=47
8 - 15 us: 508
16 - 31 us: 2
32 - 63 us: 2
File empty, no background load, no MAP_SYNC:
min=21 avg=22 max=70
16 - 31 us: 506
32 - 63 us: 5
64 - 127 us: 1
File empty, no background load, MAP_SYNC:
min=40 avg=124 max=242
32 - 63 us: 1
64 - 127 us: 333
128 - 255 us: 178
File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1
File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23
So here we can see the difference between MAP_SYNC vs non MAP_SYNC is about
100-200 us when we need to wait for transaction commit in this setup.
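The result tables above group latencies into power-of-two buckets (8 - 15 us, 16 - 31 us, and so on). The measurement tool is not part of this posting, so the following is only a sketch of how such min/avg/max figures and buckets could be accumulated; bucket_of(), struct lat_stats and lat_add() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 32	/* enough for any 32-bit microsecond value */

/*
 * Bucket b covers [2^b, 2^(b+1) - 1] microseconds, so 8..15 us is
 * bucket 3 and 128..255 us is bucket 7. Values 0 and 1 share bucket 0.
 */
static unsigned int bucket_of(unsigned int us)
{
	unsigned int b = 0;

	while (us >>= 1)
		b++;
	return b;
}

struct lat_stats {
	unsigned int min, max;
	unsigned long long sum;
	size_t n;
	size_t hist[NBUCKETS];
};

/* Record one page-fault latency sample, given in microseconds. */
static void lat_add(struct lat_stats *s, unsigned int us)
{
	assert(s != NULL);
	if (s->n == 0 || us < s->min)
		s->min = us;
	if (s->n == 0 || us > s->max)
		s->max = us;
	s->sum += us;
	s->n++;
	s->hist[bucket_of(us)]++;
}
```

The average is then s.sum / s.n, and each non-empty hist[b] corresponds to one row of the tables.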
Changes since v5:
* really updated the manpage
* improved comment describing IOMAP_F_DIRTY
* fixed XFS handling of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite()
Changes since v4:
* fixed couple of minor things in the manpage
* make legacy mmap flags always supported, remove them from mask declared
to be supported by ext4 and xfs
Changes since v3:
* updated some changelogs
* folded fs support for VM_SYNC flag into patches implementing the
functionality
* removed ->mmap_validate, use ->mmap_supported_flags instead
* added some Reviewed-by tags
* added manpage patch
Changes since v2:
* avoid unnecessary flushing of faulted page (Ross) - I've realized it makes no
sense to remeasure my benchmark results (after actually doing that and seeing
no difference, sigh) since I use ramdisk and not real PMEM HW and so flushes
are ignored.
* handle nojournal mode of ext4
* other smaller cleanups & fixes (Ross)
* factor larger part of finishing of synchronous fault into a helper (Christoph)
* reorder pfnp argument of dax_iomap_fault() (Christoph)
* add XFS support from Christoph
* use proper MAP_SYNC support in mmap(2)
* rebased on top of 4.14-rc4
Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers
Honza
3 years
[RFC PATCH v4] ndctl: monitor: add ndctl monitor daemon
by QI Fuli
This is the v4 patch for ndctl monitor daemon, a tiny daemon to monitor the
smart events of nvdimm DIMMs. Users can run a monitor as a one-shot command
or a daemon in background by using the [--daemon] option. DIMMs to monitor
can be selected by [--dimm] [--bus] [--region] [--namespace] options.
When a smart event fires, the monitor daemon logs a notification that
includes the dimm health status to syslog, or to a logfile selected with the
[--log] option. The notification is in JSON format and can be consumed by
log collectors like Fluentd.
For example, a monitor daemon can be started by the following command:
# ndctl monitor --dimm nmem1 --log /var/log/monitor.log --daemon daemon-name
Then check the monitor daemon status by using systemd:
# systemctl status ndctl-monitor@daemon-name.service
Stop the monitor daemon with:
# systemctl stop ndctl-monitor@daemon-name.service
A monitor daemon that monitors all dimms can also be started by systemd:
# systemctl start ndctl-monitor.service
In this implementation, when an ndctl monitor starts with the [--daemon] option, all
the arguments will be saved into a temp file named as daemon-name and placed
under /etc/sysconfig/ndctl/ directory. The temp file would be used as an
EnvironmentFile by systemd, and it would be deleted automatically when the
systemd service is stopped.
Because of this deletion, the following commands will not work:
# systemctl enable ndctl-monitor@daemon-name.service
# systemctl restart ndctl-monitor@daemon-name.service
I am not sure whether these commands are needed for the ndctl monitor daemon;
your comments will be appreciated.
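The EnvironmentFile described above holds a single OPTIONS= line built from the command-line arguments. As a sketch only (this is not the code in the patch below), that line can be assembled with bounded formatting so that over-long arguments are rejected rather than overflowing the buffer. appendf(), struct opt and build_options() are hypothetical helpers, and argument values are assumed to contain no whitespace.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One "-x value" pair; a NULL value means the option was not given. */
struct opt {
	const char *flag;
	const char *value;
};

/* Append formatted text at buf[*len]; return -1 if it does not fit. */
static int appendf(char *buf, size_t size, size_t *len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf + *len, size - *len, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= size - *len)
		return -1;	/* error or truncated output */
	*len += (size_t)n;
	return 0;
}

/* Build "OPTIONS=-f [-b bus] [-d dimm] ..." into buf; 0 on success. */
static int build_options(char *buf, size_t size,
			 const struct opt *opts, size_t n)
{
	size_t len = 0;

	assert(buf != NULL);
	if (size == 0 || appendf(buf, size, &len, "OPTIONS=-f"))
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (!opts[i].value)
			continue;
		if (appendf(buf, size, &len, " %s %s",
			    opts[i].flag, opts[i].value))
			return -1;
	}
	return 0;
}
```

A caller would write the resulting buffer to the per-daemon file only when build_options() returns 0.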
Signed-off-by: QI Fuli <qi.fuli(a)jp.fujitsu.com>
Change log since v3:
- Removing create-monitor, show-monitor, list-monitor, destroy-monitor
- Adding [--daemon] option to run ndctl monitor as a daemon
- Using systemd to manage ndctl monitor daemon
- Replacing filter_monitor_dimm() with filter_dimm()
Change log since v2:
- Changing the interface of daemon to the ndctl command line
- Changing the name of daemon from "nvdimmd" to "monitor"
- Removing the config file, unit_file, nvdimmd dir
- Removing nvdimmd_test program
- Adding ndctl/monitor.c
Change log since v1:
- Adding a config file(/etc/nvdimmd/nvdimmd.conf)
- Using struct log_ctx instead of syslog()
- Using log_syslog() to save the notify messages to syslog
- Using log_file() to save the notify messages to special file
- Adding LOG_NOTICE level to log_priority
- Using automake instead of Makefile
- Adding a new util file(nvdimmd/util.c) including helper functions
needed for nvdimm daemon
- Adding nvdimmd_test program
---
builtin.h | 1 +
ndctl/Makefile.am | 13 +-
ndctl/monitor.c | 411 +++++++++++++++++++++++++++++++++++++++++++
ndctl/ndctl-monitor.service | 7 +
ndctl/ndctl-monitor@.service | 9 +
ndctl/ndctl.c | 1 +
util/filter.c | 5 +-
util/filter.h | 3 +
util/parse-options.h | 1 +
9 files changed, 448 insertions(+), 3 deletions(-)
create mode 100644 ndctl/monitor.c
create mode 100644 ndctl/ndctl-monitor.service
create mode 100644 ndctl/ndctl-monitor@.service
diff --git a/builtin.h b/builtin.h
index b24fc99..4b908f0 100644
--- a/builtin.h
+++ b/builtin.h
@@ -36,6 +36,7 @@ int cmd_write_labels(int argc, const char **argv, void *ctx);
int cmd_init_labels(int argc, const char **argv, void *ctx);
int cmd_check_labels(int argc, const char **argv, void *ctx);
int cmd_inject_error(int argc, const char **argv, void *ctx);
+int cmd_monitor(int argc, const char **argv, void *ctx);
int cmd_list(int argc, const char **argv, void *ctx);
#ifdef ENABLE_TEST
int cmd_test(int argc, const char **argv, void *ctx);
diff --git a/ndctl/Makefile.am b/ndctl/Makefile.am
index e0db97b..e364ef9 100644
--- a/ndctl/Makefile.am
+++ b/ndctl/Makefile.am
@@ -16,7 +16,8 @@ ndctl_SOURCES = ndctl.c \
util/json-firmware.c \
inject-error.c \
update.c \
- inject-smart.c
+ inject-smart.c \
+ monitor.c
if ENABLE_DESTRUCTIVE
ndctl_SOURCES += ../test/blk_namespaces.c \
@@ -41,3 +42,13 @@ ndctl_SOURCES += ../test/libndctl.c \
../test/multi-pmem.c \
../test/core.c
endif
+
+unitfiles =\
+ ndctl-monitor.service \
+ ndctl-monitor@.service
+
+unitdir = /usr/lib/systemd/system/
+
+unit_DATA = $(unitfiles)
+
+EXTRA_DIST = $(unitfiles)
diff --git a/ndctl/monitor.c b/ndctl/monitor.c
new file mode 100644
index 0000000..2164f27
--- /dev/null
+++ b/ndctl/monitor.c
@@ -0,0 +1,411 @@
+/*
+ * Copyright (c) 2018, FUJITSU LIMITED. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU Lesser General Public License,
+ * version 2.1, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT ANY
+ * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ * FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for
+ * more details.
+ */
+#include <stdio.h>
+#include <json-c/json.h>
+#include <signal.h>
+#include <libgen.h>
+#include <dirent.h>
+#include <util/parse-options.h>
+#include <util/log.h>
+#include <util/json.h>
+#include <util/filter.h>
+#include <ndctl/lib/private.h>
+#include <ndctl/libndctl.h>
+#include <sys/stat.h>
+#define NUM_MAX_DIMM 1024
+#define BUF_SIZE 4096
+
+struct monitor_dimm {
+ struct ndctl_dimm *dimm;
+ int health_eventfd;
+ struct list_node list;
+};
+
+struct monitor_filter_arg {
+ struct list_head mdimm;
+ int maxfd;
+ fd_set fds;
+ int num_dimm;
+ unsigned long flags;
+};
+
+struct util_filter_params param;
+
+static char *conf_dir = "/etc/sysconfig/ndctl/";
+static char *def_log_dir = "/var/log/ndctl/";
+
+static char *concat(char *str1, char *str2)
+{
+ char *result = malloc(strlen(str1) + strlen(str2) + 1);
+ strcpy(result, str1);
+ strcat(result, str2);
+ return result;
+}
+
+static bool is_dir(char *filepath)
+{
+ DIR *dir = opendir(filepath);
+ if (dir) {
+ closedir(dir);
+ return true;
+ }
+ return false;
+}
+
+static int recur_mkdir(char *filepath, mode_t mode)
+{
+ char *p;
+ char *buf = (char *)malloc(strlen(filepath) + 4);
+
+ strcpy(buf, filepath);
+ for (p = strchr(buf + 1, '/'); p; p = strchr(p + 1, '/')) {
+ *p = '\0';
+ if (!is_dir(buf)) {
+ if (mkdir(buf, mode) < 0) {
+ free(buf);
+ return -1;
+ }
+ }
+ *p = '/';
+ }
+
+ free(buf);
+ return 0;
+}
+
+static void log_file(struct ndctl_ctx *ctx, int priority, const char *file,
+ int line, const char *fn, const char *format, va_list args)
+{
+ FILE *f;
+ char *log_name, *log_dir, *buf;
+ char *errmsg = (char *)malloc(BUF_SIZE);
+ char *tail = "/";
+
+ log_dir = dirname(strdup(param.log));
+ log_name = basename(strdup(param.log));
+ if (strcmp(log_dir, ".") == 0)
+ log_dir = def_log_dir;
+ else
+ log_dir = concat(log_dir, tail);
+ if (log_dir[0] != '/')
+ log_dir = concat(def_log_dir, log_dir);
+ log_name = concat(log_dir, log_name);
+
+ if (!is_dir(log_dir)) {
+ if (recur_mkdir(log_dir, 0744) != 0) {
+ sprintf(errmsg, "cannot create dir: %s", log_dir);
+ goto out;
+ }
+ }
+
+ f = fopen(log_name, "a+");
+ if (!f) {
+ sprintf(errmsg, "open %s failed", log_name);
+ goto out;
+ }
+
+ buf = (char *)malloc(BUF_SIZE);
+ if (!buf) {
+ sprintf(errmsg, "cannot get memory for log_file");
+ goto out;
+ }
+ vsnprintf(buf, BUF_SIZE, format, args);
+ fprintf(f, "%s\n", buf);
+ free(buf);
+ fclose(f);
+ return;
+out:
+ syslog(LOG_ERR, "%s\n", errmsg);
+ if (!param.fork)
+ error("%s\n", errmsg);
+ free(errmsg);
+ exit(EXIT_FAILURE);
+}
+
+static void log_syslog(struct ndctl_ctx *ctx, int priority, const char *file,
+ int line, const char *fn, const char *format, va_list args)
+{
+ char *buf = (char *)malloc(BUF_SIZE);
+ vsnprintf(buf, BUF_SIZE, format, args);
+ syslog(priority, "%s", buf);
+ free(buf);
+}
+
+#define fail(fmt, ...) \
+do { \
+ err(ctx, "ndctl-%s:%s:%d: " fmt, \
+ VERSION, __func__, __LINE__, ##__VA_ARGS__); \
+ if (!param.fork) \
+ fprintf(stderr, "ndctl-%s:%s:%d: " fmt, \
+ VERSION, __func__, __LINE__, ##__VA_ARGS__); \
+} while (0)
+
+static int notify_json_msg(struct ndctl_ctx *ctx, struct ndctl_dimm *dimm)
+{
+ time_t c_time;
+ char date[32];
+ struct json_object *jmsg, *jdatetime, *jpid, *jdimm, *jhealth;
+
+ jmsg = json_object_new_object();
+ if (!jmsg) {
+ fail("\n");
+ return -1;
+ }
+
+ c_time = time(NULL);
+ strftime(date, sizeof(date), "%Y-%m-%d %H:%M:%S", localtime(&c_time));
+ jdatetime = json_object_new_string(date);
+ if (!jdatetime) {
+ fail("\n");
+ return -1;
+ }
+ json_object_object_add(jmsg, "datetime", jdatetime);
+
+ jpid = json_object_new_int((int)getpid());
+ if (!jpid) {
+ fail("\n");
+ return -1;
+ }
+ json_object_object_add(jmsg, "pid", jpid);
+
+ jdimm = util_dimm_to_json(dimm, 0);
+ if (!dimm) {
+ fail("\n");
+ return -1;
+ }
+ json_object_object_add(jmsg, "dimm", jdimm);
+
+ jhealth = util_dimm_health_to_json(dimm);
+ if (!jhealth) {
+ fail("\n");
+ return -1;
+ }
+ json_object_object_add(jdimm, "health", jhealth);
+
+ notice(ctx, "%s",
+ json_object_to_json_string_ext(jmsg, JSON_C_TO_STRING_PLAIN));
+ if (!param.fork)
+ printf("%s\n", json_object_to_json_string_ext(jmsg,
+ JSON_C_TO_STRING_PRETTY));
+ return 0;
+}
+
+static void filter_dimm(struct ndctl_dimm *dimm, struct util_filter_ctx *ctx)
+{
+ struct monitor_filter_arg *mfa = (struct monitor_filter_arg *)ctx->arg;
+ int fd;
+ char buf[BUF_SIZE];
+
+ if (!ndctl_dimm_is_cmd_supported(dimm, ND_CMD_SMART_THRESHOLD))
+ return;
+
+ struct monitor_dimm *m_dimm = malloc(sizeof(*m_dimm));
+ m_dimm->dimm = dimm;
+ fd = ndctl_dimm_get_health_eventfd(dimm);
+ pread(fd, buf, sizeof(buf), 0);
+ m_dimm->health_eventfd = fd;
+ list_add_tail(&mfa->mdimm, &m_dimm->list);
+ FD_SET(fd, &mfa->fds);
+ if (fd > mfa->maxfd)
+ mfa->maxfd = fd;
+ mfa->num_dimm++;
+}
+
+static int monitor_smart_event(struct ndctl_ctx *ctx)
+{
+ struct util_filter_ctx fctx = { 0 };
+ struct monitor_filter_arg mfa = { 0 };
+ int rc;
+ char buf[BUF_SIZE];
+ char *errmsg = (char *)malloc(BUF_SIZE);
+
+ fctx.filter_dimm = filter_dimm;
+ fctx.arg = &mfa;
+ mfa.flags = 0;
+ list_head_init(&mfa.mdimm);
+ FD_ZERO(&mfa.fds);
+
+ rc = util_filter_walk(ctx, &fctx, &param);
+ if (rc)
+ goto out;
+ if (mfa.num_dimm == 0) {
+ errmsg = "no monitor dimms can be found";
+ goto out;
+ }
+
+ while(1){
+ rc = select(mfa.maxfd + 1, NULL, NULL, &mfa.fds, NULL);
+ if (rc < 1) {
+ errmsg = "select error";
+ if (rc == 0)
+ dbg(ctx, "select unexpected timeout\n");
+ else
+ dbg(ctx, "select %s\n", strerror(errno));
+ goto out;
+ }
+ struct monitor_dimm *m_dimm;
+ list_for_each(&mfa.mdimm, m_dimm, list) {
+ if (!FD_ISSET(m_dimm->health_eventfd, &mfa.fds)) {
+ FD_SET(m_dimm->health_eventfd, &mfa.fds);
+ continue;
+ }
+ if (notify_json_msg(ctx, m_dimm->dimm) != 0)
+ goto out;
+ pread(m_dimm->health_eventfd, buf, sizeof(buf), 0);
+ }
+ }
+ return 0;
+out:
+ if (errmsg) {
+ if (!param.fork)
+ error("%s\n", errmsg);
+ err(ctx, "%s\n", errmsg);
+ }
+ return 1;
+}
+
+static int create_confile(char *conf_path)
+{
+ FILE *f;
+ char *buf;
+
+ if (!is_dir(conf_dir)) {
+ if (recur_mkdir(conf_dir, 0744) != 0) {
+ error("cannot create dir: %s\n", conf_dir);
+ goto out;
+ }
+ }
+
+ f = fopen(conf_path, "w");
+ if (!f) {
+ error("open %s failed\n", conf_path);
+ goto out;
+ }
+
+ buf = (char *)malloc(BUF_SIZE);
+ if (!buf) {
+ error("cannot get memory for daemon config file\n");
+ goto out;
+ }
+ strcpy(buf, "OPTIONS=-f");
+ if (param.bus) {
+ strcat(buf, " -b ");
+ strcat(buf, param.bus);
+ }
+ if (param.dimm) {
+ strcat(buf, " -d ");
+ strcat(buf, param.dimm);
+ }
+ if (param.namespace) {
+ strcat(buf, " -n ");
+ strcat(buf, param.namespace);
+ }
+ if (param.region) {
+ strcat(buf, " -r ");
+ strcat(buf, param.region);
+ }
+ if (param.log) {
+ strcat(buf, " -l ");
+ strcat(buf, param.log);
+ }
+ fprintf(f, "%s", buf);
+ fclose(f);
+ free(buf);
+ return 0;
+out:
+ return 1;
+}
+
+static bool is_monitor_exist(void)
+{
+ char *conf_path = strdup(param.daemon);
+ conf_path = concat(conf_dir, conf_path);
+ FILE *f = fopen(conf_path, "r");
+ if (f) {
+ fclose(f);
+ return true;
+ }
+ return false;
+}
+
+int cmd_monitor(int argc, const char **argv, void *ctx)
+{
+ const struct option options[] = {
+ OPT_STRING('b', "bus", &param.bus, "bus-id", "filter by bus"),
+ OPT_STRING('r', "region", &param.region, "region-id",
+ "filter by region"),
+ OPT_STRING('d', "dimm", &param.dimm, "dimm-id",
+ "filter by dimm"),
+ OPT_STRING('n', "namespace", &param.namespace,
+ "namespace-id", "filter by namespace id"),
+ OPT_STRING('l', "log", &param.log, "log name",
+ "monitor logfile"),
+ OPT_STRING('D',"daemon", &param.daemon, "daemon-name",
+ "run ndctl monitor as a daemon"),
+ OPT_BOOLEAN_HID('f', &param.fork),
+ OPT_END(),
+ };
+ const char * const u[] = {
+ "ndctl monitor [<options>]",
+ NULL
+ };
+ argc = parse_options(argc, argv, options, u, 0);
+ for (int i = 0; i < argc; i++) {
+ error("unknown parameter \"%s\"\n", argv[i]);
+ goto out;
+ }
+ if (argc)
+ usage_with_options(u, options);
+
+ if (param.daemon) {
+ if (is_monitor_exist()) {
+ error("monitor %s is exist\n", param.daemon);
+ goto out;
+ }
+ char *conf_path = strdup(param.daemon);
+ conf_path = concat(conf_dir, conf_path);
+ if (create_confile(conf_path) != 0)
+ goto out;
+ char *sys_cmd = (char *)malloc(BUF_SIZE);
+ sprintf(sys_cmd, "systemctl start ndctl-monitor@%s.service",
+ param.daemon);
+ if (system(sys_cmd) != 0) {
+ free(sys_cmd);
+ remove(conf_path);
+ goto out;
+ }
+ free(sys_cmd);
+ return 0;
+ }
+
+ if (param.fork) {
+ if (daemon(0, 0) != 0) {
+ err((struct ndctl_ctx*)ctx, "daemon start failed\n");
+ goto out;
+ }
+ }
+
+ ndctl_set_log_priority((struct ndctl_ctx*)ctx, LOG_NOTICE);
+
+ if (!param.log || strcmp(param.log, "syslog") == 0)
+ ndctl_set_log_fn((struct ndctl_ctx*)ctx, log_syslog);
+ else
+ ndctl_set_log_fn((struct ndctl_ctx*)ctx, log_file);
+
+ if (monitor_smart_event((struct ndctl_ctx*)ctx) != 0)
+ goto out;
+
+ return 0;
+out:
+ return 1;
+}
diff --git a/ndctl/ndctl-monitor.service b/ndctl/ndctl-monitor.service
new file mode 100644
index 0000000..2f1417b
--- /dev/null
+++ b/ndctl/ndctl-monitor.service
@@ -0,0 +1,7 @@
+[Unit]
+Description=Ndctl Monitor Daemon
+
+[Service]
+Type=forking
+ExecStart=/usr/bin/ndctl monitor -f
+ExecStop=/usr/bin/kill ${MAINPID}
diff --git a/ndctl/ndctl-monitor@.service b/ndctl/ndctl-monitor@.service
new file mode 100644
index 0000000..91c85d9
--- /dev/null
+++ b/ndctl/ndctl-monitor@.service
@@ -0,0 +1,9 @@
+[Unit]
+Description=Ndctl Monitor Daemon
+
+[Service]
+Type=forking
+EnvironmentFile=/etc/sysconfig/ndctl/%i
+ExecStart=/usr/bin/ndctl monitor $OPTIONS
+ExecStop=/usr/bin/kill ${MAINPID}
+ExecStopPost=/usr/bin/rm /etc/sysconfig/ndctl/%i
diff --git a/ndctl/ndctl.c b/ndctl/ndctl.c
index d3c6db1..8938621 100644
--- a/ndctl/ndctl.c
+++ b/ndctl/ndctl.c
@@ -86,6 +86,7 @@ static struct cmd_struct commands[] = {
{ "inject-error", cmd_inject_error },
{ "update-firmware", cmd_update_firmware },
{ "inject-smart", cmd_inject_smart },
+ { "monitor", cmd_monitor },
{ "list", cmd_list },
{ "help", cmd_help },
#ifdef ENABLE_TEST
diff --git a/util/filter.c b/util/filter.c
index b0b7fdf..fba2197 100644
--- a/util/filter.c
+++ b/util/filter.c
@@ -315,7 +315,7 @@ int util_filter_walk(struct ndctl_ctx *ctx, struct util_filter_ctx *fctx,
|| !util_bus_filter_by_namespace(bus, param->namespace))
continue;
- if (!fctx->filter_bus(bus, fctx))
+ if (fctx->filter_bus && !fctx->filter_bus(bus, fctx))
continue;
ndctl_dimm_foreach(bus, dimm) {
@@ -345,7 +345,8 @@ int util_filter_walk(struct ndctl_ctx *ctx, struct util_filter_ctx *fctx,
if (type && ndctl_region_get_type(region) != type)
continue;
- if (!fctx->filter_region(region, fctx))
+ if (fctx->filter_region &&
+ !fctx->filter_region(region, fctx))
continue;
ndctl_namespace_foreach(region, ndns) {
diff --git a/util/filter.h b/util/filter.h
index aea5a71..82f3b0d 100644
--- a/util/filter.h
+++ b/util/filter.h
@@ -77,6 +77,9 @@ struct util_filter_params {
const char *dimm;
const char *mode;
const char *namespace;
+ const char *log;
+ const char *daemon;
+ bool fork;
};
struct ndctl_ctx;
diff --git a/util/parse-options.h b/util/parse-options.h
index 6fd6b24..2262c42 100644
--- a/util/parse-options.h
+++ b/util/parse-options.h
@@ -123,6 +123,7 @@ struct option {
#define OPT_GROUP(h) { .type = OPTION_GROUP, .help = (h) }
#define OPT_BIT(s, l, v, h, b) { .type = OPTION_BIT, .short_name = (s), .long_name = (l), .value = check_vtype(v, int *), .help = (h), .defval = (b) }
#define OPT_BOOLEAN(s, l, v, h) { .type = OPTION_BOOLEAN, .short_name = (s), .long_name = (l), .value = check_vtype(v, bool *), .help = (h) }
+#define OPT_BOOLEAN_HID(s, v) { .type = OPTION_BOOLEAN, .short_name = (s), .value = check_vtype(v, bool *), .flags = PARSE_OPT_HIDDEN}
#define OPT_BOOLEAN_SET(s, l, v, os, h) \
{ .type = OPTION_BOOLEAN, .short_name = (s), .long_name = (l), \
.value = check_vtype(v, bool *), .help = (h), \
--
2.9.5
3 years
[PATCH 1/3] Avoid filename truncation in numastat
by Ross Zwisler
gcc 7.3.1 provides the following warning when compiling numastat.c:
numastat.c: In function ‘add_pids_from_pattern_search’:
numastat.c:1316:41: warning: ‘%s’ directive output may be truncated writing
up to 255 bytes into a region of size 58 [-Wformat-truncation=]
snprintf(fname, sizeof(fname), "/proc/%s/cmdline", namelist[ix]->d_name);
^~
numastat.c:1316:3: note: ‘snprintf’ output between 15 and 270 bytes into a
destination of size 64
snprintf(fname, sizeof(fname), "/proc/%s/cmdline", namelist[ix]->d_name);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The warning is valid: namelist[ix]->d_name can be up to 256 bytes, and the
format string adds more on top of that, while our destination buffer,
'fname', is only 64 bytes wide. Enlarge 'fname' to 272 bytes, which covers
the 270-byte worst case that gcc reports.
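Independently of the buffer size, truncation can also be detected at run time from the return value of snprintf(). A minimal sketch, assuming a hypothetical helper proc_cmdline_path() that is not part of numastat:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format "/proc/<name>/cmdline" into buf.
 * Returns 0 on success, or -1 if snprintf() failed or the result did not
 * fit (in which case buf holds a truncated, NUL-terminated string).
 */
static int proc_cmdline_path(char *buf, size_t size, const char *name)
{
	int n;

	assert(name != NULL);
	n = snprintf(buf, size, "/proc/%s/cmdline", name);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

snprintf() returns the length the full string would have had, so any value greater than or equal to the buffer size means the output was cut short.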
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
numastat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/numastat.c b/numastat.c
index e0a5639..2d413df 100644
--- a/numastat.c
+++ b/numastat.c
@@ -1312,7 +1312,7 @@ void add_pids_from_pattern_search(char *pattern) {
}
// Next copy cmdline file contents onto end of buffer. Do it a
// character at a time to convert nulls to spaces.
- char fname[64];
+ char fname[272];
snprintf(fname, sizeof(fname), "/proc/%s/cmdline", namelist[ix]->d_name);
FILE *fs = fopen(fname, "r");
if (fs) {
--
2.14.3
3 years