[PATCH v6 0/8] Support new pmem flush and sync instructions for POWER
by Aneesh Kumar K.V
This patch series enables the use of new pmem flush and sync instructions on the POWER
architecture. POWER10 introduces two new variants of the dcbf instruction (dcbstps and dcbfps)
that can be used to write modified locations back to persistent storage. Additionally,
POWER10 also introduces phwsync and plwsync, which can be used to establish ordering of these
writes to persistent storage.
This series exposes these instructions to the rest of the kernel. The existing
dcbf and hwsync instructions on P8 and P9 are adequate to enable appropriate
synchronization with OpenCAPI-hosted persistent storage. Hence the new instructions
are added as variants of the old ones that old hardware won't differentiate.
On POWER10, pmem devices will be represented by a different device tree compat
string. This ensures that older kernels won't initialize pmem devices on POWER10.
With this:
1) vPMEM continues to work since it is a volatile region, which
doesn't need any flush instructions.
2) pmdk and other user applications get updated to use the new instructions,
and updated packages are made available to all distributions
3) On newer hardware, the device will appear with a new compat string.
Hence older distributions won't initialize pmem on newer hardware.
Changes from v5:
* Drop CONFIG_ARCH_MAP_SYNC_DISABLE and related changes
Changes from V4:
* Add namespace specific synchronous fault control.
Changes from V3:
* Add new compat string to be used for the device.
* Use arch_pmem_flush_barrier() in dm-writecache.
Aneesh Kumar K.V (8):
powerpc/pmem: Restrict papr_scm to P8 and above.
powerpc/pmem: Add new instructions for persistent storage and sync
powerpc/pmem: Add flush routines using new pmem store and sync
instruction
libnvdimm/nvdimm/flush: Allow architecture to override the flush
barrier
powerpc/pmem/of_pmem: Update of_pmem to use the new barrier
instruction.
powerpc/pmem: Avoid the barrier in flush routines
powerpc/pmem: Add WARN_ONCE to catch the wrong usage of pmem flush
functions.
powerpc/pmem: Initialize pmem device on newer hardware
arch/powerpc/include/asm/cacheflush.h | 10 +++++
arch/powerpc/include/asm/ppc-opcode.h | 12 ++++++
arch/powerpc/lib/pmem.c | 46 +++++++++++++++++++++--
arch/powerpc/platforms/pseries/papr_scm.c | 14 +++++++
arch/powerpc/platforms/pseries/pmem.c | 6 +++
drivers/md/dm-writecache.c | 2 +-
drivers/nvdimm/of_pmem.c | 1 +
drivers/nvdimm/region_devs.c | 8 ++--
include/asm-generic/cacheflush.h | 4 ++
9 files changed, 94 insertions(+), 9 deletions(-)
--
2.26.2
6 months, 3 weeks
[PATCH] dm writecache: reject asynchronous pmem.
by Michal Suchanek
The writecache driver does not handle asynchronous pmem. Reject it when
supplied as a cache.
Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc.fsf@linux.ibm.com/
Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
Signed-off-by: Michal Suchanek <msuchanek(a)suse.de>
---
drivers/md/dm-writecache.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 30505d70f423..57b0a972f6fd 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -2277,6 +2277,12 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv)
wc->memory_map_size -= (uint64_t)wc->start_sector << SECTOR_SHIFT;
+ if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
+ r = -EOPNOTSUPP;
+ ti->error = "Asynchronous persistent memory not supported as pmem cache";
+ goto bad;
+ }
+
bio_list_init(&wc->flush_list);
wc->flush_thread = kthread_create(writecache_flush_thread, wc, "dm_writecache_flush");
if (IS_ERR(wc->flush_thread)) {
--
2.26.2
6 months, 3 weeks
[PATCH v2 1/5] powerpc/pmem: Add new instructions for persistent storage and sync
by Aneesh Kumar K.V
POWER10 introduces two new variants of dcbf instructions (dcbstps and dcbfps)
that can be used to write modified locations back to persistent storage.
Additionally, POWER10 also introduces phwsync and plwsync, which can be used
to establish ordering of these writes to persistent storage.
This patch exposes these instructions to the rest of the kernel. The existing
dcbf and hwsync instructions in P9 are adequate to enable appropriate
synchronization with OpenCAPI-hosted persistent storage. Hence the new
instructions are added as variants of the old ones that old hardware
won't differentiate.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
arch/powerpc/include/asm/ppc-opcode.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index c1df75edde44..45eccd842f84 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -216,6 +216,8 @@
#define PPC_INST_STWCX 0x7c00012d
#define PPC_INST_LWSYNC 0x7c2004ac
#define PPC_INST_SYNC 0x7c0004ac
+#define PPC_INST_PHWSYNC 0x7c8004ac
+#define PPC_INST_PLWSYNC 0x7ca004ac
#define PPC_INST_SYNC_MASK 0xfc0007fe
#define PPC_INST_ISYNC 0x4c00012c
#define PPC_INST_LXVD2X 0x7c000698
@@ -281,6 +283,8 @@
#define PPC_INST_TABORT 0x7c00071d
#define PPC_INST_TSR 0x7c0005dd
+#define PPC_INST_DCBF 0x7c0000ac
+
#define PPC_INST_NAP 0x4c000364
#define PPC_INST_SLEEP 0x4c0003a4
#define PPC_INST_WINKLE 0x4c0003e4
@@ -529,6 +533,14 @@
#define STBCIX(s,a,b) stringify_in_c(.long PPC_INST_STBCIX | \
__PPC_RS(s) | __PPC_RA(a) | __PPC_RB(b))
+#define PPC_DCBFPS(a, b) stringify_in_c(.long PPC_INST_DCBF | \
+ ___PPC_RA(a) | ___PPC_RB(b) | (4 << 21))
+#define PPC_DCBSTPS(a, b) stringify_in_c(.long PPC_INST_DCBF | \
+ ___PPC_RA(a) | ___PPC_RB(b) | (6 << 21))
+
+#define PPC_PHWSYNC stringify_in_c(.long PPC_INST_PHWSYNC)
+#define PPC_PLWSYNC stringify_in_c(.long PPC_INST_PLWSYNC)
+
/*
* Define what the VSX XX1 form instructions will look like, then add
* the 128 bit load store instructions based on that.
--
2.26.2
6 months, 3 weeks
[RFC] Make the memory failure blast radius more precise
by Matthew Wilcox
Hardware actually tells us the blast radius of the error, but we ignore
it and take out the entire page. We've had a customer request to know
exactly how much of the page is damaged so they can avoid reconstructing
an entire 2MB page if only a single cacheline is damaged.
This is only a strawman that I did in an hour or two; I'd appreciate
architectural-level feedback. Should I just convert memory_failure() to
always take an address & granularity? Should I create a struct to pass
around (page, phys, granularity) instead of reconstructing the missing
pieces in half a dozen functions? Is this functionality welcome at all,
or is the risk of upsetting applications which expect at least a page
of granularity too high?
I can see places where I've specified a plain PAGE_SHIFT instead of
interrogating a compound page for its size. I'd probably split this
patch up into two or three pieces for applying.
I've also blindly taken out the call to unmap_mapping_range(). Again,
the customer requested that we not do this. That deserves to be in its
own patch and properly justified.
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ce9120c4f740..09978c5b9493 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -463,7 +463,7 @@ static void mce_irq_work_cb(struct irq_work *entry)
* Check if the address reported by the CPU is in a format we can parse.
* It would be possible to add code for most other cases, but all would
* be somewhat complicated (e.g. segment offset would require an instruction
- * parser). So only support physical addresses up to page granuality for now.
+ * parser). So only support physical addresses up to page granularity for now.
*/
int mce_usable_address(struct mce *m)
{
@@ -570,7 +570,6 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
struct mce *mce = (struct mce *)data;
- unsigned long pfn;
if (!mce || !mce_usable_address(mce))
return NOTIFY_DONE;
@@ -579,9 +578,8 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
mce->severity != MCE_DEFERRED_SEVERITY)
return NOTIFY_DONE;
- pfn = mce->addr >> PAGE_SHIFT;
- if (!memory_failure(pfn, 0)) {
- set_mce_nospec(pfn, whole_page(mce));
+ if (!__memory_failure(mce->addr, MCI_MISC_ADDR_LSB(mce->misc), 0)) {
+ set_mce_nospec(mce->addr >> PAGE_SHIFT, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..f2ea6491df28 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3035,7 +3035,14 @@ enum mf_flags {
MF_MUST_KILL = 1 << 2,
MF_SOFT_OFFLINE = 1 << 3,
};
-extern int memory_failure(unsigned long pfn, int flags);
+
+int __memory_failure(unsigned long addr, unsigned int granularity, int flags);
+
+static inline int memory_failure(unsigned long pfn, int flags)
+{
+ return __memory_failure(pfn << PAGE_SHIFT, PAGE_SHIFT, flags);
+}
+
extern void memory_failure_queue(unsigned long pfn, int flags);
extern void memory_failure_queue_kick(int cpu);
extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b8ccb1fb9b..775b053dcd98 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -198,7 +198,7 @@ struct to_kill {
struct list_head nd;
struct task_struct *tsk;
unsigned long addr;
- short size_shift;
+ short granularity;
};
/*
@@ -209,7 +209,7 @@ struct to_kill {
static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
{
struct task_struct *t = tk->tsk;
- short addr_lsb = tk->size_shift;
+ short addr_lsb = tk->granularity;
int ret = 0;
pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
@@ -262,40 +262,6 @@ void shake_page(struct page *p, int access)
}
EXPORT_SYMBOL_GPL(shake_page);
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
- struct vm_area_struct *vma)
-{
- unsigned long address = vma_address(page, vma);
- pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- pgd = pgd_offset(vma->vm_mm, address);
- if (!pgd_present(*pgd))
- return 0;
- p4d = p4d_offset(pgd, address);
- if (!p4d_present(*p4d))
- return 0;
- pud = pud_offset(p4d, address);
- if (!pud_present(*pud))
- return 0;
- if (pud_devmap(*pud))
- return PUD_SHIFT;
- pmd = pmd_offset(pud, address);
- if (!pmd_present(*pmd))
- return 0;
- if (pmd_devmap(*pmd))
- return PMD_SHIFT;
- pte = pte_offset_map(pmd, address);
- if (!pte_present(*pte))
- return 0;
- if (pte_devmap(*pte))
- return PAGE_SHIFT;
- return 0;
-}
-
/*
* Failure handling: if we can't find or can't kill a process there's
* not much we can do. We just print a message and ignore otherwise.
@@ -306,8 +272,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
* Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
*/
static void add_to_kill(struct task_struct *tsk, struct page *p,
- struct vm_area_struct *vma,
- struct list_head *to_kill)
+ unsigned int offset, unsigned int granularity,
+ struct vm_area_struct *vma, struct list_head *to_kill)
{
struct to_kill *tk;
@@ -318,27 +284,14 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
}
tk->addr = page_address_in_vma(p, vma);
- if (is_zone_device_page(p))
- tk->size_shift = dev_pagemap_mapping_shift(p, vma);
- else
- tk->size_shift = page_shift(compound_head(p));
+ tk->granularity = granularity;
- /*
- * Send SIGKILL if "tk->addr == -EFAULT". Also, as
- * "tk->size_shift" is always non-zero for !is_zone_device_page(),
- * so "tk->size_shift == 0" effectively checks no mapping on
- * ZONE_DEVICE. Indeed, when a devdax page is mmapped N times
- * to a process' address space, it's possible not all N VMAs
- * contain mappings for the page, but at least one VMA does.
- * Only deliver SIGBUS with payload derived from the VMA that
- * has a mapping for the page.
- */
+ /* Send SIGKILL if "tk->addr == -EFAULT". */
if (tk->addr == -EFAULT) {
pr_info("Memory failure: Unable to find user space address %lx in %s\n",
page_to_pfn(p), tsk->comm);
- } else if (tk->size_shift == 0) {
- kfree(tk);
- return;
+ } else {
+ tk->addr += offset;
}
get_task_struct(tsk);
@@ -468,7 +421,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
- add_to_kill(t, page, vma, to_kill);
+ add_to_kill(t, page, 0, PAGE_SHIFT, vma,
+ to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -478,8 +432,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
/*
* Collect processes when the error hit a file mapped page.
*/
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
- int force_early)
+static void collect_procs_file(struct page *page, unsigned int offset,
+ short granularity, struct list_head *to_kill, int force_early)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -503,7 +457,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
* to be informed of all such data corruptions.
*/
if (vma->vm_mm == t->mm)
- add_to_kill(t, page, vma, to_kill);
+ add_to_kill(t, page, 0, PAGE_SHIFT, vma,
+ to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -513,8 +468,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
/*
* Collect the processes who have the corrupted page mapped to kill.
*/
-static void collect_procs(struct page *page, struct list_head *tokill,
- int force_early)
+static void collect_procs(struct page *page, unsigned int offset,
+ short granularity, struct list_head *tokill, int force_early)
{
if (!page->mapping)
return;
@@ -522,7 +477,8 @@ static void collect_procs(struct page *page, struct list_head *tokill,
if (PageAnon(page))
collect_procs_anon(page, tokill, force_early);
else
- collect_procs_file(page, tokill, force_early);
+ collect_procs_file(page, offset, granularity, tokill,
+ force_early);
}
static const char *action_name[] = {
@@ -1026,7 +982,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
* there's nothing that can be done.
*/
if (kill)
- collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
+ collect_procs(hpage, 0, PAGE_SHIFT, &tokill,
+ flags & MF_ACTION_REQUIRED);
if (!PageHuge(hpage)) {
unmap_success = try_to_unmap(hpage, ttu);
@@ -1176,16 +1133,14 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
return res;
}
-static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
- struct dev_pagemap *pgmap)
+static int memory_failure_dev_pagemap(unsigned long addr, short granularity,
+ int flags, struct dev_pagemap *pgmap)
{
+ unsigned long pfn = addr >> PAGE_SHIFT;
struct page *page = pfn_to_page(pfn);
const bool unmap_success = true;
- unsigned long size = 0;
- struct to_kill *tk;
LIST_HEAD(tokill);
int rc = -EBUSY;
- loff_t start;
dax_entry_t cookie;
/*
@@ -1225,21 +1180,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* SIGBUS (i.e. MF_MUST_KILL)
*/
flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
- collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
-
- list_for_each_entry(tk, &tokill, nd)
- if (tk->size_shift)
- size = max(size, 1UL << tk->size_shift);
- if (size) {
- /*
- * Unmap the largest mapping to avoid breaking up
- * device-dax mappings which are constant size. The
- * actual size of the mapping being torn down is
- * communicated in siginfo, see kill_proc()
- */
- start = (page->index << PAGE_SHIFT) & ~(size - 1);
- unmap_mapping_range(page->mapping, start, start + size, 0);
- }
+ collect_procs(page, offset_in_page(addr), granularity, &tokill,
+ flags & MF_ACTION_REQUIRED);
kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
rc = 0;
unlock:
@@ -1252,8 +1194,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
}
/**
- * memory_failure - Handle memory failure of a page.
- * @pfn: Page Number of the corrupted page
+ * __memory_failure - Handle memory failure.
+ * @addr: Physical address of the error.
+ * @granularity: Number of bits in the address which are bad.
* @flags: fine tune action taken
*
* This function is called by the low level machine check code
@@ -1268,8 +1211,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks hold.
*/
-int memory_failure(unsigned long pfn, int flags)
+int __memory_failure(unsigned long addr, unsigned int granularity, int flags)
{
+ unsigned long pfn = addr >> PAGE_SHIFT;
struct page *p;
struct page *hpage;
struct page *orig_head;
@@ -1285,8 +1229,8 @@ int memory_failure(unsigned long pfn, int flags)
if (pfn_valid(pfn)) {
pgmap = get_dev_pagemap(pfn, NULL);
if (pgmap)
- return memory_failure_dev_pagemap(pfn, flags,
- pgmap);
+ return memory_failure_dev_pagemap(addr,
+ granularity, flags, pgmap);
}
pr_err("Memory failure: %#lx: memory outside kernel control\n",
pfn);
@@ -1442,7 +1386,7 @@ int memory_failure(unsigned long pfn, int flags)
unlock_page(p);
return res;
}
-EXPORT_SYMBOL_GPL(memory_failure);
+EXPORT_SYMBOL_GPL(__memory_failure);
#define MEMORY_FAILURE_FIFO_ORDER 4
#define MEMORY_FAILURE_FIFO_SIZE (1 << MEMORY_FAILURE_FIFO_ORDER)
6 months, 3 weeks
[PATCH 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
by Vaibhav Jain
This small patchset implements kernel side support for reporting
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL side changes [1], the
output should look like:
$ sudo ndctl list -DH
[
{
"dev":"nmem0",
"health":{
"health_state":"ok",
"life_used_percentage":0,
"shutdown_state":"clean"
}
}
]
PHYP supports the H_SCM_PERFORMANCE_STATS hcall, through which an LPAR can
fetch various performance stats, including the 'fuel_gauge' percentage for
an NVDIMM. The 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM, expressed as a percentage, so 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.
Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.
References
==========
[1]
https://github.com/vaibhav92/ndctl/tree/papr_scm_health_v13_run_guage
Vaibhav Jain (2):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +
arch/powerpc/platforms/pseries/papr_scm.c | 186 ++++++++++++++++++
3 files changed, 222 insertions(+)
--
2.26.2
6 months, 4 weeks
Re: Question on PMEM regions (Linux 4.9 Kernel & above)
by Dan Williams
On Fri, Jun 19, 2020 at 7:02 PM Ananth, Rajesh <Rajesh.Ananth(a)smartm.com> wrote:
>
> We used Ubuntu 18.04 to get the "acpidump" outputs (this is the only complete package distribution we have; otherwise, we mainly use built kernels). The NFIT data is all valid, but somehow it is printing the "@ address" at the beginning as zeros.
>
> ============================= acpidump -n NFIT ========================================
>
> NFIT @ 0x0000000000000000 <<<< DON'T KNOW WHY.
That's fine, acpixtract was still able to convert it... I see this
from disassembling it:
acpixtract -s NFIT nfit.txt
iasl -d nfit.dat
cat nfit.dsl
[000h 0000 4] Signature : "NFIT" [NVDIMM
Firmware Interface Table]
[004h 0004 4] Table Length : 000001A4
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : 83
[00Ah 0010 6] Oem ID : "ALASKA"
[010h 0016 8] Oem Table ID : "A M I "
[018h 0024 4] Oem Revision : 00000002
[01Ch 0028 4] Asl Compiler ID : "INTL"
[020h 0032 4] Asl Compiler Revision : 20091013
[024h 0036 4] Reserved : 00000000
[028h 0040 2] Subtable Type : 0000 [System Physical
Address Range]
[02Ah 0042 2] Length : 0038
[02Ch 0044 2] Range Index : 0001
[02Eh 0046 2] Flags (decoded below) : 0002
Add/Online Operation Only : 0
Proximity Domain Valid : 1
[030h 0048 4] Reserved : 00000000
[034h 0052 4] Proximity Domain : 00000000
[038h 0056 16] Region Type GUID :
66F0D379-B4F3-4074-AC43-0D3318B78CDB
[048h 0072 8] Address Range Base : 0000004080000000
[050h 0080 8] Address Range Length : 0000000400000000
[058h 0088 8] Memory Map Attribute : 0000000000008008
[..]
[0E0h 0224 2] Subtable Type : 0000 [System Physical
Address Range]
[0E2h 0226 2] Length : 0038
[0E4h 0228 2] Range Index : 0002
[0E6h 0230 2] Flags (decoded below) : 0002
Add/Online Operation Only : 0
Proximity Domain Valid : 1
[0E8h 0232 4] Reserved : 00000000
[0ECh 0236 4] Proximity Domain : 00000000
[0F0h 0240 16] Region Type GUID :
66F0D379-B4F3-4074-AC43-0D3318B78CDB
[100h 0256 8] Address Range Base : 0000004480000000
[108h 0264 8] Address Range Length : 0000000400000000
[110h 0272 8] Memory Map Attribute : 0000000000008008
...so Linux is being handed an NFIT with 2 regions. So the 4.16
interpretation looks correct to me. Are you sure you only changed
kernel versions and did not also do a BIOS update? If not, the 4.7
result looks like a bug for this NFIT.
6 months, 4 weeks
[PATCH v6 0/2] Renovate memcpy_mcsafe with copy_mc_to_{user, kernel}
by Dan Williams
No changes since v5 [1], just rebased to v5.8-rc1. There have been no
comments since that posting back at the end of May either; I will continue
to re-post weekly, as I am otherwise at a loss for what else to do to move
this forward. Should it go through Andrew since it's across PPC and x86?
Thanks again to Michael for the PPC acks.
[1]: http://lore.kernel.org/r/159062136234.2192412.7285856919306307817.stgit@d...
---
The primary motivation to go touch memcpy_mcsafe() is that the existing
benefit of doing slow "handle with care" copies is obviated on newer
CPUs. With that concern lifted it also obviates the need to continue to
update the MCA-recovery capability detection code currently gated by
"mcsafe_key". Now the old "mcsafe_key" opt-in to perform the copy with
concerns for recovery fragility can instead be made an opt-out from the
default fast copy implementation (enable_copy_mc_fragile()).
The discussion with Linus on the first iteration of this patch
identified that memcpy_mcsafe() was misnamed relative to its usage. The
new names copy_mc_to_user() and copy_mc_to_kernel() clearly indicate the
intended use case and lets the architecture organize the implementation
accordingly.
For both powerpc and x86 a copy_mc_generic() implementation is added as
the backend for these interfaces.
Patches are relative to v5.8-rc1. It has a recent build success
notification from the kbuild robot and is passing local nvdimm tests.
---
Dan Williams (2):
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user,kernel}()
x86/copy_mc: Introduce copy_mc_generic()
arch/powerpc/Kconfig | 2
arch/powerpc/include/asm/string.h | 2
arch/powerpc/include/asm/uaccess.h | 40 +++--
arch/powerpc/lib/Makefile | 2
arch/powerpc/lib/copy_mc_64.S | 4
arch/x86/Kconfig | 2
arch/x86/Kconfig.debug | 2
arch/x86/include/asm/copy_mc_test.h | 75 +++++++++
arch/x86/include/asm/mcsafe_test.h | 75 ---------
arch/x86/include/asm/string_64.h | 32 ----
arch/x86/include/asm/uaccess.h | 21 +++
arch/x86/include/asm/uaccess_64.h | 20 --
arch/x86/kernel/cpu/mce/core.c | 8 -
arch/x86/kernel/quirks.c | 9 -
arch/x86/lib/Makefile | 1
arch/x86/lib/copy_mc.c | 64 ++++++++
arch/x86/lib/copy_mc_64.S | 165 ++++++++++++++++++++
arch/x86/lib/memcpy_64.S | 115 --------------
arch/x86/lib/usercopy_64.c | 21 ---
drivers/md/dm-writecache.c | 15 +-
drivers/nvdimm/claim.c | 2
drivers/nvdimm/pmem.c | 6 -
include/linux/string.h | 9 -
include/linux/uaccess.h | 9 +
include/linux/uio.h | 10 +
lib/Kconfig | 7 +
lib/iov_iter.c | 43 +++--
tools/arch/x86/include/asm/mcsafe_test.h | 13 --
tools/arch/x86/lib/memcpy_64.S | 115 --------------
tools/objtool/check.c | 5 -
tools/perf/bench/Build | 1
tools/perf/bench/mem-memcpy-x86-64-lib.c | 24 ---
tools/testing/nvdimm/test/nfit.c | 48 +++---
.../testing/selftests/powerpc/copyloops/.gitignore | 2
tools/testing/selftests/powerpc/copyloops/Makefile | 6 -
.../selftests/powerpc/copyloops/copy_mc_64.S | 1
.../selftests/powerpc/copyloops/memcpy_mcsafe_64.S | 1
37 files changed, 451 insertions(+), 526 deletions(-)
rename arch/powerpc/lib/{memcpy_mcsafe_64.S => copy_mc_64.S} (98%)
create mode 100644 arch/x86/include/asm/copy_mc_test.h
delete mode 100644 arch/x86/include/asm/mcsafe_test.h
create mode 100644 arch/x86/lib/copy_mc.c
create mode 100644 arch/x86/lib/copy_mc_64.S
delete mode 100644 tools/arch/x86/include/asm/mcsafe_test.h
delete mode 100644 tools/perf/bench/mem-memcpy-x86-64-lib.c
create mode 120000 tools/testing/selftests/powerpc/copyloops/copy_mc_64.S
delete mode 120000 tools/testing/selftests/powerpc/copyloops/memcpy_mcsafe_64.S
base-commit: b3a9e3b9622ae10064826dccb4f7a52bd88c7407
6 months, 4 weeks
Re: [GIT PULL] General notification queue and key notifications
by Williams, Dan J
Hi David,
On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> Date: Tue, 02 Jun 2020 16:51:44 +0100
>
> Hi Linus,
>
> Can you pull this, please? It adds a general notification queue
> concept
> and adds an event source for keys/keyrings, such as linking and
> unlinking
> keys and changing their attributes.
[..]
This commit:
> keys: Make the KEY_NEED_* perms an enum rather than a mask
...upstream as:
8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
...triggers a regression in the libnvdimm unit test that exercises the
encrypted keys used to store nvdimm passphrases. It results in the
below warning.
---
WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140
Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p
ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E)
CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+ #155
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:key_task_permission+0xd3/0x140
Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06
RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c
RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01
RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0
R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840
R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540
FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0
Call Trace:
lookup_user_key+0xeb/0x6b0
? vsscanf+0x3df/0x840
? key_validate+0x50/0x50
? key_default_cmp+0x20/0x20
nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
nvdimm_security_store+0x67d/0xb20 [libnvdimm]
security_store+0x67/0x1a0 [libnvdimm]
kernfs_fop_write+0xcf/0x1c0
vfs_write+0xde/0x1d0
ksys_write+0x68/0xe0
do_syscall_64+0x5c/0xa0
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7f863c624547
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547
RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005
RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0
R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0
R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0
irq event stamp: 36976
hardirqs last enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90
hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466
softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0
6 months, 4 weeks
[PATCH AUTOSEL 5.7 26/28] nvdimm/region: always show the 'align' attribute
by Sasha Levin
From: Vishal Verma <vishal.l.verma(a)intel.com>
[ Upstream commit 543094e19c82b5d171e139d09a1a3ea0a7361117 ]
It is possible that a platform that is capable of 'namespace labels'
comes up without the labels properly initialized. In this case, the
region's 'align' attribute is hidden. However, once the user does
initialize the labels, the 'align' attribute still stays hidden, which is
unexpected.
The sysfs_update_group() API is meant to address this, and could be
called during region probe, but it has entanglements with the device
'lockdep_mutex'. Therefore, simply make the 'align' attribute always
visible. It doesn't matter what it says for label-less namespaces, since
it is not possible to change their allocation anyway.
Suggested-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Link: https://lore.kernel.org/r/20200520225026.29426-1-vishal.l.verma@intel.com
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/nvdimm/region_devs.c | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ccbb5b43b8b2c..4502f9c4708d0 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -679,18 +679,8 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
return a->mode;
}
- if (a == &dev_attr_align.attr) {
- int i;
-
- for (i = 0; i < nd_region->ndr_mappings; i++) {
- struct nd_mapping *nd_mapping = &nd_region->mapping[i];
- struct nvdimm *nvdimm = nd_mapping->nvdimm;
-
- if (test_bit(NDD_LABELING, &nvdimm->flags))
- return a->mode;
- }
- return 0;
- }
+ if (a == &dev_attr_align.attr)
+ return a->mode;
if (a != &dev_attr_set_cookie.attr
&& a != &dev_attr_available_size.attr)
--
2.25.1
6 months, 4 weeks
[GIT PULL] libnvdimm for v5.8-rc2
by Dan Williams
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-5.8-rc2
...to receive a feature (papr_scm health retrieval) and a fix (sysfs
attribute visibility) for v5.8.
Vaibhav explains in the merge commit below why missing v5.8 would be
painful and I agreed to try a -rc2 pull because only cosmetics kept
this out of -rc1 and his initial versions were posted in more than
enough time for v5.8 consideration.
===
These patches are tied to specific features that were committed to
customers in upcoming distros releases (RHEL and SLES) whose time-lines
are tied to 5.8 kernel release.
Being able to track the health of an nvdimm is critical for our
customers that are running workloads leveraging papr-scm nvdimms.
Missing the 5.8 kernel would mean missing the distro timelines and
shifting forward the availability of this feature in distro kernels by
at least 6 months.
===
I notice that these do not have an ack from Michael, but I had been
assuming that he was deferring this to a libnvdimm subsystem decision
ever since v7 back at the end of May where he said "I don't have
strong opinions about the user API, it's really up to the nvdimm
folks." [1]
This pull request includes v13 of the papr_scm set, and it looks good to me.
Please consider pulling, I would not normally have broached asking for
this exception, but Vaibhav made sure to have less than 24 hour turn
around on all final review comments.
This has been in -next all week with no reported issues.
[1]: http://lore.kernel.org/r/875zcigafk.fsf@mpe.ellerman.id.au
---
The following changes since commit b3a9e3b9622ae10064826dccb4f7a52bd88c7407:
Linux 5.8-rc1 (2020-06-14 12:45:04 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-5.8-rc2
for you to fetch changes up to 9df24eaef86f5d5cb38c77eaa1cfa3eec09ebfe8:
Merge branch 'for-5.8/papr_scm' into libnvdimm-for-next (2020-06-19
14:18:51 -0700)
----------------------------------------------------------------
libnvdimm for 5.8-rc2
- Fix the visibility of the region 'align' attribute. The new unit tests
for region alignment handling caught a corner case where the alignment
cannot be specified if the region is converted from static to dynamic
provisioning at runtime.
- Add support for device health retrieval for the persistent memory
supported by the papr_scm driver. This includes both the standard
sysfs "health flags" that the nfit persistent memory driver publishes
and a mechanism for the ndctl tool to retrieve a health-command payload.
----------------------------------------------------------------
Dan Williams (1):
Merge branch 'for-5.8/papr_scm' into libnvdimm-for-next
Vaibhav Jain (6):
powerpc: Document details on H_SCM_HEALTH hcall
seq_buf: Export seq_buf_printf
powerpc/papr_scm: Fetch nvdimm health information from PHYP
powerpc/papr_scm: Improve error logging and handling papr_scm_ndctl()
ndctl/papr_scm,uapi: Add support for PAPR nvdimm specific methods
powerpc/papr_scm: Implement support for PAPR_PDSM_HEALTH
Vishal Verma (1):
nvdimm/region: always show the 'align' attribute
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 ++
Documentation/powerpc/papr_hcalls.rst | 46 ++-
arch/powerpc/include/uapi/asm/papr_pdsm.h | 132 ++++++++
arch/powerpc/platforms/pseries/papr_scm.c | 420 +++++++++++++++++++++++++-
drivers/nvdimm/region_devs.c | 14 +-
include/uapi/linux/ndctl.h | 1 +
lib/seq_buf.c | 1 +
7 files changed, 618 insertions(+), 23 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-papr-pmem
create mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
7 months