[x86/copy_mc] fb406088ce: fio.read_iops -55.3% regression
by kernel test robot
Greetings,
FYI, we noticed a -55.3% regression of fio.read_iops due to commit:
commit: fb406088ce0e36122cff0ffeed823023074c7dc6 ("x86/copy_mc: Introduce copy_mc_generic()")
https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git for-5.9/copy_mc
in testcase: fio-basic
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 256G memory
with following parameters:
disk: 2pmem
fs: xfs
mount_option: dax
runtime: 200s
nr_task: 50%
time_based: tb
rw: randread
bs: 2M
ioengine: sync
test_size: 200G
cpufreq_governor: performance
ucode: 0x5002f01
test-description: Fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user.
test-url: https://github.com/axboe/fio
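For reference, the job parameters above translate to an fio job section roughly like the following (a sketch only; the job.yaml attached to this email is authoritative, and the mount point and numjobs value are illustrative):

```ini
[randread-2M]
bs=2M
rw=randread
ioengine=sync
time_based
runtime=200
size=200G
directory=/mnt/pmem0   ; illustrative: one of the two pmem devices, xfs mounted with -o dax
numjobs=48             ; nr_task=50% of the 96 CPU threads
```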
If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <rong.a.chen(a)intel.com>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
=========================================================================================
bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/mount_option/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based/ucode:
2M/gcc-9/performance/2pmem/xfs/sync/x86_64-rhel-8.3/dax/50%/debian-10.4-x86_64-20200603.cgz/200s/randread/lkp-csl-2sp6/200G/fio-basic/tb/0x5002f01
commit:
0a78de3d4b ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
fb406088ce ("x86/copy_mc: Introduce copy_mc_generic()")
0a78de3d4b7b1b80 fb406088ce0e36122cff0ffeed8
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.44 ± 31% -0.3 0.17 ± 50% fio.latency_1000us%
0.02 ± 65% +1.3 1.31 ± 20% fio.latency_10ms%
0.00 ±173% +0.0 0.03 ± 59% fio.latency_20ms%
97.37 -96.7 0.72 ± 74% fio.latency_2ms%
0.99 ± 10% +96.5 97.44 fio.latency_4ms%
0.56 ± 58% -0.5 0.03 ±100% fio.latency_500us%
74412 -55.3% 33285 fio.read_bw_MBps
1376256 +118.5% 3006464 fio.read_clat_90%_us
1400832 +118.1% 3055616 fio.read_clat_95%_us
1980416 ± 12% +160.6% 5160960 ± 6% fio.read_clat_99%_us
1282194 +124.0% 2872613 fio.read_clat_mean_us
207458 ± 6% +127.8% 472559 ± 7% fio.read_clat_stddev
37206 -55.3% 16642 fio.read_iops
80.95 ± 2% -38.8% 49.50 ± 12% fio.time.user_time
21418 -1.3% 21134 fio.time.voluntary_context_switches
7441285 -55.3% 3328617 fio.workload
30156 ± 4% -24.0% 22920 ± 6% cpuidle.C1.usage
1675 -4.2% 1604 vmstat.system.cs
0.11 ± 3% +0.0 0.14 ± 4% mpstat.cpu.all.soft%
0.51 ± 7% -0.2 0.32 ± 10% mpstat.cpu.all.usr%
114802 -1.7% 112839 proc-vmstat.nr_shmem
20196 ± 5% -21.1% 15925 ± 10% proc-vmstat.pgactivate
63.47 ± 11% -22.2 41.28 ± 10% perf-profile.calltrace.cycles-pp.copy_mc_fragile.copy_mc_to_user.copyout_mc._copy_mc_to_iter.dax_iomap_actor
0.00 +6.6 6.59 ± 27% perf-profile.calltrace.cycles-pp.copy_mc_generic.copy_mc_to_user.copyout_mc._copy_mc_to_iter.dax_iomap_actor
0.00 +31.2 31.16 ± 7% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.copy_mc_generic.copy_mc_to_user.copyout_mc._copy_mc_to_iter
63.59 ± 11% -22.2 41.37 ± 10% perf-profile.children.cycles-pp.copy_mc_fragile
1.54 ± 73% -1.0 0.58 ± 44% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
1.51 ± 74% -1.0 0.56 ± 42% perf-profile.children.cycles-pp.hrtimer_interrupt
0.30 ±112% -0.2 0.05 ± 60% perf-profile.children.cycles-pp.clockevents_program_event
2.07 ± 79% +14.4 16.48 ± 9% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +22.0 22.04 ± 4% perf-profile.children.cycles-pp.copy_mc_generic
62.62 ± 11% -21.6 40.98 ± 10% perf-profile.self.cycles-pp.copy_mc_fragile
0.31 ±123% -0.3 0.03 ±100% perf-profile.self.cycles-pp.ktime_get
0.00 +21.7 21.73 ± 4% perf-profile.self.cycles-pp.copy_mc_generic
188.70 ± 17% +223.4% 610.19 ± 64% sched_debug.cfs_rq:/.exec_clock.min
33475 ± 4% -18.2% 27395 ± 5% sched_debug.cfs_rq:/.exec_clock.stddev
36013 ± 3% -15.7% 30367 ± 3% sched_debug.cfs_rq:/.min_vruntime.stddev
36013 ± 3% -15.7% 30372 ± 3% sched_debug.cfs_rq:/.spread0.stddev
9.33 ± 2% +38.6% 12.94 sched_debug.cpu.clock.stddev
3468 ± 30% -26.4% 2552 ± 7% sched_debug.cpu.nr_switches.stddev
366.94 ± 5% +23.2% 451.94 ± 7% sched_debug.cpu.sched_count.min
2255 ± 8% -24.3% 1706 ± 9% sched_debug.cpu.sched_count.stddev
1145 ± 8% -22.8% 884.59 ± 8% sched_debug.cpu.sched_goidle.stddev
9875 ± 7% -47.4% 5196 ± 18% sched_debug.cpu.ttwu_count.max
1295 ± 5% -32.5% 874.15 ± 11% sched_debug.cpu.ttwu_count.stddev
734.16 ± 6% -16.5% 613.31 ± 5% sched_debug.cpu.ttwu_local.stddev
7840 ± 81% +126.3% 17742 ± 39% softirqs.CPU10.SCHED
24923 ± 11% -36.6% 15811 ± 29% softirqs.CPU12.SCHED
24257 ± 17% -51.8% 11689 ± 48% softirqs.CPU18.SCHED
26470 ± 2% -63.9% 9562 ± 59% softirqs.CPU2.SCHED
23842 ± 20% -50.6% 11786 ± 57% softirqs.CPU20.SCHED
20162 ± 36% -67.9% 6472 ± 74% softirqs.CPU21.SCHED
26425 ± 2% -31.0% 18224 ± 50% softirqs.CPU23.SCHED
13228 ±107% -67.5% 4294 ± 14% softirqs.CPU50.RCU
94827 ± 31% -23.0% 73031 ± 5% softirqs.CPU6.TIMER
5223 ± 62% +162.0% 13683 ± 35% softirqs.CPU60.SCHED
5924 ± 92% +148.7% 14736 ± 35% softirqs.CPU65.SCHED
5072 ± 85% +265.7% 18553 ± 31% softirqs.CPU66.SCHED
6035 ± 87% +191.7% 17605 ± 38% softirqs.CPU68.SCHED
8816 ± 82% +169.1% 23722 ± 18% softirqs.CPU69.SCHED
5842 ± 41% -35.7% 3758 ± 4% softirqs.CPU81.RCU
53.50 ± 31% +77.1% 94.75 ± 36% interrupts.CPU12.RES:Rescheduling_interrupts
3554 ± 33% +42.5% 5066 ± 24% interrupts.CPU18.NMI:Non-maskable_interrupts
3554 ± 33% +42.5% 5066 ± 24% interrupts.CPU18.PMI:Performance_monitoring_interrupts
40.25 ± 91% +243.5% 138.25 ± 11% interrupts.CPU18.RES:Rescheduling_interrupts
7150 ± 11% -42.4% 4121 ± 31% interrupts.CPU19.NMI:Non-maskable_interrupts
7150 ± 11% -42.4% 4121 ± 31% interrupts.CPU19.PMI:Performance_monitoring_interrupts
29.50 ± 48% +422.0% 154.00 ± 20% interrupts.CPU2.RES:Rescheduling_interrupts
69.00 ± 64% +153.6% 175.00 ± 4% interrupts.CPU21.RES:Rescheduling_interrupts
437.50 ± 2% +58.6% 694.00 ± 19% interrupts.CPU22.CAL:Function_call_interrupts
42.50 ± 38% +364.7% 197.50 ± 85% interrupts.CPU35.TLB:TLB_shootdowns
7586 -49.5% 3828 ± 41% interrupts.CPU40.NMI:Non-maskable_interrupts
7586 -49.5% 3828 ± 41% interrupts.CPU40.PMI:Performance_monitoring_interrupts
7126 ± 11% -48.2% 3692 ± 35% interrupts.CPU44.NMI:Non-maskable_interrupts
7126 ± 11% -48.2% 3692 ± 35% interrupts.CPU44.PMI:Performance_monitoring_interrupts
157.75 ± 12% -23.1% 121.25 ± 33% interrupts.CPU46.RES:Rescheduling_interrupts
7205 ± 9% -19.7% 5788 interrupts.CPU47.NMI:Non-maskable_interrupts
7205 ± 9% -19.7% 5788 interrupts.CPU47.PMI:Performance_monitoring_interrupts
7615 -42.5% 4379 ± 34% interrupts.CPU48.NMI:Non-maskable_interrupts
7615 -42.5% 4379 ± 34% interrupts.CPU48.PMI:Performance_monitoring_interrupts
6642 ± 25% -53.8% 3072 ± 11% interrupts.CPU50.NMI:Non-maskable_interrupts
6642 ± 25% -53.8% 3072 ± 11% interrupts.CPU50.PMI:Performance_monitoring_interrupts
182.00 ± 5% -47.8% 95.00 ± 44% interrupts.CPU50.RES:Rescheduling_interrupts
80.75 ± 17% +40.9% 113.75 ± 6% interrupts.CPU51.RES:Rescheduling_interrupts
7619 -47.7% 3981 ± 27% interrupts.CPU57.NMI:Non-maskable_interrupts
7619 -47.7% 3981 ± 27% interrupts.CPU57.PMI:Performance_monitoring_interrupts
164.00 ± 12% -20.9% 129.75 ± 23% interrupts.CPU57.RES:Rescheduling_interrupts
7139 ± 11% -51.2% 3483 ± 53% interrupts.CPU62.NMI:Non-maskable_interrupts
7139 ± 11% -51.2% 3483 ± 53% interrupts.CPU62.PMI:Performance_monitoring_interrupts
6644 ± 24% -54.1% 3048 ± 5% interrupts.CPU66.NMI:Non-maskable_interrupts
6644 ± 24% -54.1% 3048 ± 5% interrupts.CPU66.PMI:Performance_monitoring_interrupts
174.25 ± 19% -51.5% 84.50 ± 53% interrupts.CPU66.RES:Rescheduling_interrupts
179.00 ± 3% -49.2% 91.00 ± 49% interrupts.CPU68.RES:Rescheduling_interrupts
6938 ± 11% -53.1% 3255 ± 48% interrupts.CPU69.NMI:Non-maskable_interrupts
6938 ± 11% -53.1% 3255 ± 48% interrupts.CPU69.PMI:Performance_monitoring_interrupts
6530 ± 16% -45.9% 3531 ± 43% interrupts.CPU72.NMI:Non-maskable_interrupts
6530 ± 16% -45.9% 3531 ± 43% interrupts.CPU72.PMI:Performance_monitoring_interrupts
5519 ± 27% -33.9% 3645 ± 31% interrupts.CPU91.NMI:Non-maskable_interrupts
5519 ± 27% -33.9% 3645 ± 31% interrupts.CPU91.PMI:Performance_monitoring_interrupts
518479 ± 11% -20.2% 413954 ± 15% interrupts.NMI:Non-maskable_interrupts
518479 ± 11% -20.2% 413954 ± 15% interrupts.PMI:Performance_monitoring_interrupts
42.36 +68.2% 71.24 perf-stat.i.MPKI
9.977e+09 ± 2% -54.9% 4.503e+09 perf-stat.i.branch-instructions
0.05 ± 4% +0.0 0.08 perf-stat.i.branch-miss-rate%
3907318 -12.5% 3419750 perf-stat.i.branch-misses
67.59 +10.0 77.64 perf-stat.i.cache-miss-rate%
1.722e+09 -13.1% 1.497e+09 perf-stat.i.cache-misses
2.539e+09 ± 2% -24.4% 1.92e+09 perf-stat.i.cache-references
1658 -5.6% 1565 perf-stat.i.context-switches
2.27 +120.1% 5.00 perf-stat.i.cpi
98.73 -1.4% 97.38 perf-stat.i.cpu-migrations
87.68 +12.0% 98.23 perf-stat.i.cycles-between-cache-misses
1.003e+10 -54.9% 4.525e+09 perf-stat.i.dTLB-loads
0.00 ± 14% +0.0 0.00 ± 10% perf-stat.i.dTLB-store-miss-rate%
9.922e+09 ± 2% -55.1% 4.454e+09 perf-stat.i.dTLB-stores
45.47 +4.5 49.94 ± 2% perf-stat.i.iTLB-load-miss-rate%
2640557 ± 2% -13.6% 2280694 ± 3% perf-stat.i.iTLB-load-misses
3175197 -28.0% 2286190 perf-stat.i.iTLB-loads
5.964e+10 ± 2% -55.0% 2.682e+10 perf-stat.i.instructions
22561 -47.8% 11788 ± 4% perf-stat.i.instructions-per-iTLB-miss
0.44 -54.4% 0.20 perf-stat.i.ipc
339.51 -50.7% 167.36 perf-stat.i.metric.M/sec
1.352e+08 ± 10% +39.4% 1.885e+08 ± 10% perf-stat.i.node-load-misses
1.1e+08 ± 10% +70.1% 1.871e+08 ± 10% perf-stat.i.node-loads
2.496e+08 +15.5% 2.884e+08 perf-stat.i.node-stores
42.58 +68.2% 71.63 perf-stat.overall.MPKI
0.04 +0.0 0.08 perf-stat.overall.branch-miss-rate%
67.82 +10.1 77.94 perf-stat.overall.cache-miss-rate%
2.26 +121.0% 5.00 perf-stat.overall.cpi
78.40 +14.3% 89.61 perf-stat.overall.cycles-between-cache-misses
0.00 ± 21% +0.0 0.00 ± 10% perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 18% +0.0 0.00 ± 16% perf-stat.overall.dTLB-store-miss-rate%
45.42 +4.5 49.93 ± 2% perf-stat.overall.iTLB-load-miss-rate%
22604 -47.8% 11788 ± 4% perf-stat.overall.instructions-per-iTLB-miss
0.44 -54.7% 0.20 perf-stat.overall.ipc
1588901 +1.9% 1619160 perf-stat.overall.path-length
9.822e+09 -54.4% 4.481e+09 perf-stat.ps.branch-instructions
3831180 -11.7% 3383789 perf-stat.ps.branch-misses
1.695e+09 -12.1% 1.49e+09 perf-stat.ps.cache-misses
2.5e+09 -23.5% 1.911e+09 perf-stat.ps.cache-references
1615 -4.2% 1548 perf-stat.ps.context-switches
9.878e+09 -54.4% 4.502e+09 perf-stat.ps.dTLB-loads
9.768e+09 -54.6% 4.433e+09 perf-stat.ps.dTLB-stores
2598003 ± 2% -12.7% 2267198 ± 3% perf-stat.ps.iTLB-load-misses
3121604 -27.2% 2272234 perf-stat.ps.iTLB-loads
5.871e+10 -54.5% 2.668e+10 perf-stat.ps.instructions
1.331e+08 ± 9% +40.9% 1.876e+08 ± 10% perf-stat.ps.node-load-misses
1.081e+08 ± 10% +72.2% 1.862e+08 ± 10% perf-stat.ps.node-loads
2.452e+08 +17.0% 2.868e+08 perf-stat.ps.node-stores
1.182e+13 -54.4% 5.39e+12 perf-stat.total.instructions
fio.read_bw_MBps
80000 +-------------------------------------------------------------------+
75000 |.. .+..+.+..+.. .+..+..+.+..+.+..+.. .+.. .+.. .+.. .+. .+.+..|
| + + + +.+. + +. +. |
70000 |-+ |
65000 |-+ |
| |
60000 |-+ |
55000 |-+ |
50000 |-+ |
| |
45000 |-+ |
40000 |-+ |
| |
35000 |-+O O O O O O O O O O O O O O O O |
30000 +-------------------------------------------------------------------+
fio.read_iops
40000 +-------------------------------------------------------------------+
|.. .+..+.+..+.. .+..+..+.+..+.+..+.. .+.. .+.. .+.. .+. .+.+..|
| + + + +.+. + +. +. |
35000 |-+ |
| |
| |
30000 |-+ |
| |
25000 |-+ |
| |
| |
20000 |-+ |
| |
| O O O O O O O O O O O O O O O O |
15000 +-------------------------------------------------------------------+
fio.read_clat_mean_us
3e+06 +-----------------------------------------------------------------+
| O O O O O O O O O O O O O O O O |
2.8e+06 |-+ |
2.6e+06 |-+ |
| |
2.4e+06 |-+ |
2.2e+06 |-+ |
| |
2e+06 |-+ |
1.8e+06 |-+ |
| |
1.6e+06 |-+ |
1.4e+06 |-+ |
| .+.+..+. .+.+..+.+..+.+.. .+..+..+.+..+.+..+.+.. .+.. .+..+.+..|
1.2e+06 +-----------------------------------------------------------------+
fio.read_clat_90__us
3.2e+06 +-----------------------------------------------------------------+
3e+06 |-+O O O O O O O O O O O O O O O O |
| |
2.8e+06 |-+ |
2.6e+06 |-+ |
| |
2.4e+06 |-+ |
2.2e+06 |-+ |
2e+06 |-+ |
| |
1.8e+06 |-+ |
1.6e+06 |-+ |
| |
1.4e+06 |..+.+..+.+..+.+..+.+..+.+..+.+..+..+.+..+.+..+.+..+.+..+.+..+.+..|
1.2e+06 +-----------------------------------------------------------------+
fio.read_clat_95__us
3.2e+06 +-----------------------------------------------------------------+
3e+06 |-+O O O O O O O O O O O O O O O O |
| |
2.8e+06 |-+ |
2.6e+06 |-+ |
| |
2.4e+06 |-+ |
2.2e+06 |-+ |
2e+06 |-+ |
| |
1.8e+06 |-+ |
1.6e+06 |-+ |
| .+.. .+.. .+. .+. .+.+..+.+..+.+.. |
1.4e+06 |..+.+..+.+..+.+..+ +.+..+ +. +. +. +.+..|
1.2e+06 +-----------------------------------------------------------------+
fio.read_clat_99__us
6.5e+06 +-----------------------------------------------------------------+
6e+06 |-+ O |
| |
5.5e+06 |-+ O O O O O |
5e+06 |-+O O O O O |
| O O O O |
4.5e+06 |-+ O |
4e+06 |-+ |
3.5e+06 |-+ |
| |
3e+06 |-+ |
2.5e+06 |-+ +.. .+..+.+.. .+.. .+.+.. +.+..+.+..+ |
| + + +.+..+ +. +. .. + .|
2e+06 |-.+.+..+ + +..+.+. |
1.5e+06 +-----------------------------------------------------------------+
fio.latency_2ms_
[plot data garbled: parent-commit samples sit at ~97%, fb406088ce samples at ~0.7%, per the fio.latency_2ms% row above]
fio.latency_4ms_
[plot data garbled: parent-commit samples sit at ~1%, fb406088ce samples at ~97%, per the fio.latency_4ms% row above]
fio.latency_10ms_
2.5 +---------------------------------------------------------------------+
| O |
| |
2 |-+ |
| O O |
| |
1.5 |-+ |
| O O O O O O O O O |
1 |-+ O O O O |
| |
| |
0.5 |-+ |
| |
| |
0 +---------------------------------------------------------------------+
fio.workload
8e+06 +-----------------------------------------------------------------+
7.5e+06 |.. .+..+.+..+. .+.+..+.+..+.+..+.. .+.. .+. .+. .+. .+.+..|
| + +. + +.+. +. +. +. |
7e+06 |-+ |
6.5e+06 |-+ |
| |
6e+06 |-+ |
5.5e+06 |-+ |
5e+06 |-+ |
| |
4.5e+06 |-+ |
4e+06 |-+ |
| |
3.5e+06 |-+O O O O O O O O O O O O O O O O |
3e+06 +-----------------------------------------------------------------+
fio.time.user_time
95 +----------------------------------------------------------------------+
90 |-+ + |
| +.. + + .+.. +.. + .+.. .+.. |
85 |.. + +.. + +. + .+..+.+.. .. + .+. + +.. .+. |
80 |-++ + +..+ +. + +. +. +..|
75 |-+ |
70 |-+ |
| |
65 |-+ |
60 |-+ O O |
55 |-+ |
50 |-+O O O O O |
| O O O O O |
45 |-+ O O O O |
40 +----------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Rong Chen
[PATCH v3 00/23] device-dax: Support sub-dividing soft-reserved
ranges
by Dan Williams
Changes since v2 [1]:
- Rebase on next/master to resolve conflicts with pending mem-hotplug
and memremap_pages() changes in -mm
- Drop attempt at a generic phys_to_target_node() implementation and
just follow the default fallback approach taken with
memory_add_physaddr_to_nid() (Mike)
- Fix test_hmm and other compilation fixups (Ralph)
- Integrate Joao's extensions to the device-dax sub-division interface
(per-device align, user-directed extent allocation). (Joao)
[1]: http://lore.kernel.org/r/159457116473.754248.7879464730875147365.stgit@dw...
---
Merge notes:
Andrew, this series is rebased on today's next/master to resolve
conflicts with some pending patches in -mm. I'd like to take it through
your tree given the intersections with memremap_pages() and memory
hotplug. If at all possible I'd like to see it in v5.10, but I realize
time is short. Outside of the Intel-identified use cases for this, Joao
has identified a use case for Oracle as well.
I would have sent this earlier save for the fact I am mostly offline
tending to a newborn these days. Vishal has stepped up to take on care
and feeding of this patchset if additional review / integration fixups
are needed.
The one piece of test feedback this still wants is from Justin
(justin.he(a)arm.com): whether this lights up dax_kmem, and now dax_hmem,
for him on arm64.
Otherwise, Joao has written unit tests for this in his enabling of the
daxctl userspace utility [2].
---
Cover:
The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.
In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.
The motivations for this facility are:
1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.
2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.
3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.
[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@...
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@...
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com
---
Dan Williams (19):
x86/numa: Cleanup configuration dependent command-line options
x86/numa: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
mm/memory_hotplug: Introduce default phys_to_target_node() implementation
ACPI: HMAT: Attach a device for each soft-reserved range
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce 'seed' devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices
Joao Martins (4):
device-dax: Make align a per-device property
device-dax: Add an 'align' attribute
dax/hmem: Introduce dax_hmem.region_idle parameter
device-dax: Add a range mapping allocation attribute
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
arch/x86/include/asm/numa.h | 8
arch/x86/kernel/e820.c | 16
arch/x86/mm/numa.c | 11
arch/x86/mm/numa_emulation.c | 3
arch/x86/xen/enlighten_pv.c | 2
drivers/acpi/numa/hmat.c | 76 --
drivers/acpi/numa/srat.c | 9
drivers/base/core.c | 2
drivers/dax/Kconfig | 4
drivers/dax/Makefile | 3
drivers/dax/bus.c | 1055 ++++++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 +
drivers/dax/dax-private.h | 40 +
drivers/dax/device.c | 132 ++--
drivers/dax/hmem.c | 56 --
drivers/dax/hmem/Makefile | 6
drivers/dax/hmem/device.c | 100 +++
drivers/dax/hmem/hmem.c | 65 ++
drivers/dax/kmem.c | 199 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 -
drivers/firmware/efi/x86_fake_mem.c | 12
drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 -
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/acpi/acpi_numa.h | 14
include/linux/dax.h | 8
include/linux/memory_hotplug.h | 5
include/linux/memremap.h | 11
include/linux/range.h | 6
kernel/resource.c | 11
lib/test_hmm.c | 15
mm/memory_hotplug.c | 10
mm/memremap.c | 299 +++++----
tools/testing/nvdimm/dax-dev.c | 22 -
tools/testing/nvdimm/test/iomap.c | 2
42 files changed, 1810 insertions(+), 588 deletions(-)
delete mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
create mode 100644 drivers/dax/hmem/hmem.c
base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26
[PATCH 1/2] libnvdimm/security: 'security' attr never show 'overwrite' state
by Jane Chu
Since commit d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen'
attribute"), if you issue
# ndctl sanitize-dimm nmem0 --overwrite
and then immediately check the 'security' attribute,
# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
unlocked
the attribute reads 'unlocked', and it stays 'unlocked' throughout the
entire overwrite operation, never changing. That's because
'nvdimm->sec.flags' is a bitmap with both the 'overwrite' and 'unlocked'
bits set, while security_show() checks the mutually exclusive bits first
and the 'overwrite' bit last. The order should be reversed.
The commit also introduced a typo: in one place, an 'nvdimm->sec.ext_state'
assignment was replaced with an 'nvdimm->sec.flags' assignment for the
NVDIMM_MASTER type, where 'nvdimm->sec.ext_flags' was intended.
Cc: Dan Williams <dan.j.williams(a)intel.com>
Fixes: d78c620a2e82 ("libnvdimm/security: Introduce a 'frozen' attribute")
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
---
drivers/nvdimm/dimm_devs.c | 4 ++--
drivers/nvdimm/security.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b7b77e8..5d72026 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -363,14 +363,14 @@ __weak ssize_t security_show(struct device *dev,
{
struct nvdimm *nvdimm = to_nvdimm(dev);
+ if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
+ return sprintf(buf, "overwrite\n");
if (test_bit(NVDIMM_SECURITY_DISABLED, &nvdimm->sec.flags))
return sprintf(buf, "disabled\n");
if (test_bit(NVDIMM_SECURITY_UNLOCKED, &nvdimm->sec.flags))
return sprintf(buf, "unlocked\n");
if (test_bit(NVDIMM_SECURITY_LOCKED, &nvdimm->sec.flags))
return sprintf(buf, "locked\n");
- if (test_bit(NVDIMM_SECURITY_OVERWRITE, &nvdimm->sec.flags))
- return sprintf(buf, "overwrite\n");
return -ENOTTY;
}
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4cef69b..8f3971c 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -457,7 +457,7 @@ void __nvdimm_security_overwrite_query(struct nvdimm *nvdimm)
clear_bit(NDD_WORK_PENDING, &nvdimm->flags);
put_device(&nvdimm->dev);
nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_USER);
- nvdimm->sec.flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
+ nvdimm->sec.ext_flags = nvdimm_security_flags(nvdimm, NVDIMM_MASTER);
}
void nvdimm_security_overwrite_query(struct work_struct *work)
--
1.8.3.1
[PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric
by Vaibhav Jain
Changes since v3[1]:
* Fixed a rebase issue pointed out by Aneesh in the first patch of the series.
[1] https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaibhav@linu...
---
This small patchset implements kernel-side support for reporting the
'life_used_percentage' metric in the NDCTL dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL-side changes, the output
should look like:
$ sudo ndctl list -DH
[
{
"dev":"nmem0",
"health":{
"health_state":"ok",
"life_used_percentage":0,
"shutdown_state":"clean"
}
}
]
PHYP supports the H_SCM_PERFORMANCE_STATS hcall, through which an LPAR can
fetch various performance stats, including the 'fuel_gauge' percentage for
an NVDIMM. The 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as a percentage, so 'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.
Structure of the patchset
=========================
The first patch implements the scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch the performance stats
catalogue. It also implements support for a 'perf_stats' sysfs
attribute that reports the full catalogue of performance stats
supported by PHYP.
The second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload with a new field
named 'dimm_fuel_gauge'.
Vaibhav Jain (2):
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
Documentation/ABI/testing/sysfs-bus-papr-pmem | 27 +++
arch/powerpc/include/uapi/asm/papr_pdsm.h | 9 +
arch/powerpc/platforms/pseries/papr_scm.c | 199 ++++++++++++++++++
3 files changed, 235 insertions(+)
--
2.26.2
[PATCH] ACPI: NFIT: Fix ARS zero-sized allocation
by Dan Williams
A pending commit in -next, "devres: handle zero size in devm_kmalloc()",
triggers a boot regression because the ARS implementation expects NULL
from a zero-sized allocation. Avoid the zero-sized allocation by skipping
ARS; otherwise the kernel crashes with the following signature when
dereferencing ZERO_SIZE_PTR.
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
RIP: 0010:__acpi_nfit_scrub+0x28a/0x350 [nfit]
[..]
Call Trace:
? acpi_nfit_query_poison+0x6a/0x180 [nfit]
acpi_nfit_scrub+0x36/0xb0 [nfit]
process_one_work+0x23c/0x580
worker_thread+0x50/0x3b0
In all other cases the implementation correctly aborts when NULL is
returned from devm_kzalloc() in ars_status_alloc().
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/acpi/nfit/core.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index fb775b967c52..26dd208a0d63 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -3334,7 +3334,7 @@ static void acpi_nfit_init_ars(struct acpi_nfit_desc *acpi_desc,
static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
{
struct nfit_spa *nfit_spa;
- int rc;
+ int rc, do_sched_ars = 0;
set_bit(ARS_VALID, &acpi_desc->scrub_flags);
list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
@@ -3346,7 +3346,7 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
}
}
- list_for_each_entry(nfit_spa, &acpi_desc->spas, list)
+ list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
switch (nfit_spa_type(nfit_spa->spa)) {
case NFIT_SPA_VOLATILE:
case NFIT_SPA_PM:
@@ -3354,6 +3354,13 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
rc = ars_register(acpi_desc, nfit_spa);
if (rc)
return rc;
+
+ /*
+ * Kick off background ARS if at least one
+ * region successfully registered ARS
+ */
+ if (!test_bit(ARS_FAILED, &nfit_spa->ars_state))
+ do_sched_ars++;
break;
case NFIT_SPA_BDW:
/* nothing to register */
@@ -3372,8 +3379,10 @@ static int acpi_nfit_register_regions(struct acpi_nfit_desc *acpi_desc)
/* don't register unknown regions */
break;
}
+ }
- sched_ars(acpi_desc);
+ if (do_sched_ars)
+ sched_ars(acpi_desc);
return 0;
}