Greetings,
FYI, we noticed a 3.8% improvement of will-it-scale.per_thread_ops due to commit:
commit: de87ae29269664b890e5323ff9649ab7990960cc ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
git://internal_merge_and_test_tree devel-catchup-201808280250
in testcase: will-it-scale
on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
with following parameters:
nr_task: 16
mode: thread
test: futex3
cpufreq_governor: performance
test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a thread-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
In addition, the commit also has a significant impact on the following test:
+------------------+----------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 4.3% improvement         |
| test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory |
| test parameters  | cpufreq_governor=performance                                         |
|                  | mode=thread                                                          |
|                  | nr_task=16                                                           |
|                  | test=getppid1                                                        |
+------------------+----------------------------------------------------------------------+
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml  # job file is attached in this email
        bin/lkp run job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/thread/16/debian-x86_64-2018-04-03.cgz/lkp-bdw-ep3d/futex3/will-it-scale
commit:
cba614f88d ("x86/entry/64: Use the TSS sp2 slot for rsp_scratch")
de87ae2926 ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
cba614f88dfdeadd de87ae29269664b890e5323ff9
---------------- --------------------------
%stddev %change %stddev
\ | \
3518673 +3.8% 3650732 will-it-scale.per_thread_ops
3182 -2.2% 3112 will-it-scale.time.system_time
1633 +4.3% 1703 will-it-scale.time.user_time
56298775 +3.8% 58411722 will-it-scale.workload
4744393 ±120% -71.5% 1351494 ± 2% cpuidle.C1.time
2328850 ± 6% +10.4% 2570720 ± 7% softirqs.TIMER
1717 ± 13% +29.9% 2231 ± 4% numa-meminfo.node0.PageTables
15164 ± 18% -62.4% 5695 ± 86% numa-meminfo.node0.Shmem
2576 ± 9% -20.8% 2041 ± 4% numa-meminfo.node1.PageTables
60960 ± 24% -36.5% 38712 ± 19% numa-numastat.node0.local_node
62186 ± 23% -31.4% 42639 ± 16% numa-numastat.node0.numa_hit
1226 ± 57% +220.2% 3926 ± 20% numa-numastat.node0.other_node
3544 ± 20% -76.3% 840.00 ± 95% numa-numastat.node1.other_node
429.25 ± 13% +29.9% 557.50 ± 4% numa-vmstat.node0.nr_page_table_pages
3794 ± 18% -62.5% 1423 ± 86% numa-vmstat.node0.nr_shmem
1373 ± 50% +194.5% 4044 ± 20% numa-vmstat.node0.numa_other
643.25 ± 9% -20.6% 510.75 ± 4% numa-vmstat.node1.nr_page_table_pages
650501 +2.2% 664689 proc-vmstat.numa_hit
645729 +2.2% 659920 proc-vmstat.numa_local
38.50 ± 81% +11564.9% 4491 ± 55% proc-vmstat.numa_pages_migrated
2461 ± 66% +622.7% 17786 ± 66% proc-vmstat.numa_pte_updates
699026 +2.3% 715240 proc-vmstat.pgalloc_normal
783179 +1.6% 795471 proc-vmstat.pgfault
683943 +1.9% 697109 proc-vmstat.pgfree
38.50 ± 81% +11564.9% 4491 ± 55% proc-vmstat.pgmigrate_success
     23.57 ± 3%     -23.6        0.00        perf-profile.calltrace.cycles-pp.__entry_trampoline_start
      0.00           +25.7       25.66 ± 4%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
     23.68 ± 3%     -23.7        0.00        perf-profile.children.cycles-pp.__entry_trampoline_start
      0.00            +1.1        1.14 ± 6%  perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
      0.00           +25.7       25.69 ± 4%  perf-profile.children.cycles-pp.entry_SYSCALL_64
     23.68 ± 3%     -23.7        0.00        perf-profile.self.cycles-pp.__entry_trampoline_start
      4.62 ± 6%      -1.7        2.95 ± 5%  perf-profile.self.cycles-pp.do_syscall_64
      2.49 ± 5%      -0.3        2.23 ± 5%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.00            +1.1        1.07 ± 5%  perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
      0.00           +25.7       25.69 ± 4%  perf-profile.self.cycles-pp.entry_SYSCALL_64
293.92 ± 22% -27.7% 212.58 ± 6% sched_debug.cfs_rq:/.load_avg.max
81.81 ± 8% -14.8% 69.67 ± 15% sched_debug.cfs_rq:/.load_avg.stddev
10489 ± 4% +16.9% 12261 ± 6% sched_debug.cfs_rq:/.min_vruntime.min
3.27 ± 16% -74.0% 0.85 ±101% sched_debug.cfs_rq:/.removed.util_avg.avg
69.08 ± 24% -72.9% 18.75 ±100% sched_debug.cfs_rq:/.removed.util_avg.max
     14.37 ± 19%    -73.8%       3.76 ±100%  sched_debug.cfs_rq:/.removed.util_avg.stddev
169390 ± 15% +41.4% 239515 ± 19% sched_debug.cpu.avg_idle.min
2349 ± 21% -29.7% 1651 ± 18% sched_debug.cpu.nr_load_updates.stddev
62893 ± 28% -30.1% 43970 ± 14% sched_debug.cpu.sched_count.max
14764 ± 17% -17.6% 12171 ± 10% sched_debug.cpu.sched_count.stddev
279.25 ± 6% -18.2% 228.54 ± 8% sched_debug.cpu.ttwu_count.min
4.42 -1.3 3.12 perf-stat.branch-miss-rate%
5.174e+10 -29.9% 3.624e+10 perf-stat.branch-misses
33184723 ± 4% +15.1% 38191250 ± 12% perf-stat.cache-misses
8.231e+08 +7.9% 8.883e+08 ± 4% perf-stat.cache-references
1.83 -2.2% 1.79 perf-stat.cpi
2.071e+12 +2.1% 2.114e+12 perf-stat.dTLB-loads
27.80 ± 2% +71.5 99.28 perf-stat.iTLB-load-miss-rate%
9.74e+09 ± 4% +94.9% 1.898e+10 perf-stat.iTLB-load-misses
2.528e+10 -99.5% 1.383e+08 ± 16% perf-stat.iTLB-loads
8.152e+12 +2.4% 8.345e+12 perf-stat.instructions
838.42 ± 4% -47.6% 439.73 perf-stat.instructions-per-iTLB-miss
0.55 +2.3% 0.56 perf-stat.ipc
762567 +1.5% 773691 perf-stat.minor-faults
4676629 ± 30% -55.2% 2092806 ± 4% perf-stat.node-stores
762578 +1.5% 773695 perf-stat.page-faults
144800 -1.3% 142864 perf-stat.path-length
will-it-scale.per_thread_ops
4e+06 +-+---------------------------------------------------------------+
O O O O O O O O O O O O O O O O O O O O O O O O O O |
3.5e+06 +-+.+.+.+.+.+.+..+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+.+.+.+.+.|
3e+06 +-+ |
| |
2.5e+06 +-+ |
| |
2e+06 +-+ |
| |
1.5e+06 +-+ |
1e+06 +-+ |
| |
500000 +-+ |
| |
0 +-+-----------O---------------------------------------------------+
will-it-scale.workload
6e+07 O-O-O-O--O-O-O---O-O-O-O--O-O-O-O-O-O-O-O--O-O-O-O-O-O-O------------+
|.+.+.+..+.+.+.+.+.+.+.+..+.+.+.+.+.+.+.+..+.+.+.+.+.+.+.+..+.+.+.+.|
5e+07 +-+ |
| |
| |
4e+07 +-+ |
| |
3e+07 +-+ |
| |
2e+07 +-+ |
| |
| |
1e+07 +-+ |
| |
0 +-+------------O----------------------------------------------------+
will-it-scale.time.user_time
1800 +-+------------------------------------------------------------------+
O.O.O.O..O.O.O.+.O.O..O.O.O.O.O.O..O.O.O.O.O.O.O..O.O.O.O.+.+..+.+.+.|
1600 +-+ |
1400 +-+ |
| |
1200 +-+ |
1000 +-+ |
| |
800 +-+ |
600 +-+ |
| |
400 +-+ |
200 +-+ |
| |
0 +-+------------O-----------------------------------------------------+
will-it-scale.time.system_time
3500 +-+------------------------------------------------------------------+
|.+.+.+..+.+.O.+.+.+..+.+.+.+.+.+..+.+.+.+.+.+.+..+.+.+.+.+.+..+.+.+.|
3000 O-O O O O O O O O O O O O O O O O O O O O O O O O |
| |
2500 +-+ |
| |
2000 +-+ |
| |
1500 +-+ |
| |
1000 +-+ |
| |
500 +-+ |
| |
0 +-+------------O-----------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
***************************************************************************************************
lkp-bdw-ep3d: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/thread/16/debian-x86_64-2018-04-03.cgz/lkp-bdw-ep3d/getppid1/will-it-scale
commit:
cba614f88d ("x86/entry/64: Use the TSS sp2 slot for rsp_scratch")
de87ae2926 ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
cba614f88dfdeadd de87ae29269664b890e5323ff9
---------------- --------------------------
%stddev %change %stddev
\ | \
4145966 +4.3% 4324715 will-it-scale.per_thread_ops
66335462 +4.3% 69195456 will-it-scale.workload
0.00 ± 30% +0.0 0.01 ± 27% mpstat.cpu.soft%
1570 ±141% +51.1% 2372 ± 98% numa-numastat.node0.other_node
2070 ± 4% +25.0% 2588 ± 17% numa-meminfo.node1.PageTables
25821 ± 8% +8.7% 28067 ± 5% numa-meminfo.node1.SReclaimable
1208 ± 4% +5.7% 1276 ± 5% slabinfo.eventpoll_pwq.active_objs
1208 ± 4% +5.7% 1276 ± 5% slabinfo.eventpoll_pwq.num_objs
6962 +2.7% 7149 proc-vmstat.nr_shmem
430.33 +770.7% 3747 ± 87% proc-vmstat.numa_hint_faults_local
3875 +5.8% 4099 ± 4% proc-vmstat.pgactivate
16121 ± 13% -32.4% 10896 ± 11% numa-vmstat.node0
1687 ±129% +47.6% 2489 ± 92% numa-vmstat.node0.numa_other
9365 ± 22% +55.3% 14544 ± 8% numa-vmstat.node1
517.33 ± 4% +25.1% 647.00 ± 17% numa-vmstat.node1.nr_page_table_pages
6454 ± 8% +8.7% 7016 ± 5% numa-vmstat.node1.nr_slab_reclaimable
8466921 ± 97% -83.8% 1374133 ± 2% cpuidle.C1.time
148474 ± 64% -52.1% 71137 ± 2% cpuidle.C1.usage
2.471e+08 ± 29% -98.4% 3942202 cpuidle.C1E.time
1802745 ± 64% -98.7% 24177 cpuidle.C1E.usage
1.408e+08 ± 59% +698.7% 1.124e+09 ± 4% cpuidle.C3.time
167633 ± 52% +1466.7% 2626385 ± 2% cpuidle.C3.usage
1.971e+09 ± 4% -34.8% 1.286e+09 ± 2% cpuidle.C6.time
2098510 ± 4% +8.6% 2278392 ± 5% cpuidle.C6.usage
38191909 ±140% -99.9% 21778 ± 6% cpuidle.POLL.time
617067 ±140% -99.7% 2139 cpuidle.POLL.usage
145146 ± 66% -53.1% 68095 turbostat.C1
0.12 ± 99% -0.1 0.02 turbostat.C1%
1802589 ± 64% -98.7% 24071 turbostat.C1E
3.40 ± 29% -3.4 0.05 turbostat.C1E%
167399 ± 52% +1468.9% 2626260 ± 2% turbostat.C3
1.93 ± 59% +13.5 15.45 ± 4% turbostat.C3%
2097688 ± 4% +8.6% 2277836 ± 5% turbostat.C6
27.12 ± 4% -9.5 17.66 ± 2% turbostat.C6%
0.01 ± 81% +1550.0% 0.17 ± 69% turbostat.CPU%c3
0.39 ± 50% -52.2% 0.18 ± 24% turbostat.CPU%c6
7.908e+11 -5.5% 7.469e+11 perf-stat.branch-instructions
5.21 ± 2% -2.3 2.89 perf-stat.branch-miss-rate%
4.12e+10 -47.5% 2.162e+10 perf-stat.branch-misses
40984991 ± 3% -5.2% 38849817 ± 2% perf-stat.cache-misses
0.01 ± 23% -0.0 0.00 ± 8% perf-stat.dTLB-load-miss-rate%
1.229e+08 ± 23% -76.8% 28492076 ± 7% perf-stat.dTLB-load-misses
0.00 ± 19% -0.0 0.00 ± 12% perf-stat.dTLB-store-miss-rate%
8136287 ± 19% -38.6% 4992992 ± 11% perf-stat.dTLB-store-misses
9.911e+11 -2.1% 9.703e+11 perf-stat.dTLB-stores
2.92 ± 22% +90.8 93.67 perf-stat.iTLB-load-miss-rate%
9.176e+08 ± 24% +2336.6% 2.236e+10 ± 2% perf-stat.iTLB-load-misses
3.041e+10 -95.0% 1.511e+09 perf-stat.iTLB-loads
4858 ± 26% -96.2% 185.77 ± 2% perf-stat.instructions-per-iTLB-miss
0.28 +0.4% 0.28 perf-stat.ipc
94.74 -2.1 92.67 perf-stat.node-load-miss-rate%
631159 ± 4% +34.9% 851575 perf-stat.node-loads
62949 -4.7% 59990 perf-stat.path-length
421.22 ± 21% +24.4% 524.13 ± 8% sched_debug.cfs_rq:/.exec_clock.min
40629 ± 2% +19.9% 48710 ± 10% sched_debug.cfs_rq:/.load.avg
65283 ± 2% +136.6% 154433 ± 51% sched_debug.cfs_rq:/.load.max
25268 ± 3% +63.6% 41328 ± 35% sched_debug.cfs_rq:/.load.stddev
262.06 ± 27% +17.0% 306.67 ± 19% sched_debug.cfs_rq:/.load_avg.max
9970 ± 17% +14.2% 11388 ± 8% sched_debug.cfs_rq:/.min_vruntime.min
0.69 ± 2% +10.7% 0.77 ± 5% sched_debug.cfs_rq:/.nr_running.avg
39579 ± 2% +16.3% 46028 ± 10% sched_debug.cfs_rq:/.runnable_weight.avg
56834 ± 2% +160.2% 147884 ± 55% sched_debug.cfs_rq:/.runnable_weight.max
23517 ± 3% +68.5% 39632 ± 39% sched_debug.cfs_rq:/.runnable_weight.stddev
333.16 ± 23% +34.7% 448.79 ± 14% sched_debug.cfs_rq:/.util_est_enqueued.avg
791.83 ± 23% +25.7% 995.42 ± 2% sched_debug.cfs_rq:/.util_est_enqueued.max
    306.76 ± 26%    +27.2%     390.29 ± 7%   sched_debug.cfs_rq:/.util_est_enqueued.stddev
258487 ± 15% -36.7% 163619 ± 27% sched_debug.cpu.avg_idle.min
175699 ± 5% +17.7% 206786 ± 3% sched_debug.cpu.avg_idle.stddev
0.11 ±141% +425.0% 0.58 ± 14% sched_debug.cpu.cpu_load[2].min
0.33 ±141% +150.0% 0.83 ± 40% sched_debug.cpu.cpu_load[4].min
40053 ± 2% +17.4% 47018 ± 9% sched_debug.cpu.load.avg
65371 ± 2% +136.2% 154433 ± 51% sched_debug.cpu.load.max
25663 ± 2% +64.5% 42219 ± 36% sched_debug.cpu.load.stddev
0.00 ± 6% +7.5% 0.00 ± 7% sched_debug.cpu.next_balance.stddev
1645 ± 9% +43.0% 2353 ± 7% sched_debug.cpu.nr_load_updates.stddev
0.69 ± 2% +11.6% 0.77 ± 3% sched_debug.cpu.nr_running.avg
0.01 -75.0% 0.00 ±100% sched_debug.cpu.nr_uninterruptible.avg
3.86 ± 5% +5.4% 4.06 ± 5% sched_debug.cpu.nr_uninterruptible.stddev
8404 ± 7% +8.5% 9118 ± 6% sched_debug.cpu.sched_count.avg
195.00 ± 32% +34.9% 263.00 ± 6% sched_debug.cpu.ttwu_count.min
7183 ± 79% -62.8% 2669 ± 10% sched_debug.cpu.ttwu_local.max
     30.00           -30.0        0.00        perf-profile.calltrace.cycles-pp.__entry_trampoline_start
     19.13            -3.4       15.74 ± 2%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
     16.13            -2.6       13.57        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11.06            -1.0       10.02        perf-profile.calltrace.cycles-pp.__x64_sys_getppid.do_syscall_64.entry_SYSCALL_64_after_hwframe
     16.31            +5.2       21.48 ± 15%  perf-profile.calltrace.cycles-pp.secondary_startup_64
     16.31            +5.2       21.48 ± 15%  perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
     16.31            +5.2       21.48 ± 15%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
     16.31            +5.2       21.48 ± 15%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
     16.25            +5.2       21.46 ± 15%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
     15.74 ± 4%      +5.7       21.45 ± 15%  perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
      0.00           +30.8       30.82 ± 4%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
     30.13           -30.1        0.00        perf-profile.children.cycles-pp.__entry_trampoline_start
     19.25            -3.2       16.05 ± 2%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     16.30            -2.5       13.78        perf-profile.children.cycles-pp.do_syscall_64
     10.57            -0.4       10.15        perf-profile.children.cycles-pp.__x64_sys_getppid
      0.46 ± 4%      -0.2        0.30 ± 3%  perf-profile.children.cycles-pp.apic_timer_interrupt
      0.42 ± 4%      -0.1        0.28 ± 3%  perf-profile.children.cycles-pp.smp_apic_timer_interrupt
      0.34 ± 6%      -0.1        0.24 ± 2%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.21 ± 4%      -0.1        0.16 ± 6%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.17 ± 4%      -0.1        0.12 ± 8%  perf-profile.children.cycles-pp.tick_sched_timer
      0.14 ± 3%      -0.0        0.11 ± 4%  perf-profile.children.cycles-pp.tick_sched_handle
      0.14 ± 5%      -0.0        0.11 ± 4%  perf-profile.children.cycles-pp.update_process_times
      0.08 ± 6%      -0.0        0.06 ± 9%  perf-profile.children.cycles-pp.clockevents_program_event
      0.00            +0.7        0.66        perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
     16.31            +5.2       21.48 ± 15%  perf-profile.children.cycles-pp.secondary_startup_64
     16.31            +5.2       21.48 ± 15%  perf-profile.children.cycles-pp.cpu_startup_entry
     16.31            +5.2       21.48 ± 15%  perf-profile.children.cycles-pp.do_idle
     16.31            +5.2       21.48 ± 15%  perf-profile.children.cycles-pp.start_secondary
     16.26            +5.2       21.46 ± 15%  perf-profile.children.cycles-pp.cpuidle_enter_state
     15.75 ± 4%      +5.7       21.45 ± 15%  perf-profile.children.cycles-pp.intel_idle
      0.00           +30.9       30.86 ± 4%  perf-profile.children.cycles-pp.entry_SYSCALL_64
     30.13           -30.1        0.00        perf-profile.self.cycles-pp.__entry_trampoline_start
      5.43 ± 2%      -2.3        3.17 ± 4%  perf-profile.self.cycles-pp.do_syscall_64
      3.04            -0.5        2.54 ± 6%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.00            +0.6        0.60        perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
     15.75 ± 4%      +5.7       21.45 ± 15%  perf-profile.self.cycles-pp.intel_idle
      0.00           +30.9       30.86 ± 4%  perf-profile.self.cycles-pp.entry_SYSCALL_64
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Rong, Chen