Minchan Kim <minchan(a)kernel.org> writes:
> On Thu, Jun 16, 2016 at 03:27:44PM -0700, Huang, Ying wrote:
>> Minchan Kim <minchan(a)kernel.org> writes:
>>
>> > On Thu, Jun 16, 2016 at 07:52:26AM +0800, Huang, Ying wrote:
>> >> "Kirill A. Shutemov" <kirill(a)shutemov.name> writes:
>> >>
>> >> > On Tue, Jun 14, 2016 at 05:57:28PM +0900, Minchan Kim wrote:
>> >> >> On Wed, Jun 08, 2016 at 11:58:11AM +0300, Kirill A. Shutemov
wrote:
>> >> >> > On Wed, Jun 08, 2016 at 04:41:37PM +0800, Huang, Ying
wrote:
>> >> >> > > "Huang, Ying" <ying.huang(a)intel.com>
writes:
>> >> >> > >
>> >> >> > > > "Kirill A. Shutemov"
<kirill.shutemov(a)linux.intel.com> writes:
>> >> >> > > >
>> >> >> > > >> On Mon, Jun 06, 2016 at 10:27:24AM +0800,
kernel test robot wrote:
>> >> >> > > >>>
>> >> >> > > >>> FYI, we noticed a -6.3% regression of
unixbench.score due to commit:
>> >> >> > > >>>
>> >> >> > > >>> commit
5c0a85fad949212b3e059692deecdeed74ae7ec7 ("mm: make faultaround produce old
ptes")
>> >> >> > > >>>
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> >> >> > > >>>
>> >> >> > > >>> in testcase: unixbench
>> >> >> > > >>> on test machine: lituya: 16 threads
Haswell High-end Desktop (i7-5960X 3.0G) with 16G memory
>> >> >> > > >>> with following parameters:
cpufreq_governor=performance/nr_task=1/test=shell8
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>> Details are as below:
>> >> >> > > >>>
-------------------------------------------------------------------------------------------------->
>> >> >> > > >>>
>> >> >> > > >>>
>> >> >> > > >>>
=========================================================================================
>> >> >> > > >>>
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase:
>> >> >> > > >>>
gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench
>> >> >> > > >>>
>> >> >> > > >>> commit:
>> >> >> > > >>>
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
>> >> >> > > >>>
5c0a85fad949212b3e059692deecdeed74ae7ec7
>> >> >> > > >>>
>> >> >> > > >>> 4b50bcc7eda4d3cc
5c0a85fad949212b3e059692de
>> >> >> > > >>> ----------------
--------------------------
>> >> >> > > >>> fail:runs %reproduction
fail:runs
>> >> >> > > >>> | |
|
>> >> >> > > >>> 3:4 -75%
:4 kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>> >> >> > > >>> %stddev %change
%stddev
>> >> >> > > >>> \ |
\
>> >> >> > > >>> 14321 . 0% -6.3% 13425
. 0% unixbench.score
>> >> >> > > >>> 1996897 . 0% -6.1% 1874635
. 0% unixbench.time.involuntary_context_switches
>> >> >> > > >>> 1.721e+08 . 0% -6.2% 1.613e+08
. 0% unixbench.time.minor_page_faults
>> >> >> > > >>> 758.65 . 0% -3.0% 735.86
. 0% unixbench.time.system_time
>> >> >> > > >>> 387.66 . 0% +5.4% 408.49
. 0% unixbench.time.user_time
>> >> >> > > >>> 5950278 . 0% -6.2% 5583456
. 0% unixbench.time.voluntary_context_switches
>> >> >> > > >>
>> >> >> > > >> That's weird.
>> >> >> > > >>
>> >> >> > > >> I don't understand why the change would
reduce number or minor faults.
>> >> >> > > >> It should stay the same on x86-64. Rise of
user_time is puzzling too.
>> >> >> > > >
>> >> >> > > > unixbench runs in fixed time mode. That is,
the total time to run
>> >> >> > > > unixbench is fixed, but the work done varies.
So the minor_page_faults
>> >> >> > > > change may reflect only the work done.
>> >> >> > > >
>> >> >> > > >> Hm. Is reproducible? Across reboot?
>> >> >> > > >
>> >> >> > >
>> >> >> > > And FYI, there is no swap setup for test, all root
file system including
>> >> >> > > benchmark files are in tmpfs, so no real page
reclaim will be
>> >> >> > > triggered. But it appears that active file cache
reduced after the
>> >> >> > > commit.
>> >> >> > >
>> >> >> > > 111331 . 1% -13.3% 96503 . 0%
meminfo.Active
>> >> >> > > 27603 . 1% -43.9% 15486 . 0%
meminfo.Active(file)
>> >> >> > >
>> >> >> > > I think this is the expected behavior of the
commit?
>> >> >> >
>> >> >> > Yes, it's expected.
>> >> >> >
>> >> >> > After the change faularound would produce old pte. It
means there's more
>> >> >> > chance for these pages to be on inactive lru, unless
somebody actually
>> >> >> > touch them and flip accessed bit.
>> >> >>
>> >> >> Hmm, tmpfs pages should be in anonymous LRU list and VM
shouldn't scan
>> >> >> anonymous LRU list on swapless system so I really wonder why
active file
>> >> >> LRU is shrunk.
>> >> >
>> >> > Hm. Good point. I don't why we have anything on file lru if
there's no
>> >> > filesystems except tmpfs.
>> >> >
>> >> > Ying, how do you get stuff to the tmpfs?
>> >>
>> >> We put root file system and benchmark into a set of compressed cpio
>> >> archive, then concatenate them into one initrd, and finally kernel use
>> >> that initrd as initramfs.
>> >
>> > I see.
>> >
>> > Could you share your 4 full vmstat(/proc/vmstat) files?
>> >
>> > old:
>> >
>> > cat /proc/vmstat > before.old.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.old.vmstat
>> >
>> > new:
>> >
>> > cat /proc/vmstat > before.new.vmstat
>> > do benchmark
>> > cat /proc/vmstat > after.new.vmstat
>> >
>> > IOW, I want to see stats related to reclaim.
>>
>> Hi,
>>
>> The /proc/vmstat for the parent commit (parent-proc-vmstat.gz) and first
>> bad commit (fbc-proc-vmstat.gz) are attached with the email.
>>
>> The contents of the file is more than the vmstat before and after
>> benchmark running, but are sampled every 1 seconds. Every sample begin
>> with "time: <time>". You can check the first and last samples.
The
>> first /proc/vmstat capturing is started at the same time of the
>> benchmark, so it is not exactly the vmstat before the benchmark running.
>>
>
> Thanks for the testing!
>
> nr_active_file was shrunk 48% but the vaule itself is not huge so
> I don't think it affects performance a lot.
>
> There was no reclaim activity for testing. :(
>
> pgfault, 6% reduced. Given that, pgalloc/free reduced 6%, too
> because unixbench was time fixed mode and 6% regressed so no
> doubt.
>
> No interesting data.
>
> It seems you tested it with THP, maybe always mode?
Yes. With following in kconfig.
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
> I'm so sorry but could you test it with disabling CONFIG_TRANSPARENT_HUGEPAGE=n
> again? it might you already did.
> Is it still 6% regressed with disabling THP?
Yes. I disabled THP via
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
The regression is the same as before.
Still, 6% user_time regression and there is no difference with previous
experiment which enabled THP so I agree the regression is caused by just
access bit setting from CPU side, which is rather surprising to me.
I don't know how unixbench shell scripts testing touch the memory
but it should be per-page overhead and 6% regression for that is too heavy.
Anyway, at least, it would be better to notice it to x86 maintainer.
Thanks for the test!
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/tbox_group/test/testcase/thp_defrag/thp_enabled:
gcc-4.9/performance/x86_64-rhel/1/debian-x86_64-2015-02-07.cgz/lituya/shell8/unixbench/never/never
commit:
4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5
5c0a85fad949212b3e059692deecdeed74ae7ec7
4b50bcc7eda4d3cc 5c0a85fad949212b3e059692de
---------------- --------------------------
%stddev %change %stddev
\ | \
14332 ± 0% -6.2% 13438 ± 0% unixbench.score
6662206 ± 0% -6.2% 6252260 ± 0%
unixbench.time.involuntary_context_switches
5.734e+08 ± 0% -6.2% 5.376e+08 ± 0% unixbench.time.minor_page_faults
2527 ± 0% -3.2% 2446 ± 0% unixbench.time.system_time
1291 ± 0% +5.4% 1361 ± 0% unixbench.time.user_time
19875455 ± 0% -6.3% 18622488 ± 0% unixbench.time.voluntary_context_switches
6570355 ± 0% -11.9% 5787517 ± 0% cpuidle.C1-HSW.usage
17257 ± 34% -59.1% 7055 ± 7%
latency_stats.sum.ep_poll.SyS_epoll_wait.entry_SYSCALL_64_fastpath
5976 ± 0% -43.0% 3404 ± 0% proc-vmstat.nr_active_file
45729 ± 1% -22.5% 35439 ± 1% meminfo.Active
23905 ± 0% -43.0% 13619 ± 0% meminfo.Active(file)
8465 ± 3% -29.8% 5940 ± 3% slabinfo.pid.active_objs
8476 ± 3% -29.9% 5940 ± 3% slabinfo.pid.num_objs
3.46 ± 0% +12.5% 3.89 ± 0% turbostat.CPU%c3
67.09 ± 0% -2.1% 65.65 ± 0% turbostat.PkgWatt
96090 ± 0% -5.8% 90479 ± 0% vmstat.system.cs
9083 ± 0% -2.7% 8833 ± 0% vmstat.system.in
467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.MIN_vruntime.avg
7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.MIN_vruntime.max
1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.MIN_vruntime.stddev
467.35 ± 78% +416.7% 2414 ± 45% sched_debug.cfs_rq:/.max_vruntime.avg
7477 ± 78% +327.7% 31981 ± 39% sched_debug.cfs_rq:/.max_vruntime.max
1810 ± 78% +360.1% 8327 ± 40% sched_debug.cfs_rq:/.max_vruntime.stddev
-10724 ± -7% -12.0% -9433 ± -3% sched_debug.cfs_rq:/.spread0.avg
-17721 ± -4% -9.8% -15978 ± -2% sched_debug.cfs_rq:/.spread0.min
90355 ± 9% +14.1% 103099 ± 5% sched_debug.cpu.avg_idle.min
0.12 ± 35% +325.0% 0.52 ± 46% sched_debug.cpu.cpu_load[0].min
21913 ± 2% +29.1% 28288 ± 14% sched_debug.cpu.curr->pid.avg
49953 ± 3% +30.2% 65038 ± 0% sched_debug.cpu.curr->pid.max
23062 ± 2% +30.1% 29996 ± 4% sched_debug.cpu.curr->pid.stddev
274.39 ± 5% -10.2% 246.27 ± 3% sched_debug.cpu.nr_uninterruptible.max
242.73 ± 4% -13.5% 209.90 ± 2% sched_debug.cpu.nr_uninterruptible.stddev
Best Regards,
Huang, Ying