Yes, that's what we suspected. I just did another run with the
percpu mce structure aligned, and the regression is mostly gone (reduced
from 14.1% to 2%), which further supports that explanation.
I wonder whether it would be useful, for bisecting performance issues,
for you to change the global definition of DEFINE_PER_CPU() so that
all per-CPU definitions are aligned, just like you switch compiler flags
to make all functions aligned.
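
To illustrate the idea, here is a rough user-space sketch, not kernel code.
DEFINE_VAR(), DEFINE_VAR_ALIGNED(), struct mce_like and the 64-byte cache
line size are all made up for the example. In the kernel the natural
starting point would probably be the existing DEFINE_PER_CPU_ALIGNED()
variant in include/linux/percpu-defs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache line size; the kernel takes this from the architecture. */
#define CACHELINE_BYTES 64

/* Hypothetical user-space stand-ins, NOT the kernel's per-CPU macros. */
#define DEFINE_VAR(type, name)         type name
#define DEFINE_VAR_ALIGNED(type, name) \
	type name __attribute__((aligned(CACHELINE_BYTES)))

/* Made-up structure standing in for the per-CPU mce data. */
struct mce_like {
	uint64_t status;
	uint64_t addr;
	uint64_t misc;
};

/* A small neighbour whose presence could otherwise shift the layout. */
DEFINE_VAR(char, pad_before);

/* With the aligned variant, the start address no longer depends on
 * whatever happens to be placed in front of it. */
DEFINE_VAR_ALIGNED(struct mce_like, aligned_mce);

/* True if the address sits on a cache line boundary. */
static bool is_cacheline_aligned(const void *p)
{
	return ((uintptr_t)p % CACHELINE_BYTES) == 0;
}
```

If the plain macro expanded to the aligned form in a debug or bisection
build, unrelated changes to neighbouring variables could no longer move a
structure across a cache line boundary, which is the effect we seem to be
hitting here.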