On Wed, Feb 8, 2017 at 9:42 AM, Dan Williams <dan.j.williams(a)intel.com> wrote:
On Wed, Feb 8, 2017 at 7:10 AM, Jeff Moyer <jmoyer(a)redhat.com> wrote:
> Dan Williams <dan.j.williams(a)intel.com> writes:
>> If the platform supports machine-check-recovery then there is little
>> reason to kick off opportunistic scrubs to collect a media error list.
>> That initial scrub is only useful when it might prevent a kernel panic
>> from consuming poison (a media error from memory).
> How expensive is the scrub?
The ACPI spec is not clear, but it could range from benign to
expensive and degrading system performance for tens of minutes after
> Even on platforms that support recoverable
> machine checks, it's possible that you get one that is not recoverable.
> You haven't sold me on this change. ;-)
Adding Tony so he can either confirm, or point and laugh at my
assumptions. In general you're right that there are machine check
events that are not recoverable, but I'm thinking of problems like bus
lockups and other disasters out of the direct cpu-to-memory data path.
The question is whether we should avoid the cpu consuming media errors
at all costs regardless of machine-check recovery. Tony, might there be
system-fatal gaps in memcpy_mcsafe() or userspace poison consumption
handling that you would recommend aggressively trying to avoid media
I was able to chat with Ashok and he warned that not all instructions
that consume poison can generate a recovery point. So, thanks for
prompting the double-check, we should definitely try to collect the
badblocks list regardless of the machine check recovery capability of