Adding Tony so he can either confirm, or point and laugh at my
assumptions. In general you're right that there are machine check
events that are not recoverable, but I'm thinking of problems like bus
lockups and other disasters out of the direct cpu-to-memory data path.
The question is whether we should avoid the cpu consuming media errors
at all costs, regardless of machine-check recovery. Tony, might there be
system-fatal gaps in memcpy_mcsafe() or userspace poison consumption
handling that would lead you to recommend aggressively trying to avoid
media errors?
TL;DR - I think it is worth it ... but I worry more about errors than most
people.
In current generation systems the two most common sources of machine
checks are memory and I/O. They dwarf all the others like cache and
bus lockups. So it is worth trying to avoid memory issues.
Whether you can recover from a machine check triggered from a CPU
read of memory depends on which instructions you use, and the alignment
of the access. That's why memcpy_mcsafe() will start with a few byte reads
if needed to align the source address, while other copy routines prefer to
align the destination (memory writes that straddle cache lines are more
expensive than reads that straddle them). The point of the routine is to
be safe, so we drop a tiny amount of performance in the unaligned case to
make sure we will be able to recover.
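
A minimal userspace sketch of the source-alignment idea described above.
This is not the kernel's memcpy_mcsafe() and does no machine check
handling at all; the name copy_src_aligned() and the 8-byte word size are
assumptions made purely for illustration. It only shows the ordering:
byte reads until the source is aligned, then aligned word reads, then a
byte tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): copy len bytes from src to dst,
 * aligning the SOURCE address before switching to 8-byte reads.
 */
void copy_src_aligned(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(dst != NULL && src != NULL);

	/* Leading byte reads until the source address is 8-byte aligned. */
	while (len > 0 && ((uintptr_t)s & 7) != 0) {
		*d++ = *s++;
		len--;
	}

	/*
	 * Source is now 8-byte aligned, so each 8-byte read stays inside
	 * one cache line. The destination may still be unaligned, so both
	 * the load and the store go through memcpy() to stay portable.
	 */
	while (len >= 8) {
		uint64_t word;

		memcpy(&word, s, sizeof(word));
		memcpy(d, &word, sizeof(word));
		d += 8;
		s += 8;
		len -= 8;
	}

	/* Trailing bytes that do not fill a whole word. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

The trade-off is the one stated above: stores to the destination may
straddle cache lines, which costs a little, but reads from the possibly
poisoned source never do.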
We can't control how userspace will access memory ... so if we can find
the errors before userspace stumbles into them it is a win.
-Tony