You can save yourself this MOV here in what is, I'm assuming, the
common case where @src is already aligned, and do:
	/* check for bad alignment of source */
	testl $7, %esi
	jz 102f			/* already aligned? */
	movl %esi,%ecx
	subl $8,%ecx
	negl %ecx
	subl %ecx,%edx
0:	movb (%rsi),%al
	movb %al,(%rdi)
	incq %rsi
	incq %rdi
	decl %ecx
	jnz 0b
The "testl $7, %esi" just checks the low three bits ... it doesn't
change %esi. But the code from the "subl $8" on down assumes that
%ecx is a number in [1..7] as the count of bytes to copy until we
achieve alignment.
So your "movl %esi,%ecx" needs to be somthing that just copies the
low three bits and zeroes the high part of %ecx. Is there a cute
way to do that in x86 assembler?
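One option (a sketch of mine, not something from the thread) is to give
up on eliminating the MOV and instead fold the masking and the alignment
test into a single AND, which sets ZF as a side effect, so the separate
TEST disappears:

	movl %esi,%ecx
	andl $7,%ecx		/* %ecx = src & 7, sets ZF */
	jz 102f			/* already aligned */
	subl $8,%ecx
	negl %ecx		/* %ecx = 8 - (src & 7), in [1..7] */
	subl %ecx,%edx

That is essentially what the existing destination-alignment code in
copy_user_64.S does; it trades the TEST for an AND rather than saving
the MOV on the aligned path.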
Why aren't we pushing %r12-%r15 on the stack after the "jz 17f" above
and using them too, thus copying a whole cacheline in one go? We would
need to restore them when we're done with the cacheline-wise shuffle,
of course.
I copied that loop from arch/x86/lib/copy_user_64.S:__copy_user_nocache().
I guess the answer depends on whether you generally copy enough
cachelines that the time saved covers the cost of saving and
restoring those registers.
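For concreteness, here is roughly what I imagine the cacheline-at-a-time
body would look like (my own sketch in the __copy_user_nocache() style,
not code from the thread; the label numbers and count handling are
illustrative):

	/* small copies took the "jz 17f" branch and skip all of this */
	pushq %r12
	pushq %r13
	pushq %r14
	pushq %r15
	/* %ecx = number of whole cachelines to copy */
0:	movq 0*8(%rsi),%r8
	movq 1*8(%rsi),%r9
	movq 2*8(%rsi),%r10
	movq 3*8(%rsi),%r11
	movq 4*8(%rsi),%r12
	movq 5*8(%rsi),%r13
	movq 6*8(%rsi),%r14
	movq 7*8(%rsi),%r15
	movnti %r8,0*8(%rdi)
	movnti %r9,1*8(%rdi)
	movnti %r10,2*8(%rdi)
	movnti %r11,3*8(%rdi)
	movnti %r12,4*8(%rdi)
	movnti %r13,5*8(%rdi)
	movnti %r14,6*8(%rdi)
	movnti %r15,7*8(%rdi)
	leaq 64(%rsi),%rsi
	leaq 64(%rdi),%rdi
	decl %ecx
	jnz 0b
	popq %r15
	popq %r14
	popq %r13
	popq %r12

Eight loads before any store fills exactly one 64-byte line per pass;
the four extra push/pop pairs are the cost the copy length would need
to amortize.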
-Tony