Courtesy of William Chan and Google. Faster than memcpy in Microsoft Visual Studio 2005:
// Copies size bytes from src to dest in 128-byte blocks.
// Assumes src and dest are both 16-byte aligned (movdqa/movntdq fault otherwise)
// and that size is a multiple of 128; any trailing size % 128 bytes are not copied.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{
    __asm
    {
        mov esi, src;             //src pointer
        mov edi, dest;            //dest pointer
        mov ebx, size;            //ebx is our counter
        shr ebx, 7;               //divide by 128 (8 * 128bit registers)
        jz loop_copy_end;         //fewer than 128 bytes: skip the loop instead of wrapping the counter
    loop_copy:
        prefetchnta 128[ESI];     //prefetch the next block with a non-temporal hint
        prefetchnta 160[ESI];
        prefetchnta 192[ESI];
        prefetchnta 224[ESI];
        movdqa xmm0, 0[ESI];      //move data from src to registers
        movdqa xmm1, 16[ESI];
        movdqa xmm2, 32[ESI];
        movdqa xmm3, 48[ESI];
        movdqa xmm4, 64[ESI];
        movdqa xmm5, 80[ESI];
        movdqa xmm6, 96[ESI];
        movdqa xmm7, 112[ESI];
        movntdq 0[EDI], xmm0;     //move data from registers to dest (non-temporal stores)
        movntdq 16[EDI], xmm1;
        movntdq 32[EDI], xmm2;
        movntdq 48[EDI], xmm3;
        movntdq 64[EDI], xmm4;
        movntdq 80[EDI], xmm5;
        movntdq 96[EDI], xmm6;
        movntdq 112[EDI], xmm7;
        add esi, 128;
        add edi, 128;
        dec ebx;
        jnz loop_copy;            //loop please
    loop_copy_end:
    }
}
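It runs roughly 30-70% faster than the stock memcpy, and you may be able to optimize it further depending on your specific situation and any assumptions you can make.
You may also want to look at the memcpy source (memcpy.asm) and strip out its special-case handling. It may be possible to optimize further!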
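If you cannot guarantee the assumptions the routine above relies on (16-byte aligned pointers, size a multiple of 128), one option is to check them at run time and fall back to the ordinary memcpy. The following is only a rough sketch of that idea, written with SSE2 intrinsics instead of inline assembly so it also builds outside 32-bit MSVC. The name copy_checked and the HAVE_SSE2_COPY macro are made up for this illustration and are not part of the original code; whether it actually beats your library's memcpy needs to be measured on your own hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch and _mm_sfence */
#define HAVE_SSE2_COPY 1 /* hypothetical macro for this sketch */
#else
#define HAVE_SSE2_COPY 0
#endif

/* Hypothetical wrapper: uses the 128-byte streaming loop only when both
 * pointers are 16-byte aligned, and plain memcpy for everything else. */
void *copy_checked(void *dest, const void *src, size_t size)
{
#if HAVE_SSE2_COPY
    if ((((uintptr_t)dest | (uintptr_t)src) & 15u) == 0) {
        char *d = (char *)dest;
        const char *s = (const char *)src;
        size_t blocks = size >> 7;   /* whole 128-byte blocks */
        size_t tail = size & 127u;   /* leftover bytes after the last block */
        int i;

        while (blocks--) {
            __m128i r[8];
            /* Hint that the next block is coming; prefetch does not fault,
             * so reaching past the end of the buffer is harmless. */
            _mm_prefetch(s + 128, _MM_HINT_NTA);
            _mm_prefetch(s + 192, _MM_HINT_NTA);
            for (i = 0; i < 8; ++i)  /* aligned loads, like movdqa */
                r[i] = _mm_load_si128((const __m128i *)(s + 16 * i));
            for (i = 0; i < 8; ++i)  /* non-temporal stores, like movntdq */
                _mm_stream_si128((__m128i *)(d + 16 * i), r[i]);
            s += 128;
            d += 128;
        }
        _mm_sfence();                /* order the streaming stores before returning */
        memcpy(d, s, tail);          /* copy the remaining size % 128 bytes */
        return dest;
    }
#endif
    return memcpy(dest, src, size);  /* unaligned input or no SSE2: ordinary copy */
}
```

Calling copy_checked(dst, src, n) behaves like memcpy for any n and any alignment; only aligned buffers take the streaming path. If you know your buffers are always aligned and always a multiple of 128 bytes, you can drop the checks and the tail copy again, which is exactly the kind of special-case stripping suggested above.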
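Can you write your code so that the copy isn't needed? – Ron 2009-11-11 13:44:27
Ron, no, I can't :( – horseyguy 2009-11-11 13:47:32
If you can get hold of the Intel compiler, you may have a better chance of the optimizer converting the copy into vector CPU instructions – 2009-11-11 13:54:05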