我知道這是一個古老的問題,但由於沒有人給出了使用IEEE浮點表示的解決方案,下面是一個。
// Use three unions instead of one to avoid pipeline stalls
union { float f; uint32_t i; } t, u, v, w;
t.f = 32768.0f;
float const b = 256.f/255.f;
for(int size = width * height; size > 0; --size)
{
u.i = t.i | bytepixel[0]; floatpixel[0] = (u.f - t.f) * b;
v.i = t.i | bytepixel[1]; floatpixel[1] = (v.f - t.f) * b;
w.i = t.i | bytepixel[2]; floatpixel[2] = (w.f - t.f) * b;
floatpixel[3] = 1.0f; // A
floatpixel += 4;
bytepixel += 4;
}
這比快兩倍作爲int
到float
轉換我的計算機(酷睿2 CPU)上。
下面是上述代碼的SSE3版本,每次都執行16次浮動操作。它需要bytepixel
和floatpixel
爲128位對齊,並且總大小爲4的倍數。請注意,SSE3內置的int浮點轉換在這裏沒有多大幫助,因爲它們無論如何都需要額外的乘法運算。我相信這是最簡單的教學方式,但是如果你的編譯器不夠聰明,你可能希望手動展開和安排。
/* Magic values */
__m128i zero = _mm_set_epi32(0, 0, 0, 0);
__m128i magic1 = _mm_set_epi32(0xff000000, 0xff000000, 0xff000000, 0xff000000);
__m128i magic2 = _mm_set_epi32(0x47004700, 0x47004700, 0x47004700, 0x47004700);
__m128 magic3 = _mm_set_ps(32768.0f, 32768.0f, 32768.0f, 32768.0f);
__m128 magic4 = _mm_set_ps(256.0f/255.0f, 256.0f/255.0f, 256.0f/255.0f, 256.0f/255.0f);
for(int size = width * height/4; size > 0; --size)
{
/* Load bytes in vector and force alpha value to 255 so that
* the output will be 1.0f as expected. */
__m128i in = _mm_load_si128((__m128i *)bytepixel);
in = _mm_or_si128(in, magic1);
/* Shuffle bytes into four ints ORed with 32768.0f and cast
* to float (the cast is free). */
__m128i tmplo = _mm_unpacklo_epi8(in, zero);
__m128i tmphi = _mm_unpackhi_epi8(in, zero);
__m128 in1 = _mm_castsi128_ps(_mm_unpacklo_epi16(tmplo, magic2));
__m128 in2 = _mm_castsi128_ps(_mm_unpackhi_epi16(tmplo, magic2));
__m128 in3 = _mm_castsi128_ps(_mm_unpacklo_epi16(tmphi, magic2));
__m128 in4 = _mm_castsi128_ps(_mm_unpackhi_epi16(tmphi, magic2));
/* Subtract 32768.0f and multiply by 256.0f/255.0f */
__m128 out1 = _mm_mul_ps(_mm_sub_ps(in1, magic3), magic4);
__m128 out2 = _mm_mul_ps(_mm_sub_ps(in2, magic3), magic4);
__m128 out3 = _mm_mul_ps(_mm_sub_ps(in3, magic3), magic4);
__m128 out4 = _mm_mul_ps(_mm_sub_ps(in4, magic3), magic4);
/* Store 16 floats */
_mm_store_ps(floatpixel, out1);
_mm_store_ps(floatpixel + 4, out2);
_mm_store_ps(floatpixel + 8, out3);
_mm_store_ps(floatpixel + 12, out4);
floatpixel += 16;
bytepixel += 16;
}
編輯:使用(f + c/b) * b
代替f * b + c
提高準確性。
編輯:添加SSE3版本。
爲了完整起見,字節緩衝區中的128位將在浮點緩衝區中爲.5019607843f,而不是.5f。 – 2011-03-19 13:32:00