2014-04-20 79 views
4

我想知道如何將這個C代碼轉換爲C++來進行內存對齊。C++中的SSE的內存對齊,_aligned_malloc等價嗎?

float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16);

我沒有看here,然後我想這 float *pResult = (float*) __attribute__((aligned(16)));

float *pResult = __attribute__((aligned(16)));

但都給出了類似的錯誤。

error: expected primary-expression before '__attribute__'| 
error: expected ',' or ';' before '__attribute__'| 

完整代碼

#include "stdafx.h" 
#include <xmmintrin.h> // Need this for SSE compiler intrinsics 
#include <math.h>  // Needed for sqrt in CPU-only version 
#include "stdio.h" 

int main(int argc, char* argv[]) 
{ 
    printf("Starting calculation...\n"); 

    const int length = 64000; 

    // We will be calculating Y = Sin(x)/x, for x = 1->64000 

    // If you do not properly align your data for SSE instructions, you may take a huge performance hit. 
    float *pResult = (float*) __attribute__((aligned(16))); // align to 16-byte for SSE 
    __m128 x; 
    __m128 xDelta = _mm_set1_ps(4.0f);  // Set the xDelta to (4,4,4,4) 
    __m128 *pResultSSE = (__m128*) pResult; 


    const int SSELength = length/4; 

    for (int stress = 0; stress < 100000; stress++) // lots of stress loops so we can easily use a stopwatch 
    { 
#define TIME_SSE // Define this if you want to run with SSE 
#ifdef TIME_SSE 
     x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // Set the initial values of x to (4,3,2,1) 
     for (int i=0; i < SSELength; i++) 
     { 
      __m128 xSqrt = _mm_sqrt_ps(x); 
      // Note! Division is slow. It's actually faster to take the reciprocal of a number and multiply 
      // Also note that Division is more accurate than taking the reciprocal and multiplying 

#define USE_DIVISION_METHOD 
#ifdef USE_FAST_METHOD 
      __m128 xRecip = _mm_rcp_ps(x); 
      pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt); 
#endif //USE_FAST_METHOD 
#ifdef USE_DIVISION_METHOD 
      pResultSSE[i] = _mm_div_ps(xSqrt, x); 
#endif // USE_DIVISION_METHOD 

      // NOTE! Sometimes, the order in which things are done in SSE may seem reversed. 
      // When the command above executes, the four floating elements are actually flipped around 
      // We have already compensated for that flipping by setting the initial x vector to (4,3,2,1) instead of (1,2,3,4) 

      x = _mm_add_ps(x, xDelta); // Advance x to the next set of numbers 
     } 
#endif // TIME_SSE 
#ifndef TIME_SSE 
     float xFloat = 1.0f; 
     for (int i=0 ; i < length; i++) 
     { 
      pResult[i] = sqrt(xFloat)/xFloat; // Even though division is slow, there are no intrinsic functions like there are in SSE 
      xFloat += 1.0f; 
     } 
#endif // !TIME_SSE 
    } 

    // To prove that the program actually worked 
    for (int i=0; i < 20; i++) 
    { 
     printf("Result[%d] = %f\n", i, pResult[i]); 
    } 

    // Results for my particular system 
    // 23.75 seconds for SSE with reciprocal/multiplication method 
    // 38.5 seconds for SSE with division method 
    // 301.5 seconds for CPU 

    return 0; 
} 
+0

你可以用'_aligned_malloc'爲C或C++,但請注意,這是微軟特有的。 –

+0

我正在使用GCC,mingw呢? – pandoragami

+0

在這種情況下,使用更便攜的'memalign' /'posix_memalign'調用系列 - 有幾個相關的問題已經在StackOverflow上有了很好的答案。 –

回答

2

與C++ 11,你可以使用類似:

struct aligned_float 
{ 
    alignas(16) float f[4]; 
}; 

static_assert(sizeof(aligned_float) == 4 * sizeof(float), "padding issue"); 

int main() 
{ 
    const int length = 64000; 
    std::vector<aligned_float> pResult(length/sizeof(aligned_float)); 

    return 0; 
} 
+0

'std :: vector '和'new'實際上並不尊重過度對齊的對齊要求(至少不是在C++ 11或C++ 14中,可能稍後會改變)。這將在大多數x86-64系統上工作,因爲分配器默認已經返回16B對齊的內存。 –

1

對齊的屬性只適用於事情是如何編譯/鏈接。它沒有運行效果。

我知道解決這個問題的唯一便攜式方法是擁有一個實際分配多於必要數量的包裝,並且屏蔽掉較低位以確保它返回的內容符合最大對齊。

+0

我想我只是需要它爲GCC。最簡單的答案就足夠了。 – pandoragami