Below are two versions of a spin lock. The first uses the default memory_order_seq_cst, while the second uses memory_order_acquire/memory_order_release. Since the latter is more relaxed, I expected it to perform better, but that doesn't seem to be the case. Why does my spin lock implementation perform worse when I use the non-seq_cst memory orderings?
#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Exponential backoff helper; its definition is omitted here.
void DoWaitBackoff(int& backoff);

class SimpleSpinLock
{
public:
    inline void lock()
    {
        int backoff = 0;
        while (mFlag.test_and_set()) { DoWaitBackoff(backoff); }
    }

    inline void unlock()
    {
        mFlag.clear();
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};
class SimpleSpinLock2
{
public:
    inline void lock()
    {
        int backoff = 0;
        while (mFlag.test_and_set(std::memory_order_acquire)) { DoWaitBackoff(backoff); }
    }

    inline void unlock()
    {
        mFlag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};
const int NUM_THREADS = 8;
const int NUM_ITERS = 5000000;
const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

int val = 0;
long j = 0;

SimpleSpinLock spinLock;

void ThreadBody()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {
        spinLock.lock();
        ++val;
        j = i * 3.5 + val;
        spinLock.unlock();
    }
}
int main()
{
    vector<thread> threads;
    for (int i = 0; i < NUM_THREADS; ++i)
    {
        cout << "Creating thread " << i << endl;
        threads.push_back(thread(ThreadBody));
    }
    for (thread& thr : threads)
    {
        thr.join();
    }
    cout << "Final value: " << val << "\t" << j << endl;
    assert(val == EXPECTED_VAL);
    return 0;
}
I'm running gcc 4.8.2 on Ubuntu 12.04 with -O3 optimization.
- Spin lock with memory_order_seq_cst:
Run 1:
real 0m1.588s
user 0m4.548s
sys 0m0.052s
Run 2:
real 0m1.577s
user 0m4.580s
sys 0m0.032s
Run 3:
real 0m1.560s
user 0m4.436s
sys 0m0.032s
- Spin lock with memory_order_acquire/release:
Run 1:
real 0m1.797s
user 0m4.608s
sys 0m0.100s
Run 2:
real 0m1.853s
user 0m4.692s
sys 0m0.164s
Run 3:
real 0m1.784s
user 0m4.552s
sys 0m0.124s
Run 4:
real 0m1.475s
user 0m3.596s
sys 0m0.120s
With the more relaxed orderings I see more variability. It is sometimes better, but usually worse. Does anyone have an explanation for this?
What happens if you remove the backoff? (As a general rule, you want to spin on reads rather than on atomic read-modify-write operations.) – kec
For GCC on Intel, I would expect the two to behave identically unless they generate different code. Have you compared the assembly output of the two versions of ThreadBody? – Casey
@Casey: It turns out there is an extra fence in the seq_cst version. I'll have to think carefully about whether it is really needed. Good comment. – kec