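C++ atomics and cross-thread visibility

AFAIK the C++ atomics (<atomic>) family provides 3 benefits:

indivisibility of primitive operations (no dirty reads),
memory ordering (both for the CPU and for the compiler) and
cross-thread visibility / propagation of changes.

I am not sure about the third bullet, so take a look at the following example.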

#include <atomic> 

std::atomic_bool a_flag = ATOMIC_VAR_INIT(false); 
struct Data { 
    int x; 
    long long y; 
    char const* z; 
} data; 

void thread0() 
{ 
    // due to "release" the data will be written to memory 
    // exactly in the following order: x -> y -> z 
    data.x = 1; 
    data.y = 100; 
    data.z = "foo"; 
    // there can be an arbitrary delay between the write 
    // to any of the members and its visibility in other 
    // threads (which don't synchronize explicitly) 

    // atomic_bool guarantees that the write to the "a_flag" 
    // will be clean, thus no other thread will ever read some 
    // strange mixture of 4bit + 4bits 
    a_flag.store(true, std::memory_order_release); 
} 

void thread1() 
{ 
    while (a_flag.load(std::memory_order_acquire) == false) {}; 
    // "acquire" on a "released" atomic guarantees that all the writes from 
    // thread0 (thus data members modification) will be visible here 
} 

void thread2() 
{ 
    while (data.y != 100) {}; 
    // not "acquiring" the "a_flag" doesn't guarantee that I will see all the 
    // memory writes, but when I see y == 100 I know I can assume that 
    // prior writes have been done due to "release ordering" => assert(x == 1) 
} 

int main() 
{ 
    thread0(); // concurrently 
    thread1(); // concurrently 
    thread2(); // concurrently 

    // join 

    return 0; 
}

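First, please validate my assumptions in the code (especially thread2).

Second, my questions are:

How does the write to a_flag propagate to other cores?
Does std::atomic synchronize a_flag in the writer's cache with the caches of the other cores (using MESI or anything else), or is the propagation automatic?
Assuming that on a particular machine a write to a flag is atomic (think int_32 on x86) and we don't have any private memory to synchronize (we only have a flag), do we need to use atomics?
Taking into consideration the most popular CPU architectures (x86, x64, ARM v.whatever, IA-64), is cross-core visibility (I am not considering reorderings now) automatic (but potentially delayed), or do you need to issue specific commands to propagate any piece of data?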

Source

2013-10-17 Red XIII

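Cores themselves don't matter. The question is "how do all cores eventually see the same memory update", which is something your hardware does for you (e.g. cache coherency protocols). There is only one memory, so the main concern is caching, which is a private concern of the hardware.
That question seems unclear. What matters is the acquire-release pair formed by the store to and the load of a_flag, which is a synchronisation point and causes the effects of thread0 and thread1 to appear in a certain order (i.e. everything in thread0 before the store happens-before everything after the loop in thread1).
Yes, otherwise you wouldn't have a synchronisation point.
You don't need any "commands" in C++. C++ isn't even aware that it is running on any particular kind of CPU. With enough imagination you could probably run a C++ program on a Rubik's cube. A C++ compiler chooses the instructions necessary to implement the synchronisation behaviour described by the C++ memory model; on x86 that involves issuing lock prefixes and memory fences, as well as not reordering instructions too much. Since x86 has a strongly ordered memory model, the above code should produce minimal additional code compared to the naive, incorrect version without atomics.
Having your thread2 in the code makes the entire program undefined behaviour: it reads data.y without any synchronisation while thread0 writes it, which is a data race.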
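
To make the acquire-release pair concrete, here is a minimal sketch of the question's code with real threads and without the racy thread2. The names Payload, payload, ready, producer, consumer and run_demo are made up for this illustration and are not from the original post; it is only one way to arrange it, assuming a C++11 (or later) compiler with thread support.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, chosen for this sketch only.
struct Payload {
    int x;
    long long y;
    const char* z;
};

Payload payload{};               // plain, non-atomic data
std::atomic<bool> ready{false};  // the synchronisation flag

void producer() {
    payload.x = 1;
    payload.y = 100;
    payload.z = "foo";
    // Release store: the writes above happen-before anything that
    // follows an acquire load which reads this "true".
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: once it returns true, the payload writes are visible.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    assert(payload.x == 1 && payload.y == 100);
    return payload.x;
}

int run_demo() {
    payload = Payload{};
    ready.store(false);
    int seen = 0;
    std::thread t0(producer);
    std::thread t1([&seen] { seen = consumer(); });
    t0.join();
    t1.join();  // join also synchronises, so reading "seen" here is safe
    return seen;
}
```

Calling run_demo() from main returns 1. The point of the arrangement is that payload is only read after the acquire load has observed the release store, which is exactly what thread2 in the question fails to do.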

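Just for fun, and to show that working out what's happening for yourself can be edifying, I compiled the code in three variations. (I added a global int x, and in thread1 I added x = data.y;.)

Acquire/release: (your code)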

thread0: 
    mov DWORD PTR data, 1 
    mov DWORD PTR data+4, 100 
    mov DWORD PTR data+8, 0 
    mov DWORD PTR data+12, OFFSET FLAT:.LC0 
    mov BYTE PTR a_flag, 1 
    ret 

thread1: 
.L14: 
    movzx eax, BYTE PTR a_flag 
    test al, al 
    je .L14 
    mov eax, DWORD PTR data+4 
    mov DWORD PTR x, eax 
    ret

Sequentially consistent: (remove the explicit ordering)

thread0: 
    mov eax, 1 
    mov DWORD PTR data, 1 
    mov DWORD PTR data+4, 100 
    mov DWORD PTR data+8, 0 
    mov DWORD PTR data+12, OFFSET FLAT:.LC0 
    xchg al, BYTE PTR a_flag 
    ret 

thread1: 
.L14: 
    movzx eax, BYTE PTR a_flag 
    test al, al 
    je .L14 
    mov eax, DWORD PTR data+4 
    mov DWORD PTR x, eax 
    ret

"Naive": (just using bool)

thread0: 
    mov DWORD PTR data, 1 
    mov DWORD PTR data+4, 100 
    mov DWORD PTR data+8, 0 
    mov DWORD PTR data+12, OFFSET FLAT:.LC0 
    mov BYTE PTR a_flag, 1 
    ret 

thread1: 
    cmp BYTE PTR a_flag, 0 
    jne .L3 
.L4: 
    jmp .L4 
.L3: 
    mov eax, DWORD PTR data+4 
    mov DWORD PTR x, eax 
    ret

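As you can see, there's not a big difference. The "incorrect" version actually looks mostly correct, except that the load is missing from the loop: the flag is tested once by cmp with a memory operand, and if it is not yet set the thread spins at .L4 without ever reading it again. The sequentially consistent version hides its expense in the xchg instruction, which has an implicit lock prefix and doesn't seem to require any explicit fences.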
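
The "naive" listing also bears on the flag-only case from the question: even with no other data to publish, a plain bool lets the compiler drop the repeated load. A hedged sketch of the atomic alternative follows; stop_requested, spin_until_stopped and run_stop_demo are hypothetical names invented here. It assumes that relaxed ordering is enough because nothing except the flag itself is communicated, and it relies on the standard's recommendation (not a hard guarantee) that atomic stores become visible to atomic loads within a reasonable amount of time.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical flag for this sketch: it guards no other data.
std::atomic<bool> stop_requested{false};

long spin_until_stopped() {
    long iterations = 0;
    // Each iteration performs a fresh atomic load, so the check stays
    // inside the loop, unlike the plain-bool version.
    while (!stop_requested.load(std::memory_order_relaxed)) {
        ++iterations;
        std::this_thread::yield();
    }
    return iterations;
}

bool run_stop_demo() {
    stop_requested.store(false, std::memory_order_relaxed);
    long iterations = -1;
    std::thread worker([&iterations] { iterations = spin_until_stopped(); });
    stop_requested.store(true, std::memory_order_relaxed);
    worker.join();  // join synchronises, so reading "iterations" is safe
    return iterations >= 0;  // true once the worker has left its loop
}
```

If the worker also had to read data written before the flag was set, relaxed would no longer be sufficient and the release/acquire pair from the question would be needed again.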

Source

2013-10-17 08:21:07
