2013-08-04 54 views

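pthread deadlock problem when using pthread_cond_t

I am having a terrible time trying to figure out why my synchronization code deadlocks when using the pthread library. Using WinAPI primitives instead of pthread works without any problems. Using C++11 threads also works fine (unless it is compiled with Visual Studio 2012 Service Pack 3, in which case it simply crashes; Microsoft has accepted that as a bug). Using pthread, however, proves to be a problem, at least when running on a Linux machine; I have not had a chance to try a different operating system.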

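I wrote a simple program to illustrate the problem. The code only demonstrates the deadlock; I am well aware that the design is very poor and could be written much better.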

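#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

typedef struct _pthread_event 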
{ 
    pthread_mutex_t Mutex; 
    pthread_cond_t Condition; 
    unsigned char State; 
} pthread_event; 

void pthread_event_create(pthread_event * ev , unsigned char init_state) 
{ 
    pthread_mutex_init(&ev->Mutex , 0); 
    pthread_cond_init(&ev->Condition , 0); 
    ev->State = init_state; 
} 

void pthread_event_destroy(pthread_event * ev) 
{ 
    pthread_cond_destroy(&ev->Condition); 
    pthread_mutex_destroy(&ev->Mutex); 
} 

void pthread_event_set(pthread_event * ev , unsigned char state) 
{ 
    pthread_mutex_lock(&ev->Mutex); 
    ev->State = state; 
    pthread_mutex_unlock(&ev->Mutex); 
    pthread_cond_broadcast(&ev->Condition); 
} 

unsigned char pthread_event_get(pthread_event * ev) 
{ 
    unsigned char result; 
    pthread_mutex_lock(&ev->Mutex); 
    result = ev->State; 
    pthread_mutex_unlock(&ev->Mutex); 
    return result; 
} 

unsigned char pthread_event_wait(pthread_event * ev , unsigned char state , unsigned int timeout_ms) 
{ 
    struct timeval time_now; 
    struct timespec timeout_time; 
    unsigned char result; 

    gettimeofday(&time_now , NULL); 
    timeout_time.tv_sec = time_now.tv_sec   + (timeout_ms/1000); 
    timeout_time.tv_nsec = time_now.tv_usec * 1000 + ((timeout_ms % 1000) * 1000000); 

    pthread_mutex_lock(&ev->Mutex); 
    while (ev->State != state) 
      if (ETIMEDOUT == pthread_cond_timedwait(&ev->Condition , &ev->Mutex , &timeout_time)) break; 

    result = ev->State; 
    pthread_mutex_unlock(&ev->Mutex); 
    return result; 
} 

static pthread_t  thread_1; 
static pthread_t  thread_2; 
static pthread_event data_ready; 
static pthread_event data_needed; 

void * thread_fx1(void * c) 
{ 
    for (; ;) 
    { 
     pthread_event_wait(&data_needed , 1 , 90); 
     pthread_event_set(&data_needed , 0); 
     usleep(100000); 
     pthread_event_set(&data_ready , 1); 
     printf("t1: tick\n"); 
    } 
} 

void * thread_fx2(void * c) 
{ 
    for (; ;) 
    { 
     pthread_event_wait(&data_ready , 1 , 50); 
     pthread_event_set(&data_ready , 0); 
     pthread_event_set(&data_needed , 1); 
     usleep(100000); 
     printf("t2: tick\n"); 
    } 
} 


int main(int argc , char * argv[]) 
{ 
    pthread_event_create(&data_ready , 0); 
    pthread_event_create(&data_needed , 0); 

    pthread_create(&thread_1 , NULL , thread_fx1 , 0); 
    pthread_create(&thread_2 , NULL , thread_fx2 , 0); 

    pthread_join(thread_1 , NULL); 
    pthread_join(thread_2 , NULL); 

    pthread_event_destroy(&data_ready); 
    pthread_event_destroy(&data_needed); 

    return 0; 
} 

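Basically, two threads signal each other to start doing something, and each does its own thing if it has not been signaled even after a short timeout.

Any idea what the problem is?

Thanks.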

Answers


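The problem is the timeout_time argument to pthread_cond_timedwait(). The way you increment it, it will sooner or later end up holding an invalid value, with a nanosecond part greater than or equal to one billion (1000000000). In that case pthread_cond_timedwait() may return EINVAL, and probably does so before it actually waits on the condition. Since the wait loop only breaks on ETIMEDOUT, it would then most likely keep spinning without ever releasing the mutex, so the other thread can never change the state.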
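As a minimal sketch of the normalization (the helper name deadline_after is made up for illustration and is not part of the original code), the deadline can be built so that tv_nsec always stays below one billion:

```c
#include <assert.h>
#include <sys/time.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper: absolute deadline `timeout_ms` milliseconds after
   `now`, with any nanosecond overflow carried into the seconds field. */
static struct timespec deadline_after(const struct timeval *now,
                                      unsigned int timeout_ms)
{
    struct timespec ts;

    assert(now != NULL);
    ts.tv_sec  = now->tv_sec + timeout_ms / 1000;
    ts.tv_nsec = now->tv_usec * 1000L + (long)(timeout_ms % 1000) * 1000000L;

    /* Both terms are below one second, so a single carry is enough. */
    if (ts.tv_nsec >= NSEC_PER_SEC) {
        ts.tv_sec  += 1;
        ts.tv_nsec -= NSEC_PER_SEC;
    }
    return ts;
}
```

In pthread_event_wait() this would replace the hand-computed value, for example timeout_time = deadline_after(&time_now, timeout_ms);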

The problem can be found very quickly with valgrind --tool=helgrind ./test_prog (so quickly that it reports having detected 10,000,000 errors and gives up counting):

bash$ gcc -Werror -Wall -g test.c -o test -lpthread && valgrind --tool=helgrind ./test 
==3035== Helgrind, a thread error detector 
==3035== Copyright (C) 2007-2012, and GNU GPL'd, by OpenWorks LLP et al. 
==3035== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==3035== Command: ./test 
==3035== 
t1: tick 
t2: tick 
t2: tick 
t1: tick 
t2: tick 
t1: tick 
t1: tick 
t2: tick 
t1: tick 
t2: tick 
t1: tick 
==3035== ---Thread-Announcement------------------------------------------ 
==3035== 
==3035== Thread #2 was created 
==3035== at 0x41843C8: clone (clone.S:110) 
==3035== 
==3035== ---------------------------------------------------------------- 
==3035== 
==3035== Thread #2's call to pthread_cond_timedwait failed 
==3035== with error code 22 (EINVAL: Invalid argument) 
==3035== at 0x402DB03: pthread_cond_timedwait_WRK (hg_intercepts.c:784) 
==3035== by 0x8048910: pthread_event_wait (test.c:65) 
==3035== by 0x8048965: thread_fx1 (test.c:80) 
==3035== by 0x402E437: mythread_wrapper (hg_intercepts.c:219) 
==3035== by 0x407DD77: start_thread (pthread_create.c:311) 
==3035== by 0x41843DD: clone (clone.S:131) 
==3035== 
t2: tick 
==3035== 
==3035== More than 10000000 total errors detected. I'm not reporting any more. 
==3035== Final error counts will be inaccurate. Go fix your program! 
==3035== Rerun with --error-limit=no to disable this cutoff. Note 
==3035== that errors may occur in your program without prior warning from 
==3035== Valgrind, because errors are no longer being displayed. 
==3035== 
^C==3035== 
==3035== For counts of detected and suppressed errors, rerun with: -v 
==3035== Use --history-level=approx or =none to gain increased speed, at 
==3035== the cost of reduced accuracy of conflicting-access information 
==3035== ERROR SUMMARY: 10000000 errors from 1 contexts (suppressed: 412 from 109) 
Killed 
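The log shows the EINVAL return being ignored by a loop that only breaks on ETIMEDOUT. Here is a sketch of a wait that stops on any nonzero return code, assuming the same struct layout as in the question (event_wait_until and its return convention are hypothetical, not the original API):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Same fields as the pthread_event struct in the question. */
typedef struct {
    pthread_mutex_t Mutex;
    pthread_cond_t  Condition;
    unsigned char   State;
} pthread_event;

/* Hypothetical variant of the wait: returns 0 once State == state,
   otherwise the error code reported by pthread_cond_timedwait()
   (ETIMEDOUT, EINVAL, ...), so a bad deadline cannot cause a silent spin. */
static int event_wait_until(pthread_event *ev, unsigned char state,
                            const struct timespec *deadline)
{
    int rc = 0;

    assert(ev != NULL && deadline != NULL);
    pthread_mutex_lock(&ev->Mutex);
    while (ev->State != state && rc == 0)
        rc = pthread_cond_timedwait(&ev->Condition, &ev->Mutex, deadline);
    if (ev->State == state)
        rc = 0;   /* the state was reached, even if the last wait timed out */
    pthread_mutex_unlock(&ev->Mutex);
    return rc;
}
```

A caller can then tell a timeout (ETIMEDOUT) apart from a programming error such as EINVAL and log the latter instead of looping.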

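Two other minor remarks:

  1. For better correctness, in your pthread_event_set() you could do the condition variable broadcast before unlocking the mutex (the effect of the wrong ordering is basically that it may break the determinism of scheduling; helgrind also complains about this);
  2. You can safely drop the mutex locking in pthread_event_get() when returning the value of ev->State; this should be an atomic operation.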
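The first remark could look roughly like this. It is only a sketch: the struct repeats the fields from the question, and the function name event_set_locked_broadcast is invented for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Same fields as the pthread_event struct in the question. */
typedef struct {
    pthread_mutex_t Mutex;
    pthread_cond_t  Condition;
    unsigned char   State;
} pthread_event;

/* Hypothetical rewrite of pthread_event_set(): the broadcast is issued
   while the mutex is still held, before it is unlocked. */
static void event_set_locked_broadcast(pthread_event *ev, unsigned char state)
{
    assert(ev != NULL);
    pthread_mutex_lock(&ev->Mutex);
    ev->State = state;
    pthread_cond_broadcast(&ev->Condition);
    pthread_mutex_unlock(&ev->Mutex);
}
```

Waiters still have to re-check State in their loop after waking up, exactly as pthread_event_wait() already does.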

Wow, that was actually a much easier fix than I feared. I just added timeout_time.tv_sec += timeout_time.tv_nsec / 1000000000; timeout_time.tv_nsec = timeout_time.tv_nsec % 1000000000; to normalize it, and it runs fine. Thanks! – user1976633