2013-05-30
13

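Is pickle faster than cPickle with numeric data?

I am currently working on image retrieval with Python. The keypoints and descriptors extracted from an image in this example are represented as numpy.array objects: the first has shape (2000, 5) and the second has shape (2000, 128). Both contain only values of dtype=numpy.float32.

So I was wondering which format to use to save my extracted keypoints and descriptors. I always save 2 files, one for the keypoints and one for the descriptors, and this counts as one step in my measurements. I compared pickle and cPickle (both with protocols 0 and 2) as well as NumPy's binary format .npy, and the results really confuse me: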

[Image: timing results, not reproduced here]

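I always thought cPickle should be faster than the pickle module. But the load times for protocol 0 in particular really stand out in the results. Does anyone have an explanation for this? Is it because I am only using numeric data? It seems odd...

PS: In my code I basically loop 1000 times (number=1000) over each technique and average the measured times at the end: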

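    import time
    import pickle
    import cPickle  # Python 2 only
    import numpy

    # Setup assumed here; the original post does not show it.
    # kp, descr and the *_path variables must be defined beforehand.
    number = 1000  # repetitions per technique
    results = {name: {} for name in ('npy', 'pkl0', 'cpkl0', 'pkl2', 'cpkl2')}
    timer = time.time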

    print 'npy save...' 
    t0 = timer() 
    for i in range(number): 
        numpy.save(npy_kp_path, kp)
        numpy.save(npy_descr_path, descr)
    t1 = timer() 
    results['npy']['save'] = t1 - t0 

    print 'npy load...' 
    t0 = timer() 
    for i in range(number): 
        kp = numpy.load(npy_kp_path)
        descr = numpy.load(npy_descr_path)
    t1 = timer() 
    results['npy']['load'] = t1 - t0 


    print 'pickle protocol 0 save...' 
    t0 = timer() 
    for i in range(number): 
        with open(pkl0_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=0)
        with open(pkl0_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=0)
    t1 = timer() 
    results['pkl0']['save'] = t1 - t0 

    print 'pickle protocol 0 load...' 
    t0 = timer() 
    for i in range(number): 
        with open(pkl0_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl0_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer() 
    results['pkl0']['load'] = t1 - t0 


    print 'cPickle protocol 0 save...' 
    t0 = timer() 
    for i in range(number): 
        with open(cpkl0_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=0)
        with open(cpkl0_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=0)
    t1 = timer() 
    results['cpkl0']['save'] = t1 - t0 

    print 'cPickle protocol 0 load...' 
    t0 = timer() 
    for i in range(number): 
        with open(cpkl0_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl0_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer() 
    results['cpkl0']['load'] = t1 - t0 


    print 'pickle highest protocol (2) save...' 
    t0 = timer() 
    for i in range(number): 
        with open(pkl2_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
        with open(pkl2_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = timer() 
    results['pkl2']['save'] = t1 - t0 

    print 'pickle highest protocol (2) load...' 
    t0 = timer() 
    for i in range(number): 
        with open(pkl2_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl2_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer() 
    results['pkl2']['load'] = t1 - t0 


    print 'cPickle highest protocol (2) save...' 
    t0 = timer() 
    for i in range(number): 
        with open(cpkl2_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
        with open(cpkl2_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
    t1 = timer() 
    results['cpkl2']['save'] = t1 - t0 

    print 'cPickle highest protocol (2) load...' 
    t0 = timer() 
    for i in range(number): 
        with open(cpkl2_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl2_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer() 
    results['cpkl2']['load'] = t1 - t0 
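
The repeated save/load blocks above could be folded into one small helper. What follows is only a sketch and not the asker's code: `average_time` and `benchmark_pickle` are hypothetical names, it targets Python 3 (where there is no separate `cPickle` module), and a `bytes` payload stands in for the NumPy arrays so that it runs with the standard library alone.

```python
import os
import pickle
import tempfile
import time


def average_time(func, number=1000):
    """Call func() `number` times and return the mean wall-clock time per call."""
    t0 = time.perf_counter()
    for _ in range(number):
        func()
    t1 = time.perf_counter()
    return (t1 - t0) / number


def benchmark_pickle(payload, protocol, number=1000):
    """Time saving and loading `payload` with the given pickle protocol."""
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)

    def save():
        with open(path, 'wb') as f:
            pickle.dump(payload, f, protocol=protocol)

    def load():
        with open(path, 'rb') as f:
            return pickle.load(f)

    try:
        # save is timed first, so the file exists when load is timed
        timings = {
            'save': average_time(save, number),
            'load': average_time(load, number),
        }
        assert load() == payload  # sanity check: the round trip is lossless
    finally:
        os.remove(path)
    return timings


if __name__ == '__main__':
    # Stand-in payload, roughly the size of a (2000, 128) float32 array.
    payload = os.urandom(2000 * 128 * 4)
    for protocol in (0, 2):
        print(protocol, benchmark_pickle(payload, protocol, number=10))
```

Dividing by `number` inside the helper gives a per-call average directly, instead of storing the total and averaging later.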

I just noticed this as well and found your question; I see at least an order of magnitude difference. pickle is definitely faster than cPickle.

Answers

6

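The (binary representation of the) numeric data of an ndarray is pickled as one long string. It appears that cPickle is indeed much slower than pickle at unpickling large strings from protocol 0 files. Why? My guess is that pickle uses well-tuned string algorithms from the standard library, while cPickle has fallen behind.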
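
One way to look at the "one long string" point is to list the opcodes of a pickle with `pickletools`. The sketch below makes several assumptions: it runs on Python 3, the helper names `float32_buffer` and `largest_string_opcode` are made up for illustration, and float32 values packed with `struct` stand in for an ndarray's data buffer. The exact opcodes differ between Python versions and protocols, so treat the output as illustrative only.

```python
import pickle
import pickletools
import struct


def float32_buffer(count):
    """Pack `count` float32 values into one bytes object (stand-in for ndarray data)."""
    values = [i * 0.5 for i in range(count)]
    return struct.pack('<%df' % count, *values)


def largest_string_opcode(data):
    """Return (opcode name, argument length) for the longest str/bytes argument."""
    best = ('', 0)
    for opcode, arg, _pos in pickletools.genops(data):
        if isinstance(arg, (str, bytes)) and len(arg) > best[1]:
            best = (opcode.name, len(arg))
    return best


if __name__ == '__main__':
    payload = float32_buffer(2000 * 5)  # same element count as the (2000, 5) keypoints
    print('raw bytes:', len(payload))
    for protocol in (0, 2):
        data = pickle.dumps(payload, protocol=protocol)
        print('protocol', protocol, '->', largest_string_opcode(data))
```

In each case a single opcode carries an argument as long as the whole buffer, so the unpickler's handling of that one string dominates the load.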

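The observation above comes from experimenting with Python 2.7. Python 3.3, which uses the C extension automatically, is faster than either module on Python 2.7, so apparently the problem has been fixed.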
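
On Python 3 the two implementations can still be compared directly, because the pure-Python classes remain available under private names. This is a hedged sketch: `load_with` is a hypothetical helper, `pickle._Unpickler` is an underscore-prefixed CPython name that is not part of the public API, and a `bytes` payload again stands in for the array data.

```python
import io
import pickle
import time


def load_with(unpickler_cls, data, number=20):
    """Return (average seconds per load, last result) for the given Unpickler class."""
    result = None
    t0 = time.perf_counter()
    for _ in range(number):
        result = unpickler_cls(io.BytesIO(data)).load()
    t1 = time.perf_counter()
    return (t1 - t0) / number, result


if __name__ == '__main__':
    payload = bytes(range(256)) * 4000  # about 1 MB of stand-in data
    data = pickle.dumps(payload, protocol=0)
    # pickle.Unpickler is the C implementation when the _pickle extension is
    # available; pickle._Unpickler is the pure-Python fallback (private name).
    candidates = (('C', pickle.Unpickler), ('pure Python', pickle._Unpickler))
    for name, cls in candidates:
        seconds, result = load_with(cls, data)
        assert result == payload
        print(name, seconds)
```

The numbers depend on the interpreter version and the payload, so it is worth running this on the actual data before drawing conclusions.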

Thanks for pointing out the solution with Python 3.3. My example does indeed use 2.7. I will check out 3.3. – pklip

@pklip But why would you stick with protocol 0? Protocol 2 is fast according to your timings.

Of course I will use protocol 2, since I do not need "text-readable" files. I was just wondering, and I am curious to try 3.3 in general :) – pklip