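How do I correctly use the multiprocessing module in Python?

I have 110 PDF files that I'm trying to extract images from. Once the images are extracted, I'd like to remove any duplicates and delete images smaller than 4KB. My code to do that looks like this: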

def extract_images_from_file(pdf_file): 
    file_name = os.path.splitext(os.path.basename(pdf_file))[0] 
    call(["pdfimages", "-png", pdf_file, file_name]) 
    os.remove(pdf_file) 

def dedup_images(): 
    os.mkdir("unique_images") 
    md5_library = [] 
    images = glob("*.png") 
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..." 
    for image in images: 
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5'] 
    dat = pd.DataFrame(md5_library, columns=headers).sort(['md5']) 
    dat.drop_duplicates(subset="md5", inplace=True) 

    print "Extracting the unique images." 
    unique_images = dat.image_file.tolist() 
    for image in unique_images: 
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)

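This process can take a while, so I've started to dabble in multithreading. Feel free to interpret that as me saying I have no idea what I'm doing. I thought extracting the images would be easy to parallelize, but not the deduplication, since there is a lot of I/O going on and I'm not sure how to handle that. So here is my attempt at the parallel process: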

if __name__ == '__main__': 
    filepath = sys.argv[1] 
    folder_name = os.getcwd() + "\\all_images\\" 
    if not os.path.exists(folder_name): 
        os.mkdir(folder_name)
    pdfs = glob("*.pdf") 
    print "Copying all PDFs to the images folder..." 
    for pdf in pdfs: 
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images") 
    pool = Pool(processes=8) 
    print "Extracting images from PDFs..." 
    pool.map(extract_images_from_file, pdfs) 
    print "Extracting unique images into a new folder..." 
    dedup_images() 
    print "All images have been extracted and deduped."

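Everything seems to work fine while the images are being extracted, but then it all goes haywire. So here are my questions:

1) Am I setting up the parallel process correctly?
2) Does it continue to try to use all 8 processors on dedup_images()?
3) Is there anything I'm missing and/or not doing correctly?

Thanks in advance!

EDIT: Here is what I mean by "haywire". The errors start out with a bunch of lines like this: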

I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey of1pi0e 
l2ne1 1i'4mS auogbiepl o2fefinrlaee e [email protected]'egSwmu abYipolor ekcn oaCm o Nupentwt y1Y -o18r16k11 8.C1po4nu gn3't4 
y7 5160120821143 3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C 
3o-u3l6d0n.'ptn go'p 
en image file 'Ia/ ON eEwr rYoorr:k CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o uiolmidalng2'eft r m ' 
ai gpceoo emfn iapl teN e1'w-S 8uY6bo2pr.okpe nnCgao' u 
Nnetwy Y1o0r2k8 1C4o u3n4t7y9 918181881134 3p4t7 536-1306211.3p npgt' 
4-879.png' 
I/O Error: CoulId/nO' tE rorpoern: iCmoaugled nf'itl eo p'eub piomeangae fNielwe Y'oSrukb pCooeunnat yN e1w0 2Y8o1r 
4k 3C4o7u9n9t8y8 811032 1p1t4 3o-i3l622f pt 1-863.png'

Then it becomes more readable, with multiple lines like this:

I/O Error: Couldn't open image file 'pt 1-864.png' 
I/O Error: Couldn't open image file 'pt 1-865.png' 
I/O Error: Couldn't open image file 'pt 1-866.png' 
I/O Error: Couldn't open image file 'pt 1-867.png'

This repeats for a while, going back and forth between garbled and readable error text.

Finally, it gets to here:

Deleting images smaller than 4KB and generating the MD5 hash values for all other images... 
Extracting unique images into a new folder...

This implies that the code picks back up and continues with the process. What could possibly be going wrong?

2015-10-02 brittenb

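This looks fine to me. Can you be more specific about "going haywire"? – strubbly

@strubbly I've added the error output above. – brittenb

"I've started to dabble in multithreading. Feel free to interpret that as me saying I have no idea what I'm doing." You and everyone else who starts working with concurrency. –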

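Your code is basically fine.

The garbled text is several processes all trying to write their own copy of the I/O Error message to the console at once, so the output is interleaved. The error message itself is generated by the pdfimages command, probably because two instances running at the same time conflict with each other, perhaps over temporary files, or because both use the same file name, or something like that.

Try using a different image root for each separate PDF file.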
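One possible way to follow this suggestion is sketched below: it builds a distinct image root per PDF by appending a short random alphanumeric suffix to the base name, which is the kind of fix the asker describes in the comments. The helper name `make_image_root` and its `suffix_len` parameter are hypothetical names introduced for illustration; the `pdfimages -png` call is copied from the question. A 3-character random suffix only makes collisions unlikely, it does not rule them out.

```python
import os
import random
import string
from subprocess import call


def make_image_root(pdf_file, suffix_len=3):
    """Return an output root for pdfimages that is unlikely to collide.

    Hypothetical helper: appends a random alphanumeric suffix to the
    PDF's base name so two workers do not share the same image root.
    """
    base = os.path.splitext(os.path.basename(pdf_file))[0]
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choice(alphabet) for _ in range(suffix_len))
    return "{0}_{1}".format(base, suffix)


def extract_images_from_file(pdf_file):
    # Same steps as the worker in the question, but with a per-file root.
    image_root = make_image_root(pdf_file)
    call(["pdfimages", "-png", pdf_file, image_root])
    os.remove(pdf_file)
```

With this change, `pool.map(extract_images_from_file, pdfs)` can stay exactly as it is in the question, because the worker still takes a single argument.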

2015-10-02 22:32:36 strubbly

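I accepted this answer because it effectively solved the problem I was having. I appended a random 3-character alphanumeric code to the root name and it completely alleviated the issue. Thanks! – brittenb

Cool. You're doing fine with multiprocessing. Just remember that the things you call need to be able to run alongside each other; they can conflict when they share resources such as directories or files. – strubbly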

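1) Yes. Pool.map takes a function of one argument, followed by a list whose elements are each passed as the argument to that function.
2) No. As you have written it, everything runs in the original process except the body of extract_images_from_file(). Also, I'd point out that you are using 8 processes, not 8 processors. If you happen to have an 8-core Intel CPU with hyperthreading enabled, you would be able to run 16 processes simultaneously.
3) This looks fine to me, except that if extract_images_from_file() raises an exception, it will blow up your whole Pool, which is probably not what you want. To prevent this, you could wrap it in a try block.

What is the nature of the "haywire" behavior you're dealing with? Can we see the exception text?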
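A minimal sketch of that try-block idea, under some assumptions: `pool.map` re-raises a worker's exception in the parent, so the results of the whole call are lost when one file fails. Wrapping the worker lets each failure come back as data instead. The names `safe_extract` and `extract_all` are hypothetical and not part of the question's code; the body of `extract_images_from_file` is copied from the question.

```python
import os
from multiprocessing import Pool
from subprocess import call


def extract_images_from_file(pdf_file):
    # Worker from the question: runs in a child process of the Pool.
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)


def safe_extract(pdf_file):
    """Hypothetical wrapper: return (pdf_file, error) instead of raising."""
    try:
        extract_images_from_file(pdf_file)
        return pdf_file, None
    except Exception as exc:
        # Return the error as a string so it is always picklable.
        return pdf_file, repr(exc)


def extract_all(pdf_files, processes=8):
    """Map the wrapped worker over all PDFs and return only the failures."""
    pool = Pool(processes=processes)
    try:
        # Pool.map passes each list element as the single argument.
        results = pool.map(safe_extract, pdf_files)
    finally:
        pool.close()
        pool.join()
    return [(name, err) for name, err in results if err is not None]
```

As in the question, `extract_all(pdfs)` would be called from inside the `if __name__ == '__main__':` block, and anything after it, such as dedup_images(), still runs in the original process only.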

2015-10-02 15:56:17 user2993124

I've added the error output to the question. – brittenb
