PDF scraping: how do I automatically create a txt file for each PDF scraped in Python?

Here is what I want to do: a program that takes a list of PDF files as input and returns one .txt file for each file in the list. For example, given listA = ["file1.pdf", "file2.pdf", "file3.pdf"], I want Python to create three txt files (one per PDF), say "file1.txt", "file2.txt" and "file3.txt".

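Thanks to this guy, the conversion part works fine. The only change I made was in the maxpages statement: to extract only the first page, I set it to 1 instead of 0. As I said, this part of my code works perfectly. Here is the code.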

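# Assumes Python 3 with pdfminer.six installed.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # file() exists only in Python 2
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    #maxpages = 0
    maxpages = 1  # extract only the first page (0 means all pages)
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text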

The problem is that I can't get Python to return what I described in the second paragraph. I tried the following code:

def save(lst):
    i = 0

    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lst[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1

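I ran the save function with a list of two PDF files as input, but it generated only one txt file and kept running for several minutes without producing the second one. What is a better way to achieve my goal?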


2015-02-17 iatowks

You never update i, so your code gets stuck in an infinite loop. You need i += 1 inside the loop:

def save(lst):
    i = 0 # starts at 0; in your code it never changed
    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0]) # still always the first element
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1 # you need to increment i inside the while loop

A better option is to simply use range:

def save(lst):
    for i in range(len(lst)):
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)

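You also only ever use lista[0], so every output file gets the text of the first PDF. You probably need to change that as well, so that each iteration moves on to the next item in the list.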

If lst is actually lista, you can use enumerate:

def save(lst):
    for i, ele in enumerate(lst):
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(ele)
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
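
If you would rather have each output file named after its PDF (file1.pdf gives file1.txt, as in the question) instead of enegep0.txt, enegep1.txt and so on, something like the following might work. This is only a minimal sketch assuming Python 3: txt_name_for, save_all and out_dir are names made up for illustration, and the converter is passed in as an argument so that any function mapping a PDF path to a string (such as convert_pdf_to_txt from the question) can be plugged in.

```python
from pathlib import Path


def txt_name_for(pdf_path):
    """Return the .txt file name matching a PDF, e.g. file1.pdf -> file1.txt."""
    return Path(pdf_path).with_suffix(".txt").name


def save_all(pdf_paths, convert, out_dir="."):
    """Write one text file per PDF and return the list of paths written.

    convert is any callable that takes a PDF path and returns its text.
    out_dir (hypothetical parameter) is the folder that receives the txt files.
    """
    out_dir = Path(out_dir)
    written = []
    for pdf_path in pdf_paths:
        # Iterate over the items directly, so each PDF is converted once.
        target = out_dir / txt_name_for(pdf_path)
        target.write_text(convert(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

You would call it as save_all(listA, convert_pdf_to_txt). Looping over the items themselves avoids both the index bookkeeping and the lista[0] mistake.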


2015-02-17 23:02:16

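Sorry, I didn't realize my code had some typos and small mistakes when I posted it here. I have just fixed them. By the way, "lista" means "list" in Portuguese. Edit: the second one works perfectly, thank you very much. – iatowks 2015-02-17 23:31:27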

@iatowks, you still need to use more than lista[0]. Are you sure you have i += 1 in the right place? Try the last snippet I provided – 2015-02-17 23:35:05

I used the third option you wrote in my code, and it gave me the expected result. Again, thank you very much. – iatowks 2015-02-17 23:57:16
