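How do I apply my Python code to all files in a folder at once, and how do I create a new name for each subsequent output file?

The code I am working with takes a .pdf file and outputs a .txt file. My question is: how do I create a loop (probably a for loop) that runs the code over and over on every file in a folder whose name ends in .pdf? Also, how do I change the output each time the loop runs, so that each run writes a new file with the same name as the input file (i.e. 1_pet.pdf > 1_pet.txt, 2_pet.pdf > 2_pet.txt, etc.)?

Here is the code so far: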

path="2_pet.pdf" 
content = getPDFContent(path) 
encoded = content.encode("utf-8") 
text_file = open("Output.txt", "w") 
text_file.write(encoded) 
text_file.close()

Source

2015-07-21 Jack Bunce

Possible duplicate (http://stackoverflow.com/questions/3964681/find-all-files-in-directory-with-extension-txt-with-python)

Create a function that encapsulates what you want to do with each file.

import os.path 

def parse_pdf(filename): 
    "Parse a pdf into text" 
    content = getPDFContent(filename) 
    encoded = content.encode("utf-8") 
    # split off the .pdf extension so that .txt can be added instead
    (root, _) = os.path.splitext(filename)
    # the with block closes the file even if write() raises
    with open(root + ".txt", "w") as text_file:
        text_file.write(encoded)

Then apply this function to a list of filenames, like so:

for f in files: 
    parse_pdf(f)

Source

2015-07-21 17:47:47 ajerneck

This looks like it will work! The problem is that I need files to refer to my directory. Would I do it like this? 'files = "Users/Jack/Downloads/pyPdf-1.13"'

You can use glob to get the files from a directory, as in Rob's answer – ajerneck
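A minimal sketch of how the `files` list could be built from a directory, assuming Python 3. The helper name `list_pdfs` is made up for illustration, and the directory string is the one quoted in the comment above. Note that assigning the directory string to `files` directly would make the for loop iterate over its characters, and that a path without a leading slash is resolved relative to the current working directory.

```python
import glob
import os


def list_pdfs(directory):
    """Return the paths of all files in `directory` whose names end in .pdf."""
    pattern = os.path.join(directory, "*.pdf")
    # sorted() only makes the processing order predictable
    return sorted(glob.glob(pattern))


# Directory taken from the comment above; adjust it to your own folder.
files = list_pdfs("Users/Jack/Downloads/pyPdf-1.13")
for f in files:
    print(f)  # call parse_pdf(f) here instead of printing
```

If the directory does not exist or contains no PDFs, `glob.glob` returns an empty list and the loop simply does nothing.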

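That helped, and it actually works (kind of). I now have the problem that the named text files are being created but they are blank, and when I try to run it on a few hundred files I get the error "pyPdf.utils.PdfReadError: EOF marker not found". Do you have any idea why these are happening? I really appreciate your help!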
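This does not explain why particular files fail, but a batch of a few hundred files is easier to debug if one bad PDF does not stop the whole run. The sketch below (Python 3) wraps each call in try/except and returns the paths that failed. `convert_all` and `fake_convert` are hypothetical names used only for illustration, and catching `Exception` is a deliberately broad assumption; narrowing it to the error type named in the comment above would be cleaner.

```python
def convert_all(paths, convert):
    """Call convert(path) for every path and return the paths that failed.

    `convert` is whatever per-file function you use, e.g. parse_pdf.
    """
    failed = []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # assumption: broad on purpose, narrow it if you can
            print("skipping %s: %s" % (path, exc))
            failed.append(path)
    return failed


def fake_convert(path):
    # Stand-in for a real converter, used only to demonstrate the loop.
    if "broken" in path:
        raise ValueError("EOF marker not found")


print(convert_all(["1_pet.pdf", "broken.pdf", "2_pet.pdf"], fake_convert))
```

The returned list can then be inspected by hand. For the blank outputs, it may be worth checking whether the extraction function returns an empty string for those files before anything is written.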

One way to operate on all PDF files in a directory is to call glob.glob() and iterate over the results:

import glob 
for path in glob.glob('*.pdf'):
    content = getPDFContent(path) 
    encoded = content.encode("utf-8") 
    text_file = open("Output.txt", "w") 
    text_file.write(encoded) 
    text_file.close()

Another way is to let the user specify the files:

import sys 
for path in sys.argv[1:]: 
    ...

The user then runs the script like python foo.py *.pdf.
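One caveat with the command-line approach: Unix shells expand *.pdf before Python sees it, but the Windows command prompt passes the pattern through unchanged. A small sketch (Python 3) that expands patterns inside the script so both cases behave the same; `expand_args` is a made-up helper name, not part of any library.

```python
import glob
import sys


def expand_args(args):
    """Expand any wildcard patterns in `args` into matching file paths."""
    paths = []
    for arg in args:
        matches = glob.glob(arg)
        # keep the argument unchanged if nothing matched, so the
        # "file not found" error surfaces when the file is opened
        paths.extend(sorted(matches) if matches else [arg])
    return paths


if __name__ == "__main__":
    for path in expand_args(sys.argv[1:]):
        print(path)  # run the per-file conversion here
```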

Source

2015-07-21 17:47:15

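I just added this to my code and it ran without returning any errors, but my output file only relates to my first PDF file. Is there a reason it might not run past the first file? Also, how do I change the output during each iteration of the for loop so that it mirrors the file name of the PDF?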
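The loop in this answer opens the same Output.txt on every iteration, so each pass overwrites the previous result and only one file's text survives. A minimal sketch of deriving the output name from the input name instead, assuming Python 3 and assuming the extraction function returns a text string. `convert_each` and `extract_text` are hypothetical names; `extract_text` stands in for getPDFContent.

```python
import os


def convert_each(pdf_paths, extract_text):
    """Write one .txt file per input path and return the paths written."""
    written = []
    for pdf_path in pdf_paths:
        # 1_pet.pdf -> 1_pet.txt, in the same directory as the input
        root, _ext = os.path.splitext(pdf_path)
        txt_path = root + ".txt"
        text = extract_text(pdf_path)
        # the with block closes the file even if write() raises
        with open(txt_path, "w", encoding="utf-8") as out:
            out.write(text)
        written.append(txt_path)
    return written
```

Under those assumptions it could be called as `convert_each(glob.glob('*.pdf'), getPDFContent)`.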

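You can use a recursive function to search the folder and all of its subfolders for files that end in .pdf, then take those files and create a text file for each one.

It could look like this: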

import os 

def convert_PDF(path, func):
    d = os.path.basename(path)
    if os.path.isdir(path):
        # recurse into every entry of the directory
        for x in os.listdir(path):
            convert_PDF(os.path.join(path, x), func)
    elif d[-4:] == '.pdf':
        func(path)

# based entirely on your example code 
def convert_to_txt(path): 
    content = getPDFContent(path) 
    encoded = content.encode("utf-8") 
    file_path = os.path.dirname(path)
    # replace pdf with txt extension
    file_name = os.path.basename(path)[:-4]+'.txt'
    # os.path.join also works when file_path is empty
    text_file = open(os.path.join(file_path, file_name), "w")
    text_file.write(encoded)
    text_file.close()

convert_PDF('path/to/files', convert_to_txt)

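Since the actual operation is passed in as a function, you can swap in whatever operation you need (for example, a function that uses a different library or converts to a different type).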
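The same idea can also be written with os.walk from the standard library, which handles the recursion itself. This is an alternative sketch (Python 3), not the answer's own code; `for_each_pdf` is a made-up name, and unlike the version above it matches the extension case-insensitively.

```python
import os


def for_each_pdf(top, func):
    """Call func(path) for every .pdf file under `top`, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            # lower() makes the check match .PDF as well as .pdf
            if name.lower().endswith(".pdf"):
                func(os.path.join(dirpath, name))


# Usage sketch: collect the paths instead of converting them.
found = []
for_each_pdf("path/to/files", found.append)
print(found)
```

Passing `convert_to_txt` instead of `found.append` would run the conversion on every file found.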

Source

2015-07-21 17:55:46 DFenstermacher

The script below solves your problem: [Find all files in directory with extension .txt with Python]

import os 

sourcedir = 'pdfdir' 

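dl = os.listdir(sourcedir)

for f in dl:
    # splitext copes with names that contain no dot or several dots
    (name, ext) = os.path.splitext(f)
    if ext == ".pdf":
        # join with the directory name, not with the list of entries
        path_in = os.path.join(sourcedir, f)
        content = getPDFContent(path_in)
        encoded = content.encode("utf-8")
        path_out = os.path.join(sourcedir, name + ".txt")
        text_file = open(path_out, 'w')
        text_file.write(encoded)
        text_file.close()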

Source

2015-07-21 17:59:48 Geeocode

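This returns the following error for me: 'dl = os.dirlist("Users/Jack/Downloads/pyPdf-1.13") AttributeError: 'module' object has no attribute 'dirlist''

Sorry, it should be listdir, not dirlist. My mistake, I have fixed it. – Geeocode

Note that, in contrast to the accepted answer, this code also handles looking up the files in a directory. – Geeocode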
