2016-05-21 50 views
-2

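I want to use PDFMiner to extract the text from a PDF into a .txt file. I found the code below on this site, but I don't know how to use it. Can someone help me use it to convert a sample PDF?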

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 
from cStringIO import StringIO 

def convert_pdf_to_txt(path): 
    rsrcmgr = PDFResourceManager() 
    retstr = StringIO() 
    codec = 'utf-8' 
    laparams = LAParams() 
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams) 
    fp = file(path, 'rb') 
    interpreter = PDFPageInterpreter(rsrcmgr, device) 
    password = "" 
    maxpages = 0 
    caching = True 
    pagenos=set() 

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True): 
        interpreter.process_page(page)

    text = retstr.getvalue() 

    fp.close() 
    device.close() 
    retstr.close() 
    return text 
+0

Have you tried running it with PDFMiner? – glls

+0

Yes, I have. It did nothing. – iMiner

+0

And I'm assuming you took the code from here? https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167 – glls

Answers

2

If you are using pdfminer, use the code from their page and read their documentation: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

from cStringIO import StringIO 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 

def convert(fname, pages=None):
    # An empty set tells PDFPage.get_pages to process every page.
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text
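
The function above relies on `cStringIO`, which does not exist in Python 3. As a rough sketch of the same idea on Python 3, assuming the `pdfminer.six` fork is installed (`pip install pdfminer.six`) and using its `pdfminer.high_level.extract_text` helper; the name `convert_py3` is made up for illustration and is not part of the original answer:

```python
import os


def convert_py3(fname, pages=None):
    """Return the text of a PDF as a str (Python 3, pdfminer.six assumed)."""
    # Fail early with a clear error if the path is wrong.
    if not os.path.isfile(fname):
        raise FileNotFoundError(fname)

    # Imported here so that a missing pdfminer.six install only shows up
    # when the function is actually called.
    from pdfminer.high_level import extract_text

    # page_numbers is zero-based; None means "all pages".
    page_numbers = set(pages) if pages else None
    return extract_text(fname, page_numbers=page_numbers)
```

It would be called the same way as `convert`, for example `convert_py3("filename.pdf", pages=[0])` for the first page only.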

I don't think you should have any trouble using it:

def convert(fname, pages=None): it basically converts the PDF for you.

Use it as follows:

some_variable = convert("filename.pdf") 
print(some_variable) 
#do something with your variable 
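
Since the question asks for a .txt file, the returned string still has to be written to disk. A minimal sketch in Python 3, assuming the extracted text is already a `str`; `save_text` and the output file name are hypothetical helpers, not part of pdfminer:

```python
import io


def save_text(text, out_path):
    """Write extracted text to out_path as UTF-8 and return the path."""
    # An explicit encoding avoids platform-dependent defaults.
    with io.open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return out_path
```

With the variable from the usage lines above, this would be `save_text(some_variable, "filename.txt")`.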

Using an example PDF: [image]

+0

This works... sort of. This is the output: "This is the PDF's". The original PDF says "This is a PDF", but Python shows "This is the PDF's". – iMiner

+0

Is the PDF public, as in, are you able to share it? – glls

+0

https://drive.google.com/file/d/0B5eGq9boXZxARWJLX0pDb1RaX2s/view?usp=sharing This is my Google Drive. I think that since I've shared it, you can download it. – iMiner

0

I finally found a way to do this. The best library is PDFMiner; with a few small modifications, pdf2txt.py can be used effectively. pdf2txt.py is located in pdfminer/tools.

Install it from the terminal:

pip install pdfminer

from cStringIO import StringIO 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 
import re 

def convert(fname):
    # An empty set tells PDFPage.get_pages to process every page.
    pagenums = set()

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    print(text)

    # Collapse runs of whitespace, then write the content to a .txt file
    text = re.sub(r"\s\s+", " ", text)
    text_file = open("Output_1.txt", "w")
    text_file.write(text)
    text_file.close()

convert("xyz.pdf")
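
The `re.sub` step above collapses every run of two or more whitespace characters, line breaks included, into a single space, so blank lines between paragraphs are lost while single newlines survive. A small sketch of that behavior, next to a variant that leaves line breaks alone; both helper names are illustrative and not part of pdfminer:

```python
import re


def collapse_whitespace(text):
    """Same substitution as in the answer: 2+ whitespace chars -> one space."""
    return re.sub(r"\s\s+", " ", text)


def collapse_spaces_keep_lines(text):
    """Collapse runs of spaces and tabs only, leaving newlines untouched."""
    return re.sub(r"[ \t]+", " ", text)


sample = "This  is a PDF\n\nSecond   paragraph\nnext line"
print(repr(collapse_whitespace(sample)))
print(repr(collapse_spaces_keep_lines(sample)))
```

Which variant fits depends on whether the layout of the original PDF matters in the resulting .txt file.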