2017-09-14
0

How can I extract a table from a PDF using PDFMiner?

I want to extract information from some tables in a PDF document. Consider this input:

Title 1 
some text some text some text some text some text 
some text some text some text some text some text 

Table Title 
| Col1  | Col2  | Col3  |
|-------|-------|-------|
| val11 | val12 | val13 |
| val21 | val22 | val23 |
| val31 | val32 | val33 |

Title 2 
some more text some more text some more text some more text 
some more text 
some more text some more text some more text some more text 

I can get the outline (the headings) like this:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, se) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print (level, title)

This gives me:

(1, u'Title 1') 
(2, u'Table Title') 
(1, u'Title 2') 

This is perfect, because the levels match the hierarchy of the text. Now I can extract the text as follows:

from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object that performs layout analysis.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt', 'w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            # Replace non-ASCII characters with spaces before writing.
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                         for i in element.get_text()]))
text_from_pdf.close()

This gives me:

Title 1 
some text some text some text some text some text some text some text 
some text some text some text some text some text some text some text 
Table Title 
Col1 
val11 
val12 
val13 
Col2 
val21 
val22 
val23 
Col3 
val31 
val32 
val33 
Title 2 
some more text some more text some more text some more text 
some more text 
some more text some more text some more text some more text 

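It is a bit odd that the table is extracted column by column. Can I get the table row by row instead? Also, how can I determine where a table starts and ends?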

+1

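If you can extract the table column by column and store it in a 2D list (a list of lists), you should be able to transpose it into row-by-row form. This is usually done with the built-in [`zip()`](https://docs.python.org/3/library/functions.html#zip) function. As for finding the end of the table, you will need to see whether you can detect some kind of change in the formatting. – martineau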
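
The transposition described in this comment can be sketched as follows. It assumes the extracted cells have already been grouped into one list per column; the `columns` data is made up to mirror the sample table in the question.

```python
# Hypothetical column-wise data: one inner list per table column.
columns = [
    ["Col1", "val11", "val21", "val31"],
    ["Col2", "val12", "val22", "val32"],
    ["Col3", "val13", "val23", "val33"],
]

# zip(*columns) pairs up the i-th item of every column, which yields rows.
rows = [list(row) for row in zip(*columns)]

for row in rows:
    print(" | ".join(row))
```

Note that `zip()` stops at the shortest column, so a column with a missing cell silently shortens the result; `itertools.zip_longest` is an alternative when columns may be ragged.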

+0

Thanks, but the problem is that I don't know where the table starts. Any heading in my document could introduce a table. How can I tell? – AbtPst

+1

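If the PDF documents all come from a single source, there is probably something distinctive about how the tables are constructed. If you can figure out what that is, your code can watch for it. Unfortunately, I don't think PDF files have any kind of "table" element, so doing something like this may be your only recourse... – martineau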
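
One feature that code could watch for is position. pdfminer layout objects carry a bounding box (`x0`, `y0`, `x1`, `y1`), so text fragments that share roughly the same vertical coordinate can be treated as one table row. The sketch below only illustrates the idea on made-up coordinates; the `group_into_rows` helper, the `tolerance` value, and the sample data are assumptions, not part of PDFMiner.

```python
def group_into_rows(fragments, tolerance=2.0):
    """Group (x0, y0, text) fragments into rows of text, top to bottom.

    A fragment joins the current row when its y0 is within `tolerance`
    of the y0 of that row's first fragment; rows are sorted left to right.
    """
    rows = []  # list of (row_y, [(x0, text), ...])
    # PDF coordinates grow upward, so descending y0 means top to bottom.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up positions imitating a table that was read column by column.
fragments = [
    (50, 700, "Col1"), (50, 680, "val11"), (50, 660, "val21"),
    (150, 700.5, "Col2"), (150, 680.4, "val12"), (150, 660.2, "val22"),
    (250, 699.8, "Col3"), (250, 679.9, "val13"), (250, 660, "val23"),
]

for row in group_into_rows(fragments):
    print(row)
```

With real documents, the fragments could be built from the text lines inside each `LTTextBox`, for example `(line.x0, line.y0, line.get_text().strip())`. A run of consecutive rows with the same number of cells might then serve as a rough signal for where a table starts and ends, though that heuristic is a guess that would need tuning per document source.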

Answer

0

If you only want to extract tables from a PDF document, take a look at this question: How to extract table as text from the PDF using Python?

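Based on the answers to that question, I tried tabula-py, and it worked for me on a PDF whose data table is spread across multiple pages. tabula-py correctly skipped all the page headers and footers. Before that I had tried PDFMiner on this type of document and ran into the same problem you describe, sometimes even worse.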
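
A minimal sketch of the tabula-py approach, assuming tabula-py and a Java runtime are installed. `extract_tables` is a hypothetical wrapper written for this example, and `myFile.pdf` is the placeholder path from the question. In recent tabula-py releases, `tabula.read_pdf` with `multiple_tables=True` returns a list of pandas DataFrames, one per detected table.

```python
def extract_tables(path):
    """Return the tables tabula-py detects in the PDF at `path`."""
    # Imported lazily so this module loads even without tabula-py installed.
    import tabula

    # pages="all" scans every page; multiple_tables=True keeps tables separate.
    return tabula.read_pdf(path, pages="all", multiple_tables=True)


# Example usage (requires tabula-py, Java, and an actual PDF file):
#     for table in extract_tables("myFile.pdf"):
#         print(table.to_string(index=False))
```

Because each table comes back as a DataFrame, the rows are already in row-by-row order and no transposition is needed.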