pypdf does not extract tables from PDF

I am using pypdf to extract text from a PDF file. The problem is that the tables in the PDF file are not extracted. I have also tried pdfminer, but I have the same problem.

2013-07-08 Omair Shamshir

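The problem is that tables in PDFs are usually made up of absolutely positioned lines and characters, and converting these into a sensible table representation is non-trivial.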

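In Python, PDFMiner is probably your best option. It gives you a tree structure of layout objects, but you have to interpret the tables yourself by looking at the positions of lines (LTLine) and text boxes (LTTextBox). There's a little bit of documentation on this.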
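As a rough illustration of what "interpreting the table yourself" can look like, here is a minimal sketch that groups text by its position into rows. It assumes the pdfminer.six fork (its `extract_pages` helper and the `x0`/`y1` coordinates on layout objects). The helper names `iter_page_cells` and `group_into_rows` and the `tolerance` value are made up for this example, and it ignores ruling lines (LTLine) entirely, so treat it as a starting point rather than a general solution.

```python
def iter_page_cells(pdf_path):
    """Yield, per page, a list of (x, y, text) tuples for each text line.

    Assumes pdfminer.six is installed; imported lazily so the grouping
    logic below can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTTextLine

    for page in extract_pages(pdf_path):
        cells = []
        for element in page:
            if isinstance(element, LTTextBox):
                for line in element:
                    if isinstance(line, LTTextLine):
                        # x0 = left edge, y1 = top edge of the text line
                        cells.append((line.x0, line.y1, line.get_text().strip()))
        yield cells


def group_into_rows(cells, tolerance=2.0):
    """Cluster (x, y, text) cells into rows, top to bottom, left to right.

    Cells whose y coordinate is within `tolerance` points of the first
    cell of the current row are treated as belonging to that row.
    """
    rows = []  # list of (row_y, [(x, text), ...])
    # PDF y coordinates grow upwards, so sort descending to read top-down.
    for x, y, text in sorted(cells, key=lambda c: -c[1]):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells by their x position.
    return [[text for _, text in sorted(row, key=lambda c: c[0])]
            for _, row in rows]


if __name__ == "__main__":
    # Hand-written coordinates standing in for what a real PDF might give.
    sample = [(200.0, 679.8, "3"), (50.0, 700.2, "Name"),
              (50.0, 680.1, "Apple"), (200.0, 700.0, "Qty")]
    for row in group_into_rows(sample):
        print(row)
```

With a real file you would call `group_into_rows(cells)` on each list yielded by `iter_page_cells('example.pdf')`. Whether the rows come out cleanly depends heavily on how the PDF was produced, so the tolerance usually needs tuning per document.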

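Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you can do something like the following: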

import urllib.request
import xml.etree.ElementTree as ET

# Send the PDF to the PDFX web service
with open('example.pdf', 'rb') as f:
    pdfdata = f.read()
request = urllib.request.Request('http://pdfx.cs.man.ac.uk', data=pdfdata, headers={'Content-Type': 'application/pdf'})
response = urllib.request.urlopen(request).read()

# Parse the XML response and pull out the parts of each table box
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))


2013-07-12 16:02:06
