I want to extract information from some tables in a PDF document. How can I extract tables from a PDF using PDFMiner?
Consider this input:
Title 1
some text some text some text some text some text
some text some text some text some text some text
Table Title
| Col1 | Col2 | Col3 |
|---------------|---------|---------|
| val11 | val12 | val13 |
| val21 | val22 | val23 |
| val31 | val32 | val33 |
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text
I can get the outlines/titles like this:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, se) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print((level, title))
This gives me:
(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')
This is perfect, since the levels match the hierarchy of the text. Now I can extract the text as follows:
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt', 'w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            # Replace non-ASCII characters with spaces before writing.
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                         for i in element.get_text()]))
text_from_pdf.close()
This gives me:
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val12
val13
Col2
val21
val22
val23
Col3
val31
val32
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text
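It is a bit odd that the table is extracted column by column. Is it possible to get the table row by row? Also, how can I identify where a table begins and ends?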
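If you can extract the table column by column and store it in a 2D list (a list of lists), you should be able to transpose it into row-by-row format. This is commonly done with the built-in [`zip()`](https://docs.python.org/3/library/functions.html#zip) function. As for finding the end of the table, you'll need to see whether you can detect some kind of change in the formatting. – martineau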
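To illustrate the transposition suggested in this comment, here is a minimal sketch. It assumes the column-wise cells have already been collected into a list of lists, one inner list per column of the sample table; the `columns` data and the `transpose` helper are hypothetical names for illustration, not part of PDFMiner.

```python
def transpose(columns):
    """Turn a list of columns into a list of rows using zip()."""
    # zip(*columns) pairs up the i-th item of every column.
    # Note: zip() stops at the shortest column, so ragged input is truncated.
    return [list(row) for row in zip(*columns)]


# Hypothetical column-wise data, one inner list per column of the sample table.
columns = [
    ["Col1", "val11", "val21", "val31"],
    ["Col2", "val12", "val22", "val32"],
    ["Col3", "val13", "val23", "val33"],
]

rows = transpose(columns)
for row in rows:
    print(" | ".join(row))
```

If the columns can have different lengths, `itertools.zip_longest` may be a better fit than `zip()`, since it pads missing cells instead of dropping them.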
Thanks, but the problem is that I don't know where a table starts. Any title in my document could denote a table. How can I tell? – AbtPst
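If the PDF documents all come from a single source, there may be something distinctive about how the tables are constructed. If you can figure out what that is, your code could watch for it. Unfortunately, I don't think PDF files have any kind of "table" element, so doing something like that may be your only recourse... – martineau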
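One distinctive feature that code could watch for is several pieces of text sharing the same vertical position, which plain paragraphs and titles usually do not. The sketch below is only a heuristic under stated assumptions: it works on hypothetical `(x0, y0, text)` tuples with made-up coordinates, and the helpers `group_into_rows` and `table_rows` are illustrative names. With PDFMiner, the coordinates could presumably be taken from each layout element's `bbox`, but whether individual cells come out as separate elements depends on the layout analysis settings.

```python
def group_into_rows(cells, tolerance=2.0):
    """Group (x0, y0, text) cells into rows, ordered top to bottom."""
    rows = []
    # PDF coordinates grow upward, so sort by descending y to read downward.
    for x0, y0, text in sorted(cells, key=lambda c: -c[1]):
        if rows and abs(rows[-1]["y"] - y0) <= tolerance:
            # Close enough to the current row's y: same row.
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})
    # Within each row, order the cells left to right by x0.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def table_rows(rows, min_cells=2):
    """Keep rows with at least min_cells cells as candidate table rows."""
    return [row for row in rows if len(row) >= min_cells]


# Hypothetical positions for part of the sample document.
cells = [
    (72, 700, "Title 1"),
    (72, 650, "Table Title"),
    (72, 630, "Col1"), (172, 630.5, "Col2"), (272, 630, "Col3"),
    (72, 615, "val11"), (172, 615, "val12"), (272, 615, "val13"),
    (72, 580, "Title 2"),
]

all_rows = group_into_rows(cells)
candidates = table_rows(all_rows)
for row in candidates:
    print(row)
```

The first and last multi-cell rows then give a rough start and end for the table. The `tolerance` and `min_cells` values are guesses and would need tuning per document source.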