2017-05-21

What I basically want is the PDF's data arranged under row headings, or to simplify what I'm saying, I want to create a database from a PDF file. Each PDF is 25-40 pages depending on the number of voters. What is the best language to extract the text from the PDF and list it under the row headings?

A page of the PDF file I am talking about

I want to extract the data from the boxes (or whatever you call them) into Access/Excel/SQL so that for each box:

the names appear under a Name column

the relations appear under a Relation column, and so on for the other data

But I don't know which programming language I should learn to do this. I have tried searching for PDFMiner, but I'm not sure whether it can do this task or not.

If you have any suggestions, please let me know.


It might be worth peeking at the source of the PDF file; if it was created programmatically (as it almost certainly was), there's a good chance the page source is laid out in a regular, parseable way. I.e. don't think of it as a .pdf, think of it as a structured text file. –
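As a minimal sketch of what "peeking at the source" means: a programmatically generated PDF typically stores its page content as Flate-compressed streams between `stream`/`endstream` markers, which you can find and inflate with nothing but the standard library. The byte string below is a tiny stand-in for a real file's contents, not an actual PDF.

```python
import zlib

# stand-in for: raw = open("file.pdf", "rb").read()
payload = zlib.compress(b"BT (Hello) Tj ET")
raw = (b"%PDF-1.4\n<< /Filter /FlateDecode >>\nstream\r\n"
       + payload + b"\r\nendstream\n%%EOF")

# a Flate-compressed stream sits between "stream" and "endstream" markers
start = raw.find(b"stream\r\n") + len(b"stream\r\n")
end = raw.find(b"\r\nendstream", start)
print(zlib.decompress(raw[start:end]))  # b'BT (Hello) Tj ET'
```

If the inflated bytes decode as text and contain PostScript-style operators like `Tj`, the file is regular enough to parse directly, which is exactly the approach the accepted answer below takes.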


...open the .pdf file with a text editor instead of Acrobat. Would it be possible to email me a sample file (preferably a short one)? –


Done, sent. –

Answers

1

A lot more fiddling than I expected, but it works:

import csv  # spreadsheet output
import re   # pattern matching
import sys  # command-line arguments
import zlib # decompression

# find deflated sections
PARENT = b"FlateDecode"
PARENTLEN = len(PARENT)
START = b"stream\r\n"
STARTLEN = len(START)
END = b"\r\nendstream"
ENDLEN = len(END)

# find output text in PostScript Tj and TJ fields
PS_TEXT = re.compile(r"(?<!\\)\((.*?)(?<!\\)\)")

# return desired per-person records
RECORD = re.compile(r"Name : (.*?)Relation : (.*?)Address : (.*?)Age : (\d+)\s+Sex : (\w?)\s+(\d+)", re.DOTALL)

def get_streams(byte_data):
    streams = []
    start_at = 0
    while True:
        # find block containing compressed data
        p = byte_data.find(PARENT, start_at)
        if p == -1:
            # no more streams
            break
        # find start of data
        s = byte_data.find(START, p + PARENTLEN)
        if s == -1:
            raise ValueError("Found parent at {} bytes with no start".format(p))
        # find end of data
        e = byte_data.find(END, s + STARTLEN)
        if e == -1:
            raise ValueError("Found start at {} bytes but no end".format(s))
        # unpack compressed data
        block = byte_data[s + STARTLEN:e]
        unc = zlib.decompress(block)
        streams.append(unc)
        start_at = e + ENDLEN
    return streams

def depostscript(text):
    out = []
    for line in text.splitlines():
        if line.endswith(" Tj"):
            # new output line
            s = "".join(PS_TEXT.findall(line))
            out.append(s)
        elif line.endswith(" TJ"):
            # continued output line
            s = "".join(PS_TEXT.findall(line))
            out[-1] += s
    return "\n".join(out)

def main(in_pdf, out_csv):
    # load .pdf file into memory
    with open(in_pdf, "rb") as f:
        pdf = f.read()

    # get content of all compressed streams
    # NB: sample file results in 32 streams
    streams = get_streams(pdf)

    # we only want the streams which contain text data
    # NB: sample file results in 22 text streams
    text_streams = []
    for stream in streams:
        try:
            text = stream.decode()
            text_streams.append(text)
        except UnicodeDecodeError:
            pass

    # of the remaining blocks, we only want those containing the text '(Relation : '
    # NB: sample file results in 4 streams
    text_streams = [t for t in text_streams if '(Relation : ' in t]

    # consolidate target text
    # NB: sample file results in 886 lines of text
    text = "\n".join(depostscript(ts) for ts in text_streams)

    # pull desired data from text
    records = []
    for name, relation, address, age, sex, num in RECORD.findall(text):
        name = name.strip()
        relation = relation.strip()
        t = address.strip().splitlines()
        code = t[-1]
        address = " ".join(t[:-1])
        age = int(age)
        sex = sex.strip()
        num = int(num)
        records.append((num, code, name, relation, address, age, sex))

    # save results as csv
    with open(out_csv, "w", newline='') as outf:
        wr = csv.writer(outf)
        wr.writerows(records)

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python {} input.pdf output.csv".format(sys.argv[0]))
    else:
        main(sys.argv[1], sys.argv[2])

When run at the command line like

python myscript.py voters.pdf voters.csv 

it produces a .csv spreadsheet like

(screenshot of the resulting .csv spreadsheet)
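To see what the two regular expressions above are doing, here is a quick illustration; the `Tj` line and the record text are made-up stand-ins for the real file's content, not actual data from it.

```python
import re

# same patterns as in the script above
PS_TEXT = re.compile(r"(?<!\\)\((.*?)(?<!\\)\)")
RECORD = re.compile(r"Name : (.*?)Relation : (.*?)Address : (.*?)"
                    r"Age : (\d+)\s+Sex : (\w?)\s+(\d+)", re.DOTALL)

# PS_TEXT pulls the literal strings out of a PostScript-style Tj line
line = r"(Name : ) (RAM) Tj"
print(PS_TEXT.findall(line))  # ['Name : ', 'RAM']

# RECORD splits the consolidated text into per-person fields;
# the last line of the Address capture becomes the voter-ID code
text = "Name : RAM Relation : FATHER Address : 12 MAIN ROAD\nDL123\nAge : 42  Sex : M  7"
print(RECORD.findall(text))
# [('RAM ', 'FATHER ', '12 MAIN ROAD\nDL123\n', '42', 'M', '7')]
```

The negative lookbehinds `(?<!\\)` in `PS_TEXT` make sure escaped parentheses `\(` `\)` inside a string literal don't terminate the match, and `re.DOTALL` lets the Address field span multiple lines.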

0

I did some college work where I needed to read a PDF and extract some data from it. I used PyPDF2 at the time. I ran into character-encoding problems when extracting the data, while the people using Java (I can't say which library they used) didn't have that kind of problem.

My suggestion is that you try Java, or a 'pypdf' library other than PyPDF2.

When it comes to handling 'string' data, though, I think Python is the best choice.

There is one more thing you should consider: if you lack programming experience, Python is a great language to start with, and Java is a bit scary.

0

PyMuPDF makes it fairly easy. I processed a similar page at http://ceodelhi.gov.in/WriteReadData/AssemblyConstituency4/AC13/AC0130022.pdf and applied the library in the following way to get HTML that can be parsed with BeautifulSoup or lxml.

>>> import fitz 
>>> doc = fitz.open('AC0130022.pdf') 
>>> page = doc.loadPage(3) 
>>> text = page.getText(output='html') 
>>> len(text) 
52807 
>>> open('page3.html','w').write(text) 
52807 

There is a tutorial for PyMuPDF at https://pythonhosted.org/PyMuPDF/tutorial.html.
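Since this answer leaves the HTML-parsing step to BeautifulSoup or lxml, here is a dependency-free sketch of that step using the standard library's html.parser instead. The HTML snippet is a made-up stand-in for what page3.html might contain, and the choice of `<span>` as the text-bearing tag is an assumption about PyMuPDF's HTML output.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the stripped text content of every <span> element."""
    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span and data.strip():
            self.texts.append(data.strip())

# stand-in for: html = open("page3.html").read()
html = "<div><span>Name : RAM</span><span>Age : 42</span></div>"
parser = TextCollector()
parser.feed(html)
print(parser.texts)  # ['Name : RAM', 'Age : 42']
```

From a list like `parser.texts`, the same `Name : ` / `Relation : ` labels can then be matched up with their values and written out to CSV.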