2012-06-13

pyPdf ignores newlines in PDF file

I'm trying to extract each page of a PDF as a string:

import pyPdf 

pages = [] 
pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb')) 
for i in range(0, pdf.getNumPages()): 
    this_page = pdf.getPage(i).extractText() + "\n" 
    this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split()) 
    pages.append(this_page.encode("ascii", "xmlcharrefreplace")) 
for page in pages: 
    print '*' * 80 
    print page 

But this script ignores newline characters, leaving me with messy strings like "information concerning an individual which, because of name, identifyingnumber, mark or description" (i.e., this should read "identifying number", not "identifyingnumber").

Here's an example of the type of PDF I'm trying to parse.

Answers


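I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what's going on: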

def extractText(self):
    [...]
    for operands, operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i

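If the operator is Tj or TJ (it's Tj in your example PDF), the text is simply appended and no newline is added. Now, you wouldn't necessarily want to add a newline, at least if I'm reading the PDF reference right: Tj and TJ are just the single and multiple show-string operators, and a separator of any kind between strings isn't mandatory.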
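To see why the words fuse, here is a toy sketch of the same loop outside pyPdf. It is not pyPdf code: the `operations` list is hand-written sample data, the split of the sentence into two Tj calls with a Td (move text position) in between is an assumption about how such a page might be laid out, and plain strings stand in for TextStringObject.

```python
# Toy stand-in for content.operations: (operands, operator) pairs.
# The data below is invented for illustration, not read from the PDF.
operations = [
    (["because of name, identifying"], "Tj"),
    ([0, -14], "Td"),  # assumed: the line break is only a position change
    (["number, mark or description"], "Tj"),
]


def join_shown_text(operations):
    """Concatenate Tj operands with no separator, as the loop above does."""
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            text += operands[0]
        elif operator == "T*":
            text += "\n"
        # Every other operator, including Td, is ignored.
    return text


print(join_shown_text(operations))
# prints: because of name, identifyingnumber, mark or description
```

This loop is only an illustration; the real logic is the extractText method in pdf.py quoted above.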

Anyway, if you modify extractText along these lines:

def extractText(self, Tj_sep="", TJ_sep=""):
    [...]
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text
        [...]
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i

then the default behaviour should be the same:

In [1]: pdf.getPage(1).extractText()[1120:1250] 
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv' 

but you can change it when you want to:

In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250] 
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily ' 

In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250] 
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily ' 

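Alternatively, you could simply add the separators yourself by modifying the operands in place, but that could break something else (methods like get_original_bytes make me nervous).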

Finally, if you don't want to edit pdf.py itself, you could simply pull this method out into a function.
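A minimal sketch of what such a pulled-out function might look like. The name `extract_text` is hypothetical, and so that the sketch runs on its own it takes a list of (operands, operator) pairs directly and treats plain `str` values as the text operands. With pyPdf you would pass the page's `content.operations` and test for TextStringObject instead; obtaining those operations from a page is left out here.

```python
def extract_text(operations, Tj_sep="", TJ_sep=""):
    """Rebuild text from (operands, operator) pairs with optional separators.

    Mirrors the patched extractText loop; plain strings stand in for
    pyPdf's TextStringObject (an assumption made for this sketch).
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            if isinstance(operands[0], str):
                text += Tj_sep + operands[0]
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            if isinstance(operands[0], str):
                text += operands[0]
        elif operator == '"':
            if isinstance(operands[2], str):
                text += "\n" + operands[2]
        elif operator == "TJ":
            for item in operands[0]:
                # Numbers in a TJ array are spacing adjustments; skip them.
                if isinstance(item, str):
                    text += TJ_sep + item
    return text


# Hand-written sample data, not taken from the PDF in the question.
sample = [
    (["identifying"], "Tj"),
    (["number, mark"], "Tj"),
    ([[" or ", -250, "description"]], "TJ"),
]

print(repr(extract_text(sample)))
print(repr(extract_text(sample, Tj_sep=" ")))
```

With the default empty separators the result matches the unpatched behaviour, so existing callers would not notice the change; passing `Tj_sep=" "` puts a space before every Tj string, including the first one, so you may want to strip the result.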


pyPdf isn't really made for this kind of text extraction. Try pdfminer, or use pdftotext or something similar if you don't mind spawning another process.
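For the pdftotext route, a rough sketch of calling it from Python might look like the following. It assumes the pdftotext command-line tool is installed and on the PATH, that its output is UTF-8, and that it ends each page with a form feed character (its usual behaviour unless page breaks are turned off). The function names are made up for this example.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def split_pages(text):
    """Split pdftotext output into one string per page (form feed separated)."""
    return [page for page in text.split("\f") if page]


def extract_with_pdftotext(pdf_path):
    """Run pdftotext in a separate process and return a list of page strings."""
    raw = subprocess.check_output(pdftotext_command(pdf_path))
    # Assumed encoding; undecodable bytes are replaced rather than raising.
    return split_pages(raw.decode("utf-8", "replace"))


# Usage (needs pdftotext installed and the file from the question present):
#     pages = extract_with_pdftotext("g-reg-101.pdf")
```

The `-layout` option asks pdftotext to keep the physical layout of the page, which tends to preserve line breaks; drop it if you prefer text in reading order.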