pyPdf ignores line breaks in PDF file

I'd like to extract each page of a PDF as a string:

import pyPdf 

pages = [] 
pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb')) 
for i in range(0, pdf.getNumPages()): 
    this_page = pdf.getPage(i).extractText() + "\n" 
    this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split()) 
    pages.append(this_page.encode("ascii", "xmlcharrefreplace")) 
for page in pages: 
    print '*' * 80 
    print page

But this script ignores line breaks, leaving me with messy strings like information concerning an individual which, because of name, identifyingnumber, mark or description (i.e., this should read identifying number, not identifyingnumber).

Here's an example of the type of PDF I'm trying to parse.


2012-06-13 Joe Mornin

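I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what's going on: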

def extractText(self):
    [...]
    for operands, operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i

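If the operator is Tj or TJ (it's TJ in your example PDF), the text is simply appended and no newline is added. Now, you wouldn't necessarily want to add a newline, at least if I'm reading the PDF reference right: Tj and TJ are just the single and multiple show-string operators, and the presence of a separator of some kind isn't mandatory.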
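To make that concrete, here is a small sketch of what a TJ operand can look like. The fragment values and the -250 adjustment are made up for illustration and are not taken from the example PDF. A TJ array interleaves strings with numeric position adjustments (in thousandths of a text-space unit), so the visible gap between two words may exist only as a number, which the loop above skips.

```python
# Hypothetical operand of one TJ operation: string fragments interleaved
# with numeric position adjustments. A negative number widens the gap.
tj_operand = ["identifying", -250, "number, mark or description"]

# Keep only the string fragments, as the TJ branch of extractText does.
# (Plain str stands in for pyPdf's TextStringObject here.)
fragments = [item for item in tj_operand if isinstance(item, str)]

# Joining with no separator glues the two words together.
joined = "".join(fragments)
print(joined)
```

The numeric adjustment is the only trace of the space, which is why an optional separator is one way to get it back.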

Anyway, if you modify extractText to be something like this

def extractText(self, Tj_sep="", TJ_sep=""):

[...]

        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text

[...]

        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i

then the default behaviour should be the same:

In [1]: pdf.getPage(1).extractText()[1120:1250] 
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'

but you can change it when you want to:

In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250] 
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '

or

In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250] 
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '

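Alternatively, you could add the separators yourself by modifying the operands in place, but that might break something else (methods like get_original_bytes make me nervous).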

Finally, if you don't want to edit pdf.py itself, you could simply pull this method out into a function.
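A minimal sketch of such a pulled-out function follows. It assumes you already have the (operands, operator) pairs that extractText iterates over as content.operations. The name extract_text_from_operations is hypothetical, and a check against plain string types stands in for pyPdf's TextStringObject so the example runs without pyPdf installed. The sample operations are invented: two Tj strings with a Td move between them, which the loop ignores.

```python
try:
    text_types = (basestring,)  # Python 2, which pyPdf targets
except NameError:
    text_types = (str,)  # Python 3


def extract_text_from_operations(operations, Tj_sep="", TJ_sep=""):
    """Join the strings shown by a sequence of (operands, operator) pairs."""
    text = u""
    for operands, operator in operations:
        if operator == "Tj":
            if isinstance(operands[0], text_types):
                text += Tj_sep + operands[0]
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            if isinstance(operands[0], text_types):
                text += operands[0]
        elif operator == '"':
            if isinstance(operands[2], text_types):
                text += "\n" + operands[2]
        elif operator == "TJ":
            for item in operands[0]:
                # Numbers in a TJ array are position adjustments, not text.
                if isinstance(item, text_types):
                    text += TJ_sep + item
    return text


# Hypothetical operations: the Td in the middle moves to the next line,
# but no branch above handles it, so it leaves no trace in the output.
sample_operations = [
    (["because of name, identifying"], "Tj"),
    ([0, -14], "Td"),
    (["number, mark or description"], "Tj"),
]

default_text = extract_text_from_operations(sample_operations)
spaced_text = extract_text_from_operations(sample_operations, Tj_sep=" ")
```

With the defaults the two strings run together as identifyingnumber. With Tj_sep=" " every Tj string gets a leading space, including the first one, so you may want to strip the result.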


2012-06-19 18:55:26 DSM

pyPdf isn't really made for this kind of text extraction. Try pdfminer (or use pdftotext or something similar if you don't mind creating another process).
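For the pdftotext route, a rough sketch could look like this. It assumes the pdftotext command-line tool is installed and on your PATH, that its output is UTF-8 (its usual default), and that pages are separated by form feed characters. The helper names are made up for this example.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    command.extend([pdf_path, "-"])  # "-" as the output file means stdout
    return command


def split_pages(text):
    """Split pdftotext output into pages on form feed characters."""
    pages = text.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def pdf_to_pages(pdf_path, keep_layout=True):
    """Run pdftotext in a child process and return one string per page."""
    output = subprocess.check_output(build_pdftotext_command(pdf_path, keep_layout))
    return split_pages(output.decode("utf-8", "replace"))
```

Calling pdf_to_pages('g-reg-101.pdf') would then give one string per page, with line breaks kept as pdftotext emits them.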


2012-06-26 14:27:18 Steven
