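I know very little about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what is going on: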
def extractText(self):
    [...]
    for operands,operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
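If the operator is Tj or TJ (it's Tj in your example PDF), the text is simply appended and no newline is added. You wouldn't necessarily want a newline added, at least if I'm reading the PDF reference right: Tj and TJ are simply the single and multiple show-string operators, and the presence of a separator of some kind is not mandatory.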
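To make the effect concrete, here is a minimal sketch in plain Python with no PDF library involved. The operations list is made-up data that only mimics the (operands, operator) pairs of content.operations, and plain str stands in for TextStringObject; both are assumptions for illustration.

```python
# Hypothetical stand-in for content.operations: (operands, operator) pairs.
# A real content stream would hold TextStringObject instances, not plain str,
# and would contain positioning operators between these entries.
operations = [
    (["because of name, identifying"], "Tj"),  # string shown on one line
    (["number, mark or description"], "Tj"),   # string shown on the next line
    ([["Hel", -20, "lo"]], "TJ"),              # TJ array: strings mixed with spacing numbers
]

text = ""
for operands, operator in operations:
    if operator == "Tj":
        # Appended as-is: nothing marks where one shown string ends.
        text += operands[0]
    elif operator == "TJ":
        # Only the string elements are text; the numbers adjust positions.
        text += "".join(i for i in operands[0] if isinstance(i, str))

print(text)
```

The two Tj strings come out run together as "identifyingnumber", which is the same symptom as in the extracted page text.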
Anyway, if you modify this code to be something like
def extractText(self, Tj_sep="", TJ_sep=""):
    [...]
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text
    [...]
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i
then the default behaviour should be the same:
In [1]: pdf.getPage(1).extractText()[1120:1250]
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'
but you can change it when you want to:
In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '
or
In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '
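Alternatively, you could simply add the separators yourself by modifying the operands themselves in place, but that could break something else (methods like get_original_bytes make me nervous).
Finally, if you don't want to edit pdf.py itself, you could simply pull this method out into a function.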
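A rough sketch of what such a pulled-out function might look like is below. It is not the library's code: the name extract_text and the text_type parameter are hypothetical, and the function takes the (operands, operator) pairs directly so that it can run without a PDF at hand. With the real library you would pass content.operations and TextStringObject instead.

```python
def extract_text(operations, Tj_sep="", TJ_sep="", text_type=str):
    """Concatenate shown strings from (operands, operator) pairs.

    Mirrors the patched extractText loop. `operations` and `text_type`
    are assumptions made so the sketch is self-contained.
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, text_type):
                text += Tj_sep + _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, text_type):
                text += _text
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, text_type):
                text += "\n" + _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, text_type):
                    text += TJ_sep + i
    return text


# Made-up operations, only to exercise the function.
ops = [
    (["identifying"], "Tj"),
    (["number"], "Tj"),
    ([["ma", -15, "rk"]], "TJ"),
]
print(repr(extract_text(ops)))
print(repr(extract_text(ops, Tj_sep=" ")))
print(repr(extract_text(ops, Tj_sep="\n")))
```

Note that the separator is added before every shown string, the first one included, so you may want to strip the result. Building the content stream for a real page is left out here; that part would have to follow whatever the elided portion of extractText does.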