2016-02-23 32 views

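Input coordinates in pdfminer and get the result

I want to extract text with pdfminer by giving it coordinates. I searched the internet but could not find any relevant documentation or code. So far, the only code I have found extracts the text and outputs its coordinates.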

LTTextBoxHorizontal 
(317.564, 91.32756, 580.93228, 116.24235999999999) 
SHOULD ANY OF THE ABOVE DESCRIBED POLICIES BE CANCELLED BEFORE 
THE EXPIRATION DATE THEREOF, NOTICE WILL BE DELIVERED IN 
ACCORDANCE WITH THE POLICY PROVISIONS. 

This is one of the coordinate and text outputs I have obtained. I also tried pdfquery, but I got a lot of errors.

  File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 268, in __call__
    result = self._copy(*args, parent=self, **kwargs)
  File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 253, in _copy
    return self.__class__(*args, **kwargs)
  File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 239, in __init__
    xpath = self._css_to_xpath(selector)
  File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 249, in _css_to_xpath
    return self._translator.css_to_xpath(selector, prefix)
  File "build\bdist.win32\egg\cssselect\xpath.py", line 192, in css_to_xpath
  File "build\bdist.win32\egg\cssselect\parser.py", line 355, in parse
  File "build\bdist.win32\egg\cssselect\parser.py", line 370, in parse_selector_group
  File "build\bdist.win32\egg\cssselect\parser.py", line 378, in parse_selector
  File "build\bdist.win32\egg\cssselect\parser.py", line 437, in parse_simple_selector
  File "build\bdist.win32\egg\cssselect\parser.py", line 535, in parse_attrib
cssselect.parser.SelectorSyntaxError: Expected string or ident, got <NUMBER '1' at 14>

Can someone help me?

Answer

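This happens when the pageid value in the selector is not quoted. The parser expects a string or identifier after `pageid=`, so a bare number such as 1 raises the SelectorSyntaxError shown in the question. If the selector sits inside a single-quoted string, escape the quotes.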

Try:

LTPage[pageid=\'1\'] 
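To avoid quoting mistakes, the selector strings can be built by small helpers. The sketch below is only an illustration: `page_selector` and `in_bbox_selector` are hypothetical helper names, not part of pdfquery. It assumes pdfquery's `:in_bbox("x0, y0, x1, y1")` pseudo-selector, which matches elements that fit entirely inside the given box, as the way to query by coordinates.

```python
def page_selector(pageid):
    """Return an LTPage selector with the pageid value quoted.

    Quoting the value means the CSS parser sees a string rather than a
    bare number, which is what triggered the SelectorSyntaxError.
    """
    return 'LTPage[pageid="%s"]' % pageid


def in_bbox_selector(tag, x0, y0, x1, y1):
    """Return a selector for `tag` elements lying inside the given box.

    Assumes pdfquery's :in_bbox pseudo-selector; coordinates are in PDF
    points as (x0, y0, x1, y1).
    """
    return '%s:in_bbox("%s, %s, %s, %s")' % (tag, x0, y0, x1, y1)


print(page_selector(1))
# LTPage[pageid="1"]
print(in_bbox_selector("LTTextBoxHorizontal", 317, 91, 581, 117))
# LTTextBoxHorizontal:in_bbox("317, 91, 581, 117")
```

With pdfquery installed and a document loaded, these strings would typically be passed to `pdf.pq(...)`. The box used here is simply the one from the question rounded outward.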

You genius, you! – Johnson
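If you would rather stay with pdfminer, one possible approach is to keep the coordinates it already prints and filter on them yourself. This is a minimal sketch in plain Python, assuming each extracted item has been collected as a `(bbox, text)` pair. With pdfminer layout objects such as LTTextBoxHorizontal these would come from `.bbox` and `.get_text()`. The names `inside` and `text_in_region`, the second sample item, and the query region are all hypothetical.

```python
def inside(inner, outer):
    """Return True if box `inner` lies entirely within box `outer`.

    Boxes are (x0, y0, x1, y1) tuples in PDF points.
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def text_in_region(items, region):
    """Return the text of every (bbox, text) item fully inside `region`."""
    return [text for bbox, text in items if inside(bbox, region)]


items = [
    # Box and first text line taken from the output in the question.
    ((317.564, 91.32756, 580.93228, 116.24235999999999),
     "SHOULD ANY OF THE ABOVE DESCRIBED POLICIES BE CANCELLED BEFORE"),
    # Hypothetical second item, added only to show filtering.
    ((50.0, 700.0, 200.0, 720.0), "HYPOTHETICAL HEADER"),
]

region = (300.0, 80.0, 600.0, 130.0)  # hypothetical query rectangle
matches = text_in_region(items, region)
print(matches)
```

Containment is the strictest test. Replacing `inside` with an overlap check would also return boxes that only partly fall within the region.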