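Preface:
I followed this guide to extract data from an lxml tree.
Unfortunately, it doesn't fully work, so I can't extract the data I want from the lxml tree. I'm not particularly interested in this specific case; I'm looking for a more general answer.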
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pycoders.com/archive/'
# This does the magic: loads everything
r = Render(url)
#result is a QString.
result = r.frame.toHtml()
#QString should be converted to string before processed by lxml
formatted_result = str(result.toAscii())
#Next build lxml tree from formatted_result
tree = html.fromstring(formatted_result)
The guide then continues with:
archive_links = tree.xpath('//divass="campaign"]/a/@href')
This results in an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src\lxml\lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:59353)
  File "src\lxml\xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:171227)
  File "src\lxml\xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:170184)
lxml.etree.XPathEvalError: Invalid expression
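For reference, the expression from the guide is not valid XPath: it has a closing `]` with no opening `[`, which would explain the "Invalid expression" error. Below is a minimal sketch, assuming the intended expression was an attribute predicate such as `//div[@class="campaign"]/a/@href`. The HTML snippet and its example.com URLs are made up for illustration and are not taken from the real archive page.

```python
from lxml import html

# Hypothetical markup standing in for the rendered archive page.
sample = """
<html><body>
  <div class="campaign"><a href="http://example.com/issue-1">Issue 1</a></div>
  <div class="campaign"><a href="http://example.com/issue-2">Issue 2</a></div>
  <div class="other"><a href="http://example.com/skip">Skip</a></div>
</body></html>
"""

tree = html.fromstring(sample)

# Assumed intended expression: the href attribute of every <a> that is a
# direct child of a <div> whose class attribute is exactly "campaign".
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')

for link in archive_links:
    print(link)
```

Note that this only shows the expression parsing and matching on static markup. Whether it finds anything on the real page depends on the links actually being present in the HTML that was handed to lxml.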
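Question
To access my data, I still need to use the correct XPath. For testing purposes, I tried title = tree.xpath('//title').
This leaves me with an <element title at 0xdf418> object. I can't extract the data from this object, i.e. the title in this case.
I've tried several things, but none of them actually returns the data.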
>>> title.__len__()
1
>>> title.__sizeof__()
72
>>> type(title)
<type 'list'>
>>> title[0]
<element title at 0xdfc418>
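As the session above shows, xpath() returns a list of element objects, not strings. A small sketch of the usual ways to get at the text with lxml follows; the HTML string is a made-up stand-in for the rendered page, so the title value is an assumption and not the real page title.

```python
from lxml import html

# Hypothetical stand-in for the rendered page source.
page = "<html><head><title>Example Archive</title></head><body></body></html>"
tree = html.fromstring(page)

# xpath() on an element expression returns a list of Element objects.
title = tree.xpath('//title')

# Option 1: read the text stored directly inside the first matched element.
title_text = title[0].text

# Option 2: collect the text of the element and all of its descendants.
title_full = title[0].text_content()

# Option 3: let XPath return the text nodes directly (a list of strings).
title_nodes = tree.xpath('//title/text()')

print(title_text)
print(title_full)
print(title_nodes)
```

The same pattern applies in general: select elements and read `.text` or `.text_content()`, or end the expression with `/text()` or `/@attribute` so that XPath returns strings instead of elements.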
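That syntax makes more sense, but unfortunately it returns 'archive_links = []'.
@MitchellvanZuylen, that's because you are getting only the initial page source, which doesn't contain the links yet. You need to wait until the JavaScript has finished executing – Andersson
According to the guide, the Render class waits for the JS to execute. Did I misunderstand the guide, is the guide wrong, or am I missing something about the Render class?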