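Scraping a JavaScript page with PyQt5 and QWebEngineView

I am trying to render a JavaScript-driven web page into populated HTML for scraping. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it to work. By the way, I am new to Python and basically at the cut/paste/experiment stage. I got past the installation and indentation issues, but now I'm stuck.

In the test code below, print(sample_html) works and returns the raw HTML of the target page, but print(render(sample_html)) always returns the word 'None'.

Interestingly, if you run it against amazon.com, they detect that it isn't a real browser and return a warning about automated access. The other test pages, however, return real HTML that should render, except it doesn't.

How do I fix the result always being 'None'?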
def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

            return Render(source_html).html

import requests
#url = 'http://webscraping.com'
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))
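Edit: Thanks for the responses, which I have incorporated into the code. Now, however, it returns an error and the script hangs until I kill the Python launcher, which then causes a segfault.

Here is the revised code: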
def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            #self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

#url = 'http://webscraping.com'
#url='http://www.amazon.com'
url="https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))
It throws these errors:
$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
  File "fees-pkg-v2.py", line 30, in _callable
    self.html = data
AttributeError: 'method' object has no attribute 'html'
None (hangs here until force-quit python launcher)
Segmentation fault: 11
$
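I have started reading up on Python classes so that I fully understand what I'm doing (always a good thing). I suspect that something in my environment may be the problem (OS X Yosemite, Python 3.4.3, Qt 5.4.1, sip-4.16.6). Any other suggestions?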
It looks like the return statement of `render` is not indented correctly; it should be at the same level as the class above it. – PRMoureu
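To illustrate the indentation point in the comment above, here is a small pure-Python sketch with no Qt involved. The names `render_broken`, `render_fixed`, and `Holder` are hypothetical and exist only for this illustration. When the `return` line is indented into a method of the nested class, the outer function has no return statement of its own, so calling it yields `None`.

```python
def render_broken(text):
    class Holder:
        def __init__(self, text):
            self.value = text

        def shout(self):
            self.value = self.value.upper()
            # Mis-indented: this return is part of shout(), so
            # render_broken() never instantiates Holder and has no
            # return statement of its own.
            return Holder(text).value


def render_fixed(text):
    class Holder:
        def __init__(self, text):
            self.value = text

        def shout(self):
            self.value = self.value.upper()

    # Same level as the class statement: this belongs to render_fixed().
    return Holder(text).value


print(render_broken("page"))
print(render_fixed("page"))
```

The first call prints `None` because the function body only defines the class and then falls off the end; the second returns the stored value.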
Let `QWebEngineView` do all the work for you; there is no need to use `requests`. QWebEngineView has a [load](http://doc.qt.io/qt-5/qwebengineview.html#load) method that takes a URL.
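A minimal sketch of the approach this comment describes, assuming PyQt5 with its QtWebEngine module is installed. The function name `render_url` and its inner callbacks are hypothetical. It replaces the subclass with plain callbacks rather than reproducing the asker's class, and it has not been checked against the asker's environment, where the segfault may have a separate cause.

```python
import sys


def render_url(url):
    """Load url in a QWebEngineView and return the rendered HTML (or None)."""
    # Imported lazily so that merely defining the function does not need Qt.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    # Reuse an existing application object if one was already created.
    app = QApplication.instance() or QApplication(sys.argv)
    view = QWebEngineView()
    result = {"html": None}

    def store_html(html):
        # toHtml() is asynchronous; this runs once the markup is available.
        result["html"] = html
        app.quit()

    def on_load_finished(ok):
        if ok:
            view.page().toHtml(store_html)
        else:
            # Loading failed: leave the result as None and stop the loop.
            app.quit()

    view.loadFinished.connect(on_load_finished)
    view.load(QUrl(url))
    app.exec_()  # blocks until app.quit() is called in a callback
    return result["html"]


# Example call (needs PyQt5 and network access):
# print(render_url("https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"))
```

Keeping the result in a local dictionary avoids setting attributes on the Qt object before its base class has been initialised, which is one possible (unconfirmed) source of trouble in the class-based version.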
Amazon will detect that you are a scraper on your first hit unless you spoof your request headers. You can use something like https://pypi.python.org/pypi/fake-useragent
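As a rough sketch of the header spoofing mentioned above, the following uses only the standard library's `urllib.request` instead of the fake-useragent package. The `BROWSER_UA` string and the `build_request` helper are illustrative assumptions, and a browser-like header alone may not be enough to avoid detection.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, chosen only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# The request is only constructed here; nothing is sent over the network.
# To fetch the page: urllib.request.urlopen(req).read()
```

With `requests`, the same idea is to pass a `headers` dictionary to `requests.get`.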