Scraping a JavaScript page with PyQt5 and QWebEngineView

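I'm trying to render a JavaScript-driven web page into populated HTML for scraping. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it to work. By the way, I'm new to Python, basically at the cut/paste/experiment stage. I got past the installation and indentation issues, but now I'm stuck.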

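In the test code below, print(sample_html) works and returns the raw HTML of the target page, but print(render(sample_html)) always returns the word 'None'.

Interestingly, if you run it against amazon.com, they detect that it is not a real browser and return a warning about automated access. The other test pages, however, return real HTML that should render, except that it doesn't.

How do I troubleshoot the result always being 'None'?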

def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

            return Render(source_html).html

import requests
#url = 'http://webscraping.com'
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))

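EDIT: Thanks for the responses, which I have incorporated into the code. But now it returns an error, and the script hangs until I kill the Python launcher, which then causes a segfault.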

Here is the revised code:

def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            #self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

#url = 'http://webscraping.com'
#url='http://www.amazon.com'
url="https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))

It throws these errors:

$ python3 -tt fees-pkg-v2.py 
Traceback (most recent call last): 
    File "fees-pkg-v2.py", line 30, in _callable 
    self.html = data 
AttributeError: 'method' object has no attribute 'html' 
None (hangs here until force-quit python launcher) 
Segmentation fault: 11 
$

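I've started reading up on Python classes to fully understand what I'm doing (always a good thing). I'm thinking something in my environment may be the problem (OS X Yosemite, Python 3.4.3, Qt5.4.1, sip-4.16.6). Any other suggestions?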

Source

2017-07-23 Russ

It looks like the return statement of 'render' is not indented correctly; it should be at the same level as the class above. – PRMoureu
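The effect of that indentation can be reproduced without Qt at all. The following is a minimal sketch, not the asker's code: `Holder`, `broken_render` and `fixed_render` are hypothetical names invented for illustration. It shows that a `return` indented under an inner method belongs to that method, so the outer function falls off its end and returns `None`.

```python
def broken_render():
    class Holder:
        def __init__(self):
            self.value = "rendered"

        def callback(self):
            # This return is indented under callback(), so it belongs to
            # the method; broken_render() itself never reaches a return.
            return Holder().value


def fixed_render():
    class Holder:
        def __init__(self):
            self.value = "rendered"

        def callback(self):
            pass

    # Same level as the class statement: this is fixed_render()'s own return.
    return Holder().value


print(broken_render())  # None
print(fixed_render())   # rendered
```

In the original test code the misplaced line sits inside `callable`, which is why `print(render(sample_html))` prints 'None' regardless of the page.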

Let 'QWebEngineView' do all the work for you; there is no need to use 'requests'. QWebEngineView has a [load](http://doc.qt.io/qt-5/qwebengineview.html#load) method that takes a URL. –
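One way this suggestion might look in practice is sketched below. It is an illustration under assumptions, not the commenter's code: the function name `render_url`, the `argv` parameter and the `result` dictionary are invented here, and the sketch assumes a working PyQt5 installation that includes the QtWebEngineWidgets module. The application object is created outside the view rather than inside its constructor, which is a design choice of this sketch.

```python
import sys


def render_url(url, argv=None):
    """Load `url` in a QWebEngineView and return the rendered HTML (or None)."""
    # Imported lazily so the module can be imported without PyQt5 installed.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    # Only one QApplication may exist per process, so reuse it if present.
    app = QApplication.instance() or QApplication(argv or sys.argv)
    view = QWebEngineView()
    result = {"html": None}

    def store_html(html):
        # toHtml() is asynchronous; quit the event loop once the data arrives.
        result["html"] = html
        app.quit()

    def on_load_finished(ok):
        if ok:
            view.page().toHtml(store_html)
        else:
            # Loading failed: leave the result as None and stop waiting.
            app.quit()

    view.loadFinished.connect(on_load_finished)
    view.load(QUrl(url))
    app.exec_()
    return result["html"]
```

Calling `render_url("https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1")` would then replace both the `requests.get` call and the `setHtml` step of the original test code.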

Amazon will detect that you are a scraper on your first hit unless you spoof your request headers. You can use something like https://pypi.python.org/pypi/fake-useragent –
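As a rough sketch of what spoofing a request header means, the example below attaches a browser-like User-Agent to a request object using only the standard library's `urllib.request`, which is swapped in here for `requests` so the example has no third-party dependencies. The User-Agent string and the names `EXAMPLE_USER_AGENT` and `build_request` are placeholders made up for illustration; a package such as fake-useragent could supply the string instead. Nothing is sent over the network.

```python
from urllib.request import Request

# Placeholder browser-like string, chosen only for illustration.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request carrying a custom User-Agent header (not sent yet)."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1")
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
```

With `requests`, the same dictionary can be passed through the `headers=` argument of `requests.get`. Whether a given site accepts the request still depends on that site's own checks.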

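The problem was the environment. I had manually installed Python 3.4.3, Qt5.4.1 and sip-4.16.6, and something must have gone wrong. After installing Anaconda, the script started working. Thanks again.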
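When an environment is suspected, it can help to record which versions Python actually picks up before and after reinstalling. The helper below is a small sketch with a hypothetical name, `environment_report`; it reads the version strings that PyQt5 exposes in `PyQt5.QtCore` and falls back to a marker string if the import fails.

```python
import importlib
import platform


def environment_report():
    """Collect Python, platform, Qt and PyQt version strings in a dict."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        qtcore = importlib.import_module("PyQt5.QtCore")
        report["qt"] = qtcore.QT_VERSION_STR
        report["pyqt"] = qtcore.PYQT_VERSION_STR
    except ImportError:
        # PyQt5 is missing or cannot be loaded in this environment.
        report["qt"] = report["pyqt"] = "not installed"
    return report


for name, version in environment_report().items():
    print(name, version)
```

Comparing the output from the manually built setup with the output under Anaconda would show which component versions changed, though it would not by itself explain the segfault.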

Source

2017-07-24 14:14:22 Russ
