Wrapping a Selenium driver (and other blocking calls) with asyncio's run_in_executor

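I'm experimenting with my first small scraper in Python, and I want to use asyncio to fetch multiple websites simultaneously. I've already written a function that works with aiohttp, but since aiohttp.request() doesn't execute JavaScript, it isn't suitable for scraping some dynamic web pages. That motivated me to try PhantomJS with Selenium as a headless browser.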

There are a few snippets demonstrating how to use BaseEventLoop.run_in_executor, such as here, but the documentation is sparse and my copy-and-paste mojo wasn't strong enough.

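If someone would be kind enough to expand on using asyncio to wrap blocking calls, or explain what is happening in this specific case, I'd appreciate it! Here is what I've knocked together so far: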

@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, string, int) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        response = yield from future
        print(response)
        if response.status == 200:
            body = BeautifulSoup(self.driver.page_source)
            self.results.append((url, body))
        else:
            self.results.append((url, ''))
    except:
        self.results.append((url, ''))

response comes back as None. Why?

Source

2015-05-11 Todd Howe

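I came across this question while looking up run_in_executor. I'm not sure I have an answer, but are you expecting the 'fetch_page_pjs' function to return the response? At the moment it doesn't return anything, so Python returns 'None'. It also appears, since you use 'self', that this function is actually part of a class, and you never actually save the response on the class. Is that intended? Finally, the first answer in the question you linked amounts to running the 'fetch_page_pjs' function via 'loop.run_until_complete(main())', i.e. inside an event loop. Are you doing that? – neRok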

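This is not an asyncio or run_in_executor problem. The Selenium API simply can't be used this way. First, driver.get doesn't return anything; see the docs for Selenium. Second, it isn't possible to get the status code directly with Selenium; see this Stack Overflow question.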
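To illustrate the first point, here is a minimal sketch showing that the value obtained from a run_in_executor future is simply whatever the wrapped callable returned. The functions blocking_get and blocking_fetch are hypothetical stand-ins (no Selenium involved); blocking_get mimics driver.get in that it does its work and returns nothing. The sketch uses the newer async/await syntax and assumes Python 3.7 or later.

```python
import asyncio
import time


def blocking_get(url):
    """Hypothetical stand-in for a blocking call like driver.get: returns nothing."""
    time.sleep(0.01)  # simulate a blocking page load


def blocking_fetch(url):
    """Hypothetical stand-in for a blocking call that does return a value."""
    time.sleep(0.01)
    return "<html>" + url + "</html>"


async def demo():
    loop = asyncio.get_running_loop()
    # The awaited result is exactly what the wrapped callable returned.
    nothing = await loop.run_in_executor(None, blocking_get, "http://example.com")
    page = await loop.run_in_executor(None, blocking_fetch, "http://example.com")
    return nothing, page


result = asyncio.run(demo())
print(result)
```

The first element of the result is None because blocking_get has no return statement, which is the same reason response is None in the question.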

This code worked for me:

@asyncio.coroutine
def fetch_page_pjs(self, url):
    '''
    (self, string) -> None
    Performs async website content retrieval
    '''
    loop = asyncio.get_event_loop()
    try:
        # driver.get blocks, so run it in the default executor
        future = loop.run_in_executor(None, self.driver.get, url)
        print(url)
        # driver.get returns None; just wait for the page load to finish
        yield from future
        body = BeautifulSoup(self.driver.page_source)
        self.results.append((url, body))
    except Exception:
        # record an empty result if loading or parsing fails
        self.results.append((url, ''))
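
For fetching several pages at once, the same idea might be sketched with async/await (Python 3.7 or later) as below. Everything here is an assumption for illustration: FakeDriver is a hypothetical stand-in that only imitates the two members used above (get and page_source), and load_page is a hypothetical helper. Each call creates its own driver, on the assumption that a single driver instance should not be shared between executor threads.

```python
import asyncio
import time


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver; not the real API."""

    def __init__(self):
        self.page_source = ""

    def get(self, url):
        time.sleep(0.01)  # simulate a blocking page load
        self.page_source = "<html><title>" + url + "</title></html>"
        # like driver.get in the answer above, this returns None


def load_page(url):
    """Blocking helper: one driver per call, so threads never share a session."""
    driver = FakeDriver()
    driver.get(url)
    return driver.page_source


async def fetch_all(urls):
    loop = asyncio.get_running_loop()
    # Submit every blocking load to the default thread pool executor.
    futures = [loop.run_in_executor(None, load_page, url) for url in urls]
    # gather keeps results in the same order as the submitted futures.
    pages = await asyncio.gather(*futures, return_exceptions=True)
    # Mirror the answer's behaviour: store '' for any page that failed.
    return [(url, "" if isinstance(page, Exception) else page)
            for url, page in zip(urls, pages)]


results = asyncio.run(fetch_all(["http://a.example", "http://b.example"]))
print(results)
```

With a real driver, load_page would also need to create and quit the browser, which is left out of this sketch.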

Source

2015-10-08 20:22:37
