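How to scrape data loaded by JavaScript using PHP Goutte and Guzzle?

Many times when crawling we run into the problem that content rendered on the page is generated with JavaScript (e.g. ajax requests, jQuery), so scrapy is unable to crawl it. How can data loaded by JavaScript be scraped using PHP Goutte and Guzzle?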


2016-04-17 Batman

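Guzzle (which Goutte uses internally) is an HTTP client. As a result, JavaScript content will not be parsed or executed. JavaScript files that live outside the requested endpoint will not be downloaded either.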
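To illustrate the point, here is a minimal sketch using only PHP's built-in DOM extension. The markup, the `price` id and the `42 USD` value are made up for the example; it stands in for what an HTTP client would receive from a page that fills in its content with JavaScript. The script is present in the response as inert text, so the element it would populate in a browser stays empty.

```php
<?php
// Hypothetical response body: in a real browser the script fills the <div>.
$html = <<<'HTML'
<html>
<body>
  <div id="price"></div>
  <script>
    document.getElementById('price').textContent = '42 USD';
  </script>
</body>
</html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The <div> exists, but nothing ever ran the script that would fill it.
$priceNode = $xpath->query('//div[@id="price"]')->item(0);
$priceText = trim($priceNode->textContent);

// The script itself is just text inside the document.
$scriptNode = $xpath->query('//script')->item(0);
$scriptIsPlainText = strpos($scriptNode->textContent, 'getElementById') !== false;

var_dump($priceText);          // empty string: the JavaScript was not executed
var_dump($scriptIsPlainText);  // true: the script source is only data
```

The same holds for any parser fed by Guzzle or Goutte: to see the filled-in value, something has to actually run the JavaScript first.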

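Depending on your environment, I suppose it would be possible to use PHPv8 (a PHP extension that embeds the Google V8 JavaScript engine) together with a custom handler/middleware to do what you want.

Then again, depending on your environment, it might be easier to simply perform the scraping with a JavaScript client.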


2016-04-18 21:44:47

-1

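Since it is not possible to get the JavaScript to run, I can suggest another solution:

Google Chrome > right-click > Inspect Element > right-click > Edit as HTML > Copy > work with the copied HTML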

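    use Symfony\Component\DomCrawler\Crawler;

    // $the_copied_html holds the markup copied from the browser's inspector.
    $html = $the_copied_html;
    $crawler = new Crawler($html);

    // Collect the text of every node that matches the selector.
    $data = $crawler->filter('.your-selector')->each(function (Crawler $node, $i) {
        return [
            'text' => $node->text(),
        ];
    });

    // Do whatever you want with $data
    return $data; // array

This only works for one-off jobs, not for an automated process. In my case, that was enough.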


2017-04-17 11:40:23

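You want to have a look at PhantomJS. There is this PHP implementation:

http://jonnnnyw.github.io/php-phantomjs/

in case you need it to work from PHP, of course.

You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like searching the contents, etc.). It will depend on your needs; maybe you can simply use the DOM, like this: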

How to get element by class name?
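As a rough sketch of that plain-DOM route, the snippet below looks up elements by class name with PHP's built-in `DOMDocument` and `DOMXPath`. The markup and the `headline` class are invented for the example; in practice `$html` would be the rendered page content returned by the headless browser.

```php
<?php
// Hypothetical rendered markup; replace with the content you fetched.
$html = '<ul>'
      . '<li class="headline top">First</li>'
      . '<li class="other">Skip</li>'
      . '<li class="headline">Second</li>'
      . '</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match the class as a whole word, so "headline" does not match "headlines".
$className = 'headline';
$query = sprintf(
    '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
    $className
);

$texts = [];
foreach ($xpath->query($query) as $node) {
    $texts[] = trim($node->textContent);
}

print_r($texts);
```

The `concat`/`normalize-space` idiom is used because XPath 1.0 has no class selector of its own; a bare `contains(@class, "headline")` would also match unrelated class names that merely contain that substring.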

Here is some working code for the PhantomJS approach.

    // Assumes PhantomClient is an alias for the php-phantomjs client class.
    use JonnyW\PhantomJs\Client as PhantomClient;

    // Fetch the rendered page once and feed it into the crawler.
    $content = $this->getHeadlessResponse($url);
    $this->crawler->addContent($content);

    /**
     * Get response using a headless browser (PhantomJS in this case).
     *
     * @param string $url
     *   URL to fetch headless.
     *
     * @return string
     *   Response body, or an empty string if the status is not 200.
     */
    public function getHeadlessResponse($url) {
        // Fetch with PhantomJS.
        $phantomClient = PhantomClient::getInstance();
        $request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');

        /**
         * @see JonnyW\PhantomJs\Http\Response
         */
        $response = $phantomClient->getMessageFactory()->createResponse();

        // Send the request.
        $phantomClient->send($request, $response);

        if ($response->getStatus() === 200) {
            // Return the rendered page content.
            return $response->getContent();
        }

        // Any other status: return an empty string so the documented type holds.
        return '';
    }

The only drawback of using a headless browser is that it will be slower than Guzzle, but of course you have to wait for all that pesky JS to load.


2017-07-20 11:01:27

Check whether `$response->getStatus()` also equals 301, in case of a redirect. – thisiskelvin
