How to detect if a page uses JavaScript extensively with Python, Scrapy and Selenium?

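I wrote a Scrapy spider with the help of Selenium to process JavaScript content on web pages. However, I noticed that this spider is much slower than an ordinary Scrapy crawler. For this reason, I want to combine two spiders: the common CrawlSpider to fetch all resources, and a Selenium spider only for pages that make extensive use of JavaScript. I created a pipeline step that tries to detect whether a web page requires JavaScript and uses it heavily. So far my ideas for this processing step have failed: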

Some pages use the common <noscript> tag.
Some pages print a warning message, e.g. <div class="yt-alert-message" >.
...

There are so many different ways to indicate that a page requires JavaScript!
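For illustration, the two signals listed above could be checked on the raw HTML roughly as follows. This is only a sketch using the standard-library `html.parser`; the names `JsHintParser`, `js_hints` and `WARNING_CLASSES` are hypothetical, and the class list contains just the one example from this question, so it would need extending for other sites.

```python
from html.parser import HTMLParser

# Hypothetical list of CSS classes used for "please enable JavaScript" warnings.
# Only "yt-alert-message" comes from the question; extend it for your own targets.
WARNING_CLASSES = {"yt-alert-message"}


class JsHintParser(HTMLParser):
    """Records whether the HTML contains <noscript> or a known warning class."""

    def __init__(self):
        super().__init__()
        self.has_noscript = False
        self.has_warning_class = False

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.has_noscript = True
        # attribute values can be None (e.g. <div class>), hence the `or ""`
        classes = (dict(attrs).get("class") or "").split()
        if WARNING_CLASSES.intersection(classes):
            self.has_warning_class = True


def js_hints(html):
    """Return (has_noscript, has_warning_class) for an HTML string."""
    parser = JsHintParser()
    parser.feed(html)
    parser.close()
    return parser.has_noscript, parser.has_warning_class
```

Neither signal says how heavily a page depends on JavaScript, only that the site author anticipated visitors without it, which is why this kind of check alone is not enough.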

Do you know of a standardized way to "detect" pages that make extensive use of JavaScript?

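Note: I only want to process pages with my Selenium spider when it is really necessary, because that spider is significantly slower and some pages use JavaScript only for a nicer design.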

2013-05-13 Jon

Have you tried [mechanize](http://wwwsearch.sourceforge.net/mechanize/) for these JS-heavy pages? – alecxe 2013-05-13 21:32:41

You can collect all the JavaScript from the script tags, add it all together, and check whether its length exceeds what you consider to be "a lot" of JavaScript.

from selenium.webdriver.common.by import By

# `browser` is assumed to be an existing Selenium WebDriver
# that has already loaded the page you want to inspect

# get all script tags
scripts = browser.find_elements(By.TAG_NAME, "script")

# create a string to add all the JS content to
javaScriptChars = ""

# create a list to store urls for external scripts
urls = []

# for each script on the page...
for script in scripts:

    # get the src (None or empty for inline scripts)
    url = script.get_attribute("src")

    # if script is external (has a 'src' attribute)...
    if url:

        # add the url to the list (will access it later)
        urls.append(url)

    else:

        # the script is inline - so just get the text inside
        javaScriptChars += script.get_attribute("textContent") or ""

# for each external url found above...
for url in urls:

    # open the script
    browser.get(url)

    # add the content to our string
    javaScriptChars += browser.page_source

# check if the string is longer than some threshold you choose
if len(javaScriptChars) > 50000:
    # JS contains more than 50000 characters
    print("page uses a lot of JavaScript")

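That number is arbitrary. I'd guess that fewer than 50000 characters of JS may not actually be "a lot", because the page probably does not call every function every time. It will likely depend on what the user does.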

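However, if you can assume that a well-designed site only includes the scripts it needs, then the character count can still be a relevant indicator of how much JS it runs.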
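Since the goal is to avoid Selenium where possible, a similar count could be made on the raw HTML that Scrapy has already downloaded. The following is a rough sketch of that variation, not the exact method above: `ScriptStats` and `script_stats` are hypothetical names, it uses only the standard-library `html.parser`, and it measures inline scripts only, returning external script URLs so they can be fetched and measured separately.

```python
from html.parser import HTMLParser


class ScriptStats(HTMLParser):
    """Collects inline script length and external script URLs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.inline_chars = 0
        self.external_urls = []
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # external script: remember the URL, its size is unknown here
            self.external_urls.append(src)
        else:
            self._in_inline_script = True

    def handle_data(self, data):
        # script bodies may arrive in several chunks, so accumulate
        if self._in_inline_script:
            self.inline_chars += len(data)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False


def script_stats(html):
    """Return (inline_script_chars, external_script_urls) for an HTML string."""
    parser = ScriptStats()
    parser.feed(html)
    parser.close()
    return parser.inline_chars, parser.external_urls
```

In a Scrapy callback this could be called as `script_stats(response.text)`, with the result compared against whatever threshold you choose (such as the arbitrary 50000 above) to decide whether the page is handed to the Selenium spider.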

2013-05-24 18:58:52
