Scrapy + Splash + ScrapyJS

I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I am still not able to render JavaScript with a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf

I am still not getting the page rendered with the phone number:

import scrapy


class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        # Lua script run by Splash: load the page, click the phone span, return the HTML
        script = """
        function main(splash)
            splash:go(splash.args.url)
            splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
            splash:wait(0.5)
            return splash:html()
        end
        """
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        import ipdb; ipdb.set_trace()

How can I get this working?

Source

2016-03-03 psychok7

You can avoid using Splash in the first place and make the appropriate GET request to get the phone number yourself. Working spider:

import json
import re

import scrapy


class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        # Follow every listing on the current results page
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents)

        # Follow the pagination links
        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        # The listing ID is the part of the URL between "ID" and the next dot
        property_id = re.search(r"ID(\w+)\.", response.url).group(1)

        # Request the phone number from the AJAX endpoint directly
        phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
        yield scrapy.Request(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        # The endpoint returns JSON with the number under the "value" key
        phone_number = json.loads(response.body)["value"]
        print(phone_number)
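
To check the ID extraction and URL building without running a crawl, the two steps can be pulled out into plain functions. This is only a sketch: the helper names `phone_url_for` and `phone_from_body` are made up for illustration, the regex and endpoint pattern are the ones used in the spider above, and the JSON payload in the usage note is a placeholder rather than a real response.

```python
import json
import re


def phone_url_for(listing_url):
    """Return the AJAX phone URL for a listing URL, or None if no ID is found."""
    # Same regex as in the spider: capture what sits between "ID" and the next dot
    match = re.search(r"ID(\w+)\.", listing_url)
    if match is None:
        return None
    return "https://olx.pt/ajax/misc/contact/phone/%s/" % match.group(1)


def phone_from_body(body):
    """Extract the "value" field from the JSON body returned by the endpoint."""
    # Decode bytes first so this also works on Python 3 versions older than 3.6
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)["value"]
```

For the example listing from the question, `phone_url_for` yields `https://olx.pt/ajax/misc/contact/phone/yTzAT/`; `phone_from_body(b'{"value": "000 000 000"}')` returns the placeholder string `000 000 000`.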

If there is more to extract from this "dynamic" website, see whether Splash is really enough, and if not, look into browser automation and selenium.

Source

2016-03-03 19:34:05 alecxe

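I really need this to work, because I will be moving on to more complex JS sites with date picker calendars and stuff – psychok7

@psychok7 are you sure scrapyjs would be enough for your complex dynamic websites? Maybe switching to `selenium` would make things faster and simpler.. – alecxe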

I tried it.. I don't know if it is possible or not.. but I will consider selenium as well, thanks – psychok7

Add

splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")

to the Lua script and it will work.

function main(splash) 
    splash:go(splash.args.url) 
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js") 
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();') 
    splash:wait(0.5) 
    return splash:html() 
end

.click() is a jQuery function: https://api.jquery.com/click/
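
On the Scrapy side, the modified script plugs into the request the same way as in the question. The following is a minimal sketch of that wiring; the constant `LUA_SCRIPT` and the helper `splash_meta` are hypothetical names introduced here, while the `meta` layout (`splash`, `args`, `lua_source`, `endpoint`) is copied from the question's spider.

```python
# Lua script from this answer, kept as a Python string so it can be sent to Splash
LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end
"""


def splash_meta(lua_source):
    """Build the Request.meta dict used in the question's spider (hypothetical helper)."""
    return {
        "splash": {
            "args": {"lua_source": lua_source},
            "endpoint": "execute",
        }
    }
```

In the spider it would be used as `scrapy.Request(url, callback=self.parse_house_contents, meta=splash_meta(LUA_SCRIPT))`.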

Source

2016-03-05 16:07:24 marvin
