2016-01-28

I want to crawl a page that navigates via JavaScript, using scrapyjs with Splash. The page contains:

<span onclick="go1()">click here</span>
<script>
function go1() {
    window.location = "../innerpages/" + myname + ".php";
}
</script>

To follow the onclick navigation from the page URL, I use scrapyjs with Splash:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 4, 'html': 1, 'png': 1, 'render_all': 1,
                         'js_source': 'document.getElementsByTagName("span")[0].click()'},
            }
        })

If I instead write

'js_source': 'document.title="hello world"'

it works.

So it seems I can manipulate text in the page, but I cannot get the URL produced inside go1().

What should I do if I want to run go1() and obtain the resulting URL? Thanks!

Answer


You can use the /execute endpoint:

class MySpider(scrapy.Spider):
    ...

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(1))

            assert(splash:runjs('document.getElementsByTagName("span")[0].click()'))
            assert(splash:wait(1))

            -- return the result as a JSON object
            return {
                html = splash:html()
            }
        end
        """
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse_result, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

    def parse_result(self, response):

        # fetch the base URL, because the response URL is the Splash endpoint
        baseurl = response.meta["_splash_processed"]["args"]["url"]

        # decode the JSON response
        splash_json = json.loads(response.body_as_unicode())

        # build a new selector from the "html" key of that object
        selector = scrapy.Selector(text=splash_json["html"], type="html")

        ...
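Note that go1() sets window.location to a path relative to the current page, which is why parse_result keeps baseurl around: any relative link pulled out of the rendered HTML must be resolved against the original page URL, not the Splash endpoint. A minimal stdlib sketch (the page path and the value of myname here are hypothetical, for illustration only):

```python
from urllib.parse import urljoin

# baseurl would come from response.meta["_splash_processed"]["args"]["url"];
# this example value is hypothetical.
baseurl = "http://example.com/pages/index.php"

# go1() builds "../innerpages/" + myname + ".php"; assuming myname == "foo",
# the browser would resolve it like this:
target = urljoin(baseurl, "../innerpages/foo.php")
print(target)
# -> http://example.com/innerpages/foo.php
```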

Thanks for your reply, I will try it later :-) – casker