
Passing a Selenium HTML string to Scrapy to add URLs to the list of URLs Scrapy crawls. I am very new to Python, Scrapy, and Selenium, so any help you can provide would be greatly appreciated.

I want to be able to take the HTML that Selenium retrieves as the page source and process it into a Scrapy Response object. The main reason is to be able to add the URLs found in the Selenium webdriver's page source to the list of URLs that Scrapy will parse.
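A minimal sketch of what this conversion might look like (the helper name response_from_driver is made up for illustration; HtmlResponse is Scrapy's own response class, so the page source retrieved by Selenium can then be queried with the usual selectors):

from scrapy.http import HtmlResponse

def response_from_driver(driver):
    # Wrap the browser-rendered HTML in a Scrapy response so it can be
    # parsed with the same Selector/XPath code used for normal downloads.
    return HtmlResponse(
        url=driver.current_url,      # final URL shown by the webdriver
        body=driver.page_source,     # HTML as rendered by the browser
        encoding='utf-8',            # assumed page encoding
    )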

Again, any help would be greatly appreciated.

As a second question, does anyone know how to view the list of URLs that Scrapy has found and scraped, or the URLs still queued to be scraped?

Thank you!

*******EDIT******* Below is an example of what I am trying to do. I can't figure out Part 5.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        URL = response.url

        # Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        # Part 2: if not, then get the page via the Selenium webdriver
        if a_len == 0:
            # open the webdriver and load the page
            driver = webdriver.Firefox()
            driver.get(response.url)

        # Part 3: interact with the page until the element appears
        while a_len == 0:
            print "ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE"

            # scroll down one time
            driver.execute_script("window.scrollTo(0, 1000000000);")

            # get the page source and check if the last page is there
            selen_html = driver.page_source
            hxs = Selector(text=selen_html)
            last_chk = hxs.xpath('//ul/li[@last_page="true"]')

            a_len = len(last_chk)

        driver.close()

        # Part 4: extract the URLs from the Selenium webdriver page source
        all_URLS = hxs.xpath('//a/@href').extract()

        # Part 5: add all_URLS to the URLs Scrapy will scrape

So what have you tried so far? – dorvak


I haven't been able to come up with anything to try. I don't know how to access Scrapy's URL queue. I do know how to extract URLs from the HTML, so I guess the simple question is: how do you manually add URLs to Scrapy's queue? – user1500158

Answer


Just yield Request instances from the method and provide a callback:

from scrapy.http import Request


class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...

        all_URLS = hxs.xpath('//a/@href').extract()

        for url in all_URLS:
            # each yielded Request is added to Scrapy's scheduler queue
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
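
One detail to watch when feeding the hrefs extracted from the Selenium page source back into Scrapy: they may be relative, so they should be made absolute before being yielded. A minimal sketch using Python 2's urlparse, assuming it sits inside parse_abcs after Part 4:

import urlparse

# inside parse_abcs, after Part 4:
for url in all_URLS:
    # resolve relative hrefs against the current page URL before queueing them
    yield Request(urlparse.urljoin(response.url, url), callback=self.parse_page)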