
Scraping data from multiple URLs

I want to scrape data from http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog, but the mid parameter in the URL keeps incrementing to give a second, third, ... URL, up to about 1000 URLs. How should I handle this? (I am new to Python and Scrapy, so please bear with the question.)

Please also check the XPath I have used to extract the information; it fetches no output. Here is the spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from movie.items import MovieItem

class MySpider(BaseSpider):
    name = 'movie'
    allowed_domains = ["http://cbfcindia.gov.in/"]
    start_urls = ["http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//body")  # check
        print titles
        items = []
        for title in titles:
            print "in FOR loop"
            item = MovieItem()
            item["movie_name"] = hxs.xpath('//TABLE[@id="Table2"]/TR[2]/TD[2]/text()').extract()
            print "movie name:", item["movie_name"]
            item["movie_language"] = hxs.xpath('//*[@id="lblLanguage"]/text()').extract()
            item["movie_category"] = hxs.xpath('//*[@id="lblRegion"]/text()').extract()
            item["regional_office"] = hxs.xpath('//*[@id="lblCertNo"]/text()').extract()
            item["certificate_no"] = hxs.xpath('//*[@id="Label1"]/text()').extract()
            item["certificate_date"] = hxs.xpath('//*[@id="lblCertificateLength"]/text()').extract()
            item["length"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()
            item["producer_name"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()

            items.append(item)

            print "this is ITEMS"
        return items

The basic error log is below:

    {'certificate_date': [], 
    'certificate_no': [], 
    'length': [], 
    'movie_category': [], 
    'movie_language': [], 
    'movie_name': [], 
    'producer_name': [], 
    'regional_office': []} 
2014-06-11 23:20:44+0530 [movie] INFO: Closing spider (finished) 
2014-06-11 23:20:44+0530 [movie] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 256, 
    'downloader/request_count': 1, 
    'downloader/request_method_count/GET': 1, 
    'downloader/response_bytes': 6638, 
    'downloader/response_count': 1, 
    'downloader/response_status_count/200': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2014, 6, 11, 17, 50, 44, 54000), 
    'item_scraped_count': 1, 
    'log_count/DEBUG': 4, 
    'log_count/INFO': 7, 
    'response_received_count': 1, 
    'scheduler/dequeued': 1, 
    'scheduler/dequeued/memory': 1, 
    'scheduler/enqueued': 1, 
    'scheduler/enqueued/memory': 1, 
    'start_time': datetime.datetime(2014, 6, 11, 17, 50, 43, 681000)} 

I could use the code below to build a start_urls list, but I want to do it for i in range(1, 1000); would that cause any problem? Even so, I still cannot scrape the data, and ITEMS remains empty: start_urls = []; for i in range(1, 10): url = 'http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=' + str(i) + '&Loc=Backlog'; start_urls.append(url) – user3698581
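
For reference, a cleaned-up version of the URL-generation loop described in the comment above (a sketch only; the upper bound of 1000 comes from the question, and it assumes every mid in that range resolves to a valid page):

start_urls = []
for i in range(1, 1000):
    # increment the mid query parameter to cover each detail page
    start_urls.append("http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=%d&Loc=Backlog" % i)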

Answers


allowed_domains should be defined without the http://. For example:

allowed_domains= ["cbfcindia.gov.in/"]

If the problem persists, please post more detail, including the log for the pages being crawled and any redirects that may be happening.


The domain should also be without the trailing slash – warvariuc
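
Combining the answer with this comment, the attribute would presumably become:

allowed_domains = ["cbfcindia.gov.in"]  # no scheme, no trailing slash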


In addition to @Talvalin's answer, the correct XPath should take this form:

item["movie_name"] = hxs.xpath("//*[@id='lblMovieName']/font/text()").extract() 

For some reason, when the page loads, the <font> tag ends up separated from the <span> tag (or whatever tag the id is on). I have tested this and it works.

A word of warning, though: the site is all but protected against scraping. I tried a second scrape and it immediately threw a Runtime Error.
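
If repeated crawls are being refused, throttling the spider with Scrapy's standard settings may reduce the chance of being blocked; a minimal sketch (the values are illustrative, not taken from the question):

# settings.py
DOWNLOAD_DELAY = 2                    # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # issue one request at a time
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the delay slightly between requests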