Wrong XPath in IMDB spider (Scrapy)

In the spider from this question: IMDB scrapy get all movie data

response.xpath("//*[@class='results']/tr/td[3]")

returns an empty list. I tried changing it to:

response.xpath("//*[contains(@class, 'chart full-width')]/tbody/tr")

without success.

Can anyone help, please? Thanks.

2017-07-14 Eli S

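Can you specify which link you are scraping when you run it and this problem occurs?

Sure, for example: http://www.imdb.com/search/title?year=year=1950,1950&title_type=feature&sort=moviemeter,asc

I'm not sure what you are trying to do here. But I checked the site, and there is no path with the class **results** or **full-width**.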

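I didn't have time to go through IMDB scrapy get all movie data thoroughly, but I got the gist of it. The problem statement is to get all the movie data from the given site. That involves two things. The first is to walk through all the pages that list the movies of a given year. The second is to get the link to each movie, and from there you do your own magic.

The problem you ran into is the XPath for getting the link to each movie. This is most likely due to a change in the site's structure (I didn't have time to verify what the differences might be). In any case, here are the XPaths you need.

FIRST:

We take the div with class nav as a landmark and find the link below it that has the class lister-page-next next-page.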

response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()

This gives the link to the next page, or returns None on the last page (since the next-page tag does not exist there).
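The two outcomes described above (a link on a middle page, None on the last page) can be sketched with the standard library alone. This is only an illustration under assumptions: the HTML snippets are hypothetical, simplified stand-ins for the IMDB markup, the helper `next_page_url` is made up for this example, and `xml.etree.ElementTree` is used in place of Scrapy's selectors. ElementTree supports only a subset of XPath, so the href is read with `.get()` instead of `/@href`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup; the real page is larger and may not be well-formed XML.
MIDDLE_PAGE = """
<html><body>
  <div class="nav">
    <div class="desc">
      <a class="lister-page-next next-page" href="?year=1950,1950&amp;page=2">Next</a>
    </div>
  </div>
</body></html>
"""

# Same structure, but without the next-page link (as on the last page).
LAST_PAGE = """
<html><body>
  <div class="nav"><div class="desc"></div></div>
</body></html>
"""

# Same path as in the answer, minus the trailing /@href.
NEXT_XPATH = ".//div[@class='nav']/div/a[@class='lister-page-next next-page']"


def next_page_url(html, base_url):
    """Return the absolute URL of the next page, or None if there is no next link."""
    root = ET.fromstring(html)
    link = root.find(NEXT_XPATH)
    if link is None:
        return None
    return urljoin(base_url, link.get("href"))


base = "http://www.imdb.com/search/title"
print(next_page_url(MIDDLE_PAGE, base))
print(next_page_url(LAST_PAGE, base))
```

In a spider, the None case is the natural stopping condition: when no next link is found, no further request is issued.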

SECOND:

This is what the OP originally asked about.

# Get the list of containers holding the title, etc.
content = response.xpath("//div[@class='lister-item-content']")
# From each container, extract the required links
paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract()

Now all you need to do is loop over each element of paths and request the page.
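As a rough sketch of that loop, the snippet below extracts the relative paths from a hypothetical, simplified result page and turns each one into an absolute URL. The HTML, the example title IDs, and the helper `movie_urls` are assumptions made for illustration, and `xml.etree.ElementTree` again stands in for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "http://www.imdb.com"

# Hypothetical, simplified result list with two movies.
RESULTS_PAGE = """
<html><body>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="/title/tt0000001/">Movie A</a></h3>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a href="/title/tt0000002/">Movie B</a></h3>
  </div>
</body></html>
"""


def movie_urls(html, base_url=BASE_URL):
    """Collect the absolute URL of every movie listed on one result page."""
    root = ET.fromstring(html)
    urls = []
    # First select the containers, then the header link relative to each container.
    for container in root.findall(".//div[@class='lister-item-content']"):
        for link in container.findall("h3[@class='lister-item-header']/a"):
            urls.append(urljoin(base_url, link.get("href")))
    return urls


for url in movie_urls(RESULTS_PAGE):
    print(url)
```

Inside a spider, each of these URLs would be passed to a `scrapy.Request` with a callback that parses the movie page.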

2017-07-14 21:42:20

Thanks for your answer. I ended up using your XPath like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"


class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # in order to move to the next page
    rules = (
        Rule(
            LinkExtractor(
                allow=(),
                restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",),
            ),
            callback="parse_page",
            follow=True,
        ),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        # list of paths of movies in the current page
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract()

        # all movies in this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass  # lots of parsing....

Run it with scrapy crawl imdb -a start=<start-year> -a end=<end-year>
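Values passed with -a reach the spider's constructor as strings, which is why the spider converts them with int(). The following standalone sketch mirrors that year handling and URL generation outside of Scrapy so it can be checked in isolation. The function name `start_urls_for` is made up for this example; the defaults and the URL template are taken from the spider above.

```python
URL_TEMPLATE = (
    "http://www.imdb.com/search/title?year={year},{year}"
    "&title_type=feature&sort=num_votes,desc"
)


def start_urls_for(start=None, end=None):
    """Build one search URL per year, using the same defaults as the spider."""
    # Arguments arrive as strings (or None when omitted), so convert explicitly.
    start_year = int(start) if start else 1874
    end_year = int(end) if end else 2017
    # range() excludes its upper bound, hence the + 1 to include end_year.
    return [URL_TEMPLATE.format(year=year) for year in range(start_year, end_year + 1)]


urls = start_urls_for(start="1950", end="1952")
for url in urls:
    print(url)
```

Each generated URL covers exactly one year, so pagination within a year is left to the next-page rule.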

2017-07-22 21:30:30
