
Good evening all: avoiding scraping data that has already been scraped

I'm still working on my spider that scrapes data from a news site, but I've run into another problem. My original issue was posted here: Scrapy outputs [ into my .json file, but that has since been resolved.

I've managed to get a bit further, having made allowance for empty items and added a search function, and I'm now trying to scrape only the articles I haven't scraped yet (bearing in mind that I still want to extract links to follow from them). I can't work out where to put the code that:

a.) defines what the time of the last crawl was
b.) compares the article's date with the date of the last crawl.

I may just be struggling with the logic, so I'm turning to you.
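For what it's worth, what I have in mind for (a) and (b) is roughly the sketch below (just a sketch: the file name, the date format, and the assumption that spnDate uses that same format are all placeholders on my part):

import os
from datetime import datetime

LAST_CRAWLED_FILE = "last_crawled.txt"   # placeholder location for the saved timestamp
DATE_FORMAT = "%Y-%m-%d %H:%M"           # assumed format for both the file and spnDate

def load_last_crawled():
    # (a) read the time of the previous crawl from disk, or None on the first run
    if os.path.exists(LAST_CRAWLED_FILE):
        with open(LAST_CRAWLED_FILE) as f:
            return datetime.strptime(f.read().strip(), DATE_FORMAT)
    return None

def save_last_crawled():
    # called once the crawl finishes, so the next run knows where to pick up
    with open(LAST_CRAWLED_FILE, "w") as f:
        f.write(datetime.now().strftime(DATE_FORMAT))

def is_new(article_date_text, last_crawled):
    # (b) compare the article's date against the date of the last crawl
    if last_crawled is None:
        return True
    return datetime.strptime(article_date_text.strip(), DATE_FORMAT) > last_crawled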

My spider:

# tabbing in python is apparently VERY important, so be aware and make sure
# things that should line up do so

# import the CrawlSpider class, along with its Rules (this lets us recursively
# crawl pages)

from scrapy.contrib.spiders import CrawlSpider, Rule

# import the link extractor, this extracts links from pages

from scrapy.contrib.linkextractors import LinkExtractor

# import our items as defined in items.py

from basic.items import BasicItem

# import time so that we can get the current date and time

import time

# import re, which allows us to compare strings

import re

# create a new spider with the CrawlSpider class

class BasicSpiderSpider(CrawlSpider):

    # name of the spider, this is used to run it (i.e. scrapy crawl basic_spider)

    name = "basic_spider"

    # domains that the spider is allowed to crawl over

    allowed_domains = ["news24.com"]

    # where to start crawling from

    start_urls = [
        'http://www.news24.com',
    ]

    # rules for the link extractor (i.e. where it's allowed to look for links,
    # what to do once it's found them, and whether it's allowed to follow them)

    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),
    )

    # defining the callback function

    def parse_items(self, response):

        # defines the top-level XPath where all of our information can be found,
        # needs to be as specific as possible to avoid duplicates

        for title in response.xpath('//*[@id="aspnetForm"]'):

            # list of keywords to search through

            key = re.compile("joburg|durban", re.IGNORECASE)

            # extracting the data to compare with the keywords, this is for the
            # headlines, the join converts it from a list type to a string type

            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)

            # and this is for the article

            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)

            # if any keywords are found in the headline:

            if key.search(head):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():

                    # define the top-level xpath again as python won't look outside
                    # its current function

                    for thing in response.xpath('//*[@id="aspnetForm"]'):

                        # fills the items defined in items.py with relevant data

                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        # I found that even with being careful about my XPaths I
                        # still got empty fields and lines, the below fixes that

                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item

            # if the headline item doesn't match, check the article item

            elif key.search(art):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetForm"]'):
                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item

It doesn't work, but as I said I'm sceptical of the logic anyway. Can someone let me know if I'm on the right track here?

Thanks again for all the help.

Answer


You seem to be using last_crawled completely out of context. But don't bother trying to make more use of it; you would be much better off using the deltafetch middleware, which was created for exactly what you are trying to do:

This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.

To use deltafetch, first install scrapylib:

pip install scrapylib 

and then enable it in settings.py:

SPIDER_MIDDLEWARES = { 
    'scrapylib.deltafetch.DeltaFetch': 100, 
} 

DELTAFETCH_ENABLED = True 
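As far as I remember (worth confirming against the scrapylib source), the middleware only skips requests whose responses produced items on a previous run, so index pages that merely yield links to follow are still revisited, and the seen-request state is kept in a small database under your project's .scrapy directory. A couple of optional knobs, sketched under those assumptions:

# optional: reset the stored state so the next run recrawls everything once
DELTAFETCH_RESET = True

# inside a spider callback you can also key a page explicitly through the
# request meta (instead of the default request fingerprint), for example:
#
#     yield Request(url, callback=self.parse_items,
#                   meta={'deltafetch_key': url})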

Thanks Lawrence, deltafetch is the answer I came across when I was trying to find a solution, but it didn't seem to meet my needs, since I still want to visit the page and extract links to follow from it (in case there are new related articles or the like). Will deltafetch still extract the links to follow from a page even when it doesn't want to extract item data from it? Will have a play with it regardless and let you know, thanks for the reply! – 2015-04-02 06:21:39


Did just that and it works perfectly, thanks! – 2015-04-02 13:46:24