
I want to know how to stop it from recording the same URL more than once. How do I keep my crawler from logging duplicates?

Here is the code I have so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url'] + "\n")

Right now it records a single link thousands of times over, for example on a vBulletin forum with around 250,000 posts.

Edit: note that the crawler will collect millions of links, so I need the duplicate check to be really fast.

+0

Sounds like you are building an ugly bot. Scraping email addresses, perhaps? – Dionys

+0

No, it is my own website. I need to collect the forum URLs so I can submit them to an archiving site. – mark

+0

Have you considered keeping your URLs in a 'set()'? – boardrider
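
A minimal sketch of that suggestion, reusing the attribute names from the question's spider (seen_urls is a name introduced here for illustration; only the relevant parts of the class are shown). Unlike a list, a set gives average O(1) membership checks.

    # Sketch only: keep every URL already written in a set inside the spider.
    seen_urls = set()

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.seen_urls:  # already recorded? then skip it
                self.seen_urls.add(link.url)
                self.f.write(link.url + "\n")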

Answers

2

Keep a list of the URLs you have already visited and check every URL against it. After parsing a particular URL, add it to the list. Before processing a newly discovered URL, check whether it is already in that list; if not, parse it and add it, otherwise skip it.

For example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = []  # list of URLs that have already been recorded
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:   # check if it has already been recorded
                self.items.append(link.url)  # remember it so duplicates are skipped
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")

Dictionary version:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = {}  # dictionary with the URLs as keys; lookups are O(1) on average
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check if it has already been recorded
                self.items[link.url] = 1    # add as a key; the stored value can be anything
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")

P.S. You could also collect the items first and only write them to the file at the end (sketched below).

There are many other improvements that could be made to this code, but I leave those for you to work out.
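
A hedged sketch of that P.S., assuming the same spider as above and showing only the changed parts: accumulate the URLs in memory (a set also drops duplicates for free) and write the file once in Scrapy's closed() hook, which runs when the crawl finishes. collected_urls is a name introduced here for illustration.

    # Sketch only: collect first, write once when the spider closes.
    collected_urls = set()

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            self.collected_urls.add(link.url)  # the set ignores duplicates by itself

    def closed(self, reason):
        # called by Scrapy once when the spider finishes
        with open("items.txt", "w") as f:
            for url in self.collected_urls:
                f.write(url + "\n")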

+0

How would I do that? – mark

+2

Partly a good idea. Even better is to hash the visited URLs and check in O(1) whether the hash exists in a hashmap. – sturcotte06
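
For scale, note that a Python set is already a hash table, so membership tests are O(1) on average. With millions of URLs, one option along the lines of the comment above is to store fixed-size digests instead of full URL strings, keeping memory per entry constant. A rough sketch (seen_hashes and is_new_url are names introduced here, not part of the original code):

import hashlib

# Sketch only: store 20-byte SHA-1 digests of visited URLs in a set,
# so each check is an O(1) average-case hash lookup.
seen_hashes = set()

def is_new_url(url):
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True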

+1

You can use a python dictionary with the URLs as keys, and then do something like 'if url in myDict: continue, else myDict[url] = True ... (rest of the code)'. – BrunoRB