
I would like to know how I can stop it from recording the same URL more than once. How do I stop my crawler from logging duplicates?

This is the code I have so far:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor 
from scrapy.item import Item, Field 

class MyItem(Item): 
    url=Field() 

class someSpider(CrawlSpider): 
    name = "My script" 
    domain=raw_input("Enter the domain:\n") 
    allowed_domains = [domain] 
    starting_url=raw_input("Enter the starting url with protocol:\n") 
    start_urls = [starting_url] 
    f=open("items.txt","w") 

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),) 


    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url'] + "\n")

Right now it will record thousands of duplicates of a single link, for example on a vBulletin forum with roughly 250,000 posts.

Edit: Note that the crawler will end up with millions of links, so I need the check to be really fast.


Sounds like you are building an ugly bot. Scraping e-mail addresses, perhaps? – Dionys


No. It's my own website. I need to get the forum URLs so I can submit them to an archiving site. – mark


Have you considered keeping your URLs in a 'set()'? – boardrider

Answer


Create a list of already visited URLs and check every new URL against it. After parsing a given URL, add it to the list. Before crawling the page behind a newly found URL, check whether that URL is already in the list: if not, parse it and add it; otherwise skip it.

For example:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor 
from scrapy.item import Item, Field 

class MyItem(Item): 
    url=Field() 

class someSpider(CrawlSpider): 
    name = "My script" 
    domain=raw_input("Enter the domain:\n") 
    allowed_domains = [domain] 
    starting_url=raw_input("Enter the starting url with protocol:\n") 
    start_urls = [starting_url] 
    items=[] #list with your URLs 
    f=open("items.txt","w") 

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),) 


    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link not in self.items:  # check if it's already parsed
                self.items.append(link)  # add to the list if it's not parsed yet
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")

Dictionary version:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor 
from scrapy.item import Item, Field 

class MyItem(Item): 
    url=Field() 

class someSpider(CrawlSpider): 
    name = "My script" 
    domain=raw_input("Enter the domain:\n") 
    allowed_domains = [domain] 
    starting_url=raw_input("Enter the starting url with protocol:\n") 
    start_urls = [starting_url] 
    items={} #dictionary with your URLs as keys 
    f=open("items.txt","w") 

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),) 


    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link not in self.items:  # check if it's already parsed
                self.items[link] = 1  # add it as a key if it's not parsed yet (the stored value can be anything)
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")

P.S. You could also collect the items first and only write them to the file afterwards.
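
A minimal sketch of that postscript, assuming the dictionary version above: drop the class-level f=open(...) line, keep only the URLs in self.items, and write everything out once in the spider's closed() hook, which Scrapy calls when the crawl finishes. The two methods below are illustrative and are meant to sit inside someSpider:

    # inside someSpider (with the class-level f=open("items.txt","w") removed)
    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:
                self.items[link.url] = 1  # remember the URL, write nothing yet

    def closed(self, reason):
        # called once when the spider finishes; dump all collected URLs in one pass
        with open("items.txt", "w") as f:
            f.write("\n".join(self.items) + "\n")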

There are many other improvements that could be made to this code, but I'll leave those for you to work out.


How would I do that? – mark


Partly a good idea. A better approach is to hash the visited URLs and check for their presence in a hashmap, which is O(1). – sturcotte06


You could use a Python dictionary with the URLs as keys and then do something like 'if url in myDict: continue else: myDict[url] = True ... (rest of the code)'. – BrunoRB
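
Putting those last two comments together, here is a minimal, self-contained sketch using a set(): a set hashes its members, so the 'in' test is O(1) on average instead of a linear scan over a list. The spider name, domain, and start URL below are hypothetical placeholders standing in for the raw_input() prompts used above.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
    from scrapy.item import Item, Field

    class MyItem(Item):
        url = Field()

    class DedupSpider(CrawlSpider):
        name = "dedup_example"
        allowed_domains = ["example.com"]       # placeholder domain
        start_urls = ["http://example.com/"]    # placeholder start URL
        rules = (Rule(LxmlLinkExtractor(allow_domains=("example.com",)),
                      callback='parse_obj', follow=True),)
        seen = set()  # hashed membership test: O(1) on average
        f = open("items.txt", "w")

        def parse_obj(self, response):
            for link in LxmlLinkExtractor(allow_domains=("example.com",)).extract_links(response):
                if link.url in self.seen:
                    continue  # already recorded, skip it
                self.seen.add(link.url)
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")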