I'm using the Python crawler framework Scrapy, and I use the pipelines.py file to store my scraped items to a file in JSON format. The code that does this is given below. Why does the crawler produce duplicate items when it is run twice?
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file
        json.dump(d, self.file)
        return item
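For reference, the setdefault chain in the loop above builds a nested dictionary keyed first by foruri, then by rec, then by foruri_id. A minimal standalone sketch of the same structure, using hypothetical sample values (real items come from the spider):

```python
# Hypothetical item mimicking the fields the pipeline expects; "thisurl"
# is a single string while the other fields are parallel lists.
item = {
    "foruri": ["http://a.com", "http://a.com"],
    "rec": ["rec1", "rec2"],
    "foruri_id": ["id1", "id2"],
    "thisurl": "http://me.com",
    "thisid": ["x1", "x2"],
}

d = {}
for i in range(len(item["foruri"])):
    # setdefault inserts an empty dict the first time a key is seen,
    # and returns the existing dict on later iterations.
    d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[
        item["foruri_id"][i]
    ] = item["thisurl"] + ":" + item["thisid"][i]

# d == {"http://a.com": {"rec1": {"id1": "http://me.com:x1"},
#                        "rec2": {"id2": "http://me.com:x2"}}}
```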
The problem is that when I run my crawler twice (say), I get duplicate scraped items in my file. I tried to prevent this by first reading from the file and matching the existing data against the new data about to be written; since the data read from the file is in JSON format, I decoded it with the json.loads() function, but it doesn't work:
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")
        self.temp = json.loads(file.read())

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file
        if d != self.temp:  # check that the newly generated data isn't already in the file
            json.dump(d, self.file)
        return item
Please suggest a way to do this.
Note: I open the file in "append" mode because I may crawl a different set of links, but running the crawler twice with the same start_url should not write the same data to the file twice.
I think the solution is to prevent multiple instances of the script from running at the same time. You could use file locking (inside the script, or externally with a utility like flock). What is the reason for multiple crawler instances? – Gregory 2011-03-15 21:48:25