I'm using the Python crawler framework Scrapy, and I use the pipelines.py file to store my scraped items to a file in JSON format. The code that does this is given below. Why does the crawler produce duplicate items when it is run twice?
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file
        json.dump(d, self.file)
        return item
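For reference, the setdefault chain in the loop above builds a nested dictionary keyed first by foruri, then by rec, then by foruri_id. A minimal standalone sketch of the same structure, using hypothetical sample values (real items come from the spider):

```python
# Hypothetical item mimicking the fields the pipeline expects; "thisurl"
# is a single string while the other fields are parallel lists.
item = {
    "foruri": ["http://a.com", "http://a.com"],
    "rec": ["rec1", "rec2"],
    "foruri_id": ["id1", "id2"],
    "thisurl": "http://me.com",
    "thisid": ["x1", "x2"],
}

d = {}
for i in range(len(item["foruri"])):
    # setdefault inserts an empty dict the first time a key is seen,
    # and returns the existing dict on later iterations.
    d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[
        item["foruri_id"][i]
    ] = item["thisurl"] + ":" + item["thisid"][i]

# d == {"http://a.com": {"rec1": {"id1": "http://me.com:x1"},
#                        "rec2": {"id2": "http://me.com:x2"}}}
```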
The problem is that when I run my crawler twice (say), I get duplicate scraped items in my file. I tried to prevent this by first reading from the file and matching the existing data against the new data about to be written; since the data read from the file is in JSON format, I decoded it with the json.loads() function, but it doesn't work:
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")
        self.temp = json.loads(file.read())

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file
        if d != self.temp:  # check that the newly generated data isn't already in the file
            json.dump(d, self.file)
        return item
Please suggest a way to do this.
Note: I open the file in "append" mode because I may crawl a different set of links, but running the crawler twice with the same start_url should not write the same data to the file twice.
I think the solution is to prevent multiple instances of the script from running at the same time. You could use file locking (inside the script, or externally with a utility like flock). What is the reason for multiple crawler instances? – Gregory 2011-03-15 21:48:25