Scrapy spider: don't crawl pages that are in a list

Currently I have one rule in my Scrapy spider:

rules = [Rule(SgmlLinkExtractor(allow=[r'/item/\d+']), 'parse_item')]

This means that all links like www.site.com/item/123654 get extracted and are then parsed. The number after /item/ is a unique ID. The results of the spidering are stored in a JSON file.

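In addition, I have a CSV file with about 200,000 IDs that have already been crawled, and I don't want those pages to be fetched again, in order to reduce server load. So let's say I build a Python list from this CSV: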

dontparse = [123111, 123222, 123333, 123444, ...]
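
With roughly 200,000 IDs, a set is probably a better fit than a list, since membership tests on a set take constant time on average. Below is a minimal sketch of loading the IDs, assuming the CSV holds one ID per row in its first column. The file layout, the file name in the comment, and the helper name load_crawled_ids are assumptions, not part of the original setup.

```python
import csv
import io

def load_crawled_ids(csv_file):
    """Read integer IDs from the first column of an open CSV file into a set."""
    ids = set()
    for row in csv.reader(csv_file):
        # Skip blank rows and non-numeric cells such as a header line.
        if row and row[0].strip().isdigit():
            ids.add(int(row[0].strip()))
    return ids

# Stand-in for the real file (assumed layout); in practice something like
# open("crawled.csv", newline="") would be passed instead.
sample = io.StringIO("id\n123111\n123222\n\n123333\n")
dontparse = load_crawled_ids(sample)
```

The IDs are stored as integers here, so an ID extracted from a URL has to be converted with int() before it is looked up.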

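Now, I don't want these IDs to simply be ignored. If these links are found during the crawl, I want them to be stored in the JSON file, just with the information available = true. How can this be achieved? Should I add a second rule in the parse_item function?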

EDIT

My parse_item function looks like this:

def parse_item(self, response): 
    sel = Selector(response) 
    item = MyItem() 
    item['url'] = response.url 
    item['name'] = sel.xpath("//h1/text()").extract() 
    return item

Source

2014-03-25 AndiPower

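I have no experience with Scrapy, but why don't you just filter them out afterwards using the dontparse list in an if clause? Or you could use one of the parameters of the SgmlLinkExtractor class, see here: http://doc.scrapy.org/en/latest/topics/link-extractors.html (e.g. deny_domains etc.). (By the way: you should post more code, especially the parse_item function, to get a detailed answer.) – dorvak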

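SgmlLinkExtractor accepts a process_value callable:

a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

So something like this should help: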

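import re

def process_value(value):
    # Every extracted attribute value passes through here, so links that
    # are not item links must be returned unchanged.
    match = re.search(r"/item/(\d+)", value)
    # The captured ID is a string; convert it, assuming
    # already_crawled_site_ids holds integers as in the question's list.
    if match and int(match.group(1)) in already_crawled_site_ids:
        return None  # ignore links to items that were already crawled
    return value

# process_value is a parameter of the link extractor, not of Rule.
rules = [Rule(SgmlLinkExtractor(allow=[r'/item/\d+'], process_value=process_value), 'parse_item')]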
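
Returning None drops the link entirely, so nothing is written to the JSON output for it. If a record with available = true is still wanted for those IDs, as the question asks, one option is to note each skipped ID while filtering and build stub records from them later. Here is a minimal sketch in plain Python. The names make_process_value, skipped_ids and stub_records are hypothetical, and the http:// prefix on the base URL is an assumption on top of the www.site.com/item/ pattern from the question.

```python
import re

ITEM_ID_RE = re.compile(r"/item/(\d+)")

def make_process_value(already_crawled_ids, skipped_ids):
    """Build a process_value callable that also records which IDs it skipped."""
    def process_value(value):
        match = ITEM_ID_RE.search(value)
        if match is None:
            return value  # not an item link, leave it untouched
        item_id = int(match.group(1))
        if item_id in already_crawled_ids:
            skipped_ids.add(item_id)  # remember it, but do not request it
            return None
        return value
    return process_value

def stub_records(skipped_ids, base_url="http://www.site.com/item/"):
    """Build minimal records for links that were seen but not re-crawled."""
    # base_url is an assumed prefix; adjust it to the real site.
    return [{"url": base_url + str(i), "available": True}
            for i in sorted(skipped_ids)]

skipped = set()
process_value = make_process_value({123111, 123222}, skipped)
kept = process_value("http://www.site.com/item/999999")
dropped = process_value("http://www.site.com/item/123111")
records = stub_records(skipped)
```

How the stub records end up in the JSON file depends on how the spider exports its items; that wiring is left out of this sketch.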

Source

2014-03-26 04:09:47 warvariuc
