Splitting Scrapy's large CSV file

Is it possible to make Scrapy write CSV files with no more than 5000 rows each? How can I give the files a custom naming scheme? Should I modify CsvItemExporter?

2014-01-08 Crypto

Are you using Linux?

The split command is very useful for this case.

split -l 5000 -d --additional-suffix .csv items.csv items-

See split --help for the options.
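
Where GNU split is not available, roughly the same result can be had from a few lines of Python. The following is only a sketch: the function name `split_csv`, its parameters, and the `items-{:02d}.csv` pattern are made up for illustration. Unlike `split -l`, it counts CSV records rather than physical lines and repeats the header row in every output file, so each file holds the header plus at most `rows_per_file` data rows.

```python
import csv
from itertools import islice


def split_csv(src_path, rows_per_file=5000, name_pattern="items-{:02d}.csv"):
    """Split src_path into numbered CSV files, repeating the header in each.

    Returns the list of file names that were written.
    """
    written = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return written
        index = 0
        while True:
            # Pull at most rows_per_file data rows from the input.
            chunk = list(islice(reader, rows_per_file))
            if not chunk:
                break
            out_path = name_pattern.format(index)
            with open(out_path, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            written.append(out_path)
            index += 1
    return written
```

Calling `split_csv("items.csv")` after the crawl has finished would then produce items-00.csv, items-01.csv, and so on.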


2014-01-09 06:03:24 Rolando

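Yes, I am. The site I'm scraping is huge, with millions of pages. I think it would be better to do this from within Scrapy itself, rather than running the split command from cron before the scraper has finished its work. – Crypto

@Crypto, in that case you can subclass the 'FeedExporter' class and modify the 'item_scraped' method to keep a counter and reopen the exporter when the limit is reached. This can be done by calling the 'close_spider' and 'open_spider' methods. You need to take care to set the filename and to properly handle the deferred returned by 'close_spider'. Adapting the exporter to your use case can be quite tricky, though; a simpler approach is to create a pipeline that does what you need without any subclassing. – Rolando

Try this pipeline: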

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import datetime

from scrapy.exporters import CsvItemExporter


class MyPipeline(object):

    def __init__(self, stats):
        self.stats = stats
        # The "result" directory must already exist.
        self.base_filename = "result/amazon_{}.csv"
        self.next_split = self.split_limit = 50000  # assuming you want to split 50000 items/csv
        self.create_exporter()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def create_exporter(self):
        # Open a new timestamped file and attach a fresh exporter to it.
        # Seconds are included so that two splits within the same minute
        # do not overwrite each other.
        now = datetime.datetime.now()
        datetime_stamp = now.strftime("%Y%m%d%H%M%S")
        self.file = open(self.base_filename.format(datetime_stamp), 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        # 'item_scraped_count' is not set until the first item has passed
        # through all pipelines, so default to 0 instead of indexing the dict.
        if self.stats.get_value('item_scraped_count', 0) >= self.next_split:
            self.next_split += self.split_limit
            self.exporter.finish_exporting()
            self.file.close()
            self.create_exporter()  # the call parentheses were missing
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Finish and close the last file when the crawl ends.
        self.exporter.finish_exporting()
        self.file.close()

Don't forget to add the pipeline to your settings:

ITEM_PIPELINES = { 
    'myproject.pipelines.MyPipeline': 300, 
}
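
The pipeline above takes its split point from Scrapy's stats and names files by timestamp. If sequential file names or an exact per-file row count matter more, one option is to keep a local counter instead. The class below is a plain-Python sketch of that idea, not part of Scrapy: `RotatingCsvWriter`, its methods, and the file name pattern are hypothetical, and it uses the standard `csv` module rather than CsvItemExporter so that it runs on its own.

```python
import csv


class RotatingCsvWriter:
    """Write dict rows to numbered CSV files, starting a new file every max_rows rows."""

    def __init__(self, name_pattern, fieldnames, max_rows=5000):
        self.name_pattern = name_pattern  # e.g. "result/items_{:04d}.csv"
        self.fieldnames = fieldnames
        self.max_rows = max_rows
        self.part = 0            # index of the next file to create
        self.rows_in_file = 0    # data rows written to the current file
        self.file = None
        self.writer = None

    def _open_next(self):
        # Close the current file (if any) and start the next numbered one.
        self.close()
        self.file = open(self.name_pattern.format(self.part), "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=self.fieldnames)
        self.writer.writeheader()
        self.part += 1
        self.rows_in_file = 0

    def write_row(self, row):
        # Files are opened lazily, so no empty file is left behind
        # when nothing is ever written.
        if self.file is None or self.rows_in_file >= self.max_rows:
            self._open_next()
        self.writer.writerow(row)
        self.rows_in_file += 1

    def close(self):
        if self.file is not None:
            self.file.close()
            self.file = None
```

Inside a pipeline, one would presumably create the writer in open_spider, call `write_row(dict(item))` from process_item, and call `close()` from close_spider. The field names have to be known up front here, which is an assumption this sketch makes for simplicity.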


2016-11-17 12:22:38
