Need a little tweak in my scrapy code to get rid of redundant data

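I've written a spider in scrapy to scrape coffee shops from yellowpages. The total should be around 870 records, but I'm getting around 1200, with a small number of duplicates. Also, in the CSV output the data is placed on every alternate row, with a blank row in between. I'm hoping someone can suggest a workaround. Thanks in advance.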

The project folder is named "yellpg", and "items.py" contains:

from scrapy.item import Item, Field

class YellpgItem(Item):
    name = Field()
    address = Field()
    phone = Field()

The spider file is named "yellsp.py" and contains:

from scrapy.spiders import CrawlSpider, Rule  # scrapy.contrib.spiders is the old, deprecated path
from scrapy.linkextractors import LinkExtractor
from yellpg.items import YellpgItem

class YellspSpider(CrawlSpider):
    name = "yellsp"
    allowed_domains = ["yellowpages.com"]
    start_urls = (
        'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=1',
    )
    # Follow every pagination link and hand each result page to parse_item
    rules = (Rule(LinkExtractor(allow=(r'&page=.*',)), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Each listing on a result page sits in a <div class="info"> block
        page = response.xpath('//div[@class="info"]')
        for titles in page:
            item = YellpgItem()
            item["name"] = titles.xpath('.//span[@itemprop="name"]/text()').extract()
            item["address"] = titles.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()').extract()
            item["phone"] = titles.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()').extract()
            yield item

To get the CSV output, I'm running this on the command line:

scrapy crawl yellsp -o items.csv

Source

2017-04-04 SIM

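I could recommend creating a pipeline that stores the items so you can later check whether a new item is a duplicate, but that isn't a real solution, because it can cause memory problems.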
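For reference, here is a minimal sketch of what such an in-memory pipeline could look like. The class name DuplicatesPipeline, the helper _hashable, and the choice of (name, phone) as the duplicate key are assumptions made for illustration, not something from the question. The fallback DropItem class is only there so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be run without Scrapy installed.
    class DropItem(Exception):
        pass


def _hashable(value):
    # extract() returns lists, which cannot be stored in a set directly.
    if isinstance(value, (list, tuple)):
        return tuple(value)
    return value


class DuplicatesPipeline:
    """Drop items whose (name, phone) pair was already seen in this crawl."""

    def __init__(self):
        # Assumption: name + phone identifies a shop. This set grows with
        # every unique item, which is the memory cost mentioned above.
        self.seen = set()

    def process_item(self, item, spider):
        key = (_hashable(item.get("name")), _hashable(item.get("phone")))
        if key in self.seen:
            raise DropItem("Duplicate item: %r" % (key,))
        self.seen.add(key)
        return item
```

If this lived in yellpg/pipelines.py, it would be enabled through ITEM_PIPELINES in settings.py like any other pipeline. For roughly a thousand items the set stays tiny, so the memory concern only matters for much larger crawls.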

The real solution here is to avoid "storing" duplicates in your final database in the first place.

Define which field of your item will be used as the index in the database, and everything should work fine.
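As a rough illustration of the index idea, the sketch below uses SQLite from the standard library. The answer does not name a database, so SQLite, the table name shops, the helper names open_store and save_shop, and the (name, phone) unique constraint are all assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shops ("
        " name TEXT, address TEXT, phone TEXT,"
        " UNIQUE (name, phone))"  # the database itself rejects duplicates
    )
    return conn


def save_shop(conn, name, address, phone):
    """Insert a row; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO shops (name, address, phone) VALUES (?, ?, ?)",
        (name, address, phone),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint caused the insert to be ignored.
    return cur.rowcount == 1
```

One caveat: SQLite treats NULL values as distinct inside a UNIQUE constraint, so rows with a missing phone would not be deduplicated by this schema.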

Source

2017-04-04 22:07:48 eLRuLL

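Hi eLRuLL, thanks for taking the time to answer. Duplicates are not a big deal here; there are very few of them. Perhaps the site uses ads to feature different coffee shops, which would explain why I'm getting more data than I expected. – SIM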

The best way is to use CsvItemExporter in your pipeline. Create a file named pipelines.py in your scrapy project and add the lines of code below.

from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_coffer_shops.csv' % spider.name, 'w+b')  # hard coded filename, not a good idea
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Now, in settings.py:

ITEM_PIPELINES = {
    'your_project_name.pipelines.CSVExportPipeline': 300,
}

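Adding these lines enables the custom CsvItemExporter pipeline, which will export your data in CSV format. If you don't get the data as expected, you can modify the process_item method to suit your needs.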
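As one example of such a modification: the spider's extract() calls return lists, so each field reaches the exporter as a list. A small helper could flatten those lists before export. This is only a sketch; the name flatten_item and the "; " separator are assumptions for illustration.

```python
def flatten_item(item, sep="; "):
    """Return a plain dict with list-valued fields joined into one string."""
    flat = {}
    for field, value in dict(item).items():
        if isinstance(value, (list, tuple)):
            # Strip whitespace around each part and skip empty parts.
            value = sep.join(part.strip() for part in value if part.strip())
        flat[field] = value
    return flat
```

Inside process_item you could then try self.exporter.export_item(flatten_item(item)) in place of exporting the raw item, assuming your Scrapy version accepts plain dicts as items.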

Source

2017-04-04 22:25:24 Rahul

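Hi Rahul, thanks for your answer. I ran the code with the changes you suggested here, but compared with the results I was getting before, nothing has changed at all. Anyway, thanks again. By the way, isn't there any shortcut to get the data into the CSV file without the blank alternate rows? – SIM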

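In Python 2, open the outfile with mode 'wb' instead of 'w'. csv.writer writes '\r\n' directly into the file. If you don't open the file in binary mode, it will write '\r\r\n', because on Windows text mode translates each '\n' into '\r\n'. Try changing the line file = open('%s_coffer_shops.csv' % spider.name, 'w+b') to file = open('%s_coffer_shops.csv' % spider.name, 'wb'). – Rahul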

I'm using Python 3 here. – SIM
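On Python 3 the csv module works on text files, and the documented way to avoid the extra carriage return is to open the file with newline=''. The sketch below reproduces the '\r\r\n' effect using only the standard library. It simulates Windows text-mode translation with io.TextIOWrapper(newline='\r\n') so it behaves the same on any platform. The function name write_rows and the sample rows are made up for illustration, and whether this is what causes the blank rows in the Scrapy output above is an assumption.

```python
import csv
import io


def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    # newline='\r\n' mimics Windows text mode: each '\n' becomes '\r\n'.
    # newline='' disables translation, as the csv docs recommend.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    csv.writer(text).writerows([["name", "phone"], ["Cafe A", "555-0100"]])
    text.flush()
    return raw.getvalue()


translated = write_rows("\r\n")   # rows end in '\r\r\n', shown as blank rows
untranslated = write_rows("")     # rows end in a single '\r\n'
print(translated)
print(untranslated)
```

If you write the CSV yourself on Python 3, the equivalent call would be open(path, 'w', newline='').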
