How do I use an item pipeline in Scrapy to save items to a database?

Given this parse_items code in my spider, how do I use an item pipeline in Scrapy to save the items to a database?

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//li[@class='mod-result-entry ']")
    items = []

    for site in sites[:2]:
        item = MyItem()
        item['title'] = myfilter(site.select('dl/a').select("string()").extract())
        item['company'] = myfilter(site.select('dl/h2/em').select("string()").extract())
        items.append(item)
    return items

Now I want to save the items to the database using a Django model. One way that works fine is to simply do this:

item = MYapp.MyDjangoItem() 
item.title = myfilter(site.select('dl/a').select("string()").extract()) 
item.save()

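This works fine.

What I want to know is whether it is OK to save to the database this way.

I mean, why do we need the item pipeline mechanism described in the Scrapy docs? Does it offer any advantage?

For example, this is my pipeline: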

class MyPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        my_item = Myapp.DjangoItem()
        my_item.title = item['title']
        my_item.save()
        return item  # pass the item on to any later pipeline

Is this OK?

Also, how will my code call this pipeline? I'm confused.

Source

2012-12-11 user825904

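Pipelines can be used to sanitize common values. This is particularly useful if you only have one type of object. Saving your Django model instances through a pipeline is fine; the example in the Scrapy docs does something similar by adding a JsonWriter to the pipeline. (That is unnecessary in real life, since there is built-in functionality for it.)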
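As a minimal sketch of the "sanitize common values" idea: a pipeline is just a plain class with a process_item(item, spider) method. The names StripWhitespacePipeline and FakeSpider are hypothetical, and a plain dict stands in for a Scrapy item so that the snippet runs without Scrapy installed.

```python
class StripWhitespacePipeline(object):
    """Hypothetical pipeline: trims whitespace from every string field."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        # Returning the item hands it on to the next pipeline, if any.
        return item


class FakeSpider(object):
    """Stand-in for a real spider; this sketch never inspects it."""
    name = "example"


# Assumption for this sketch: a plain dict plays the role of a Scrapy item.
raw = {"title": "  Software Engineer \n", "company": " ACME "}
cleaned = StripWhitespacePipeline().process_item(raw, FakeSpider())
print(cleaned)
```

In a real project the spider never calls the pipeline itself. Scrapy calls process_item for each item the spider returns once the pipeline class is listed in the ITEM_PIPELINES setting.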

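To put it plainly: yes, saving through a pipeline is fine.

However, when you create several kinds of objects, you may want to differentiate how each is processed. Since the spider is passed as an argument to process_item, this is easy to do, but (IMO) it tends to get rather verbose: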

class MyPipeline(object):
    def process_item(self, item, spider):
        # spider is the spider object, so compare its name
        if spider.name == 'A':
            if item.get('somefield'):
                pass  # ... etc
        elif spider.name == 'B':
            pass  # ... etc
        return item

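Personally, I like the idea behind form cleaning in Django (it checks for existing functions prefixed with 'clean_'). To get similar functionality in Scrapy, I extended my item class: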

class ExtendedItem(Item):
    def _process(self):
        # Call every method whose name ends in "_<field name>", e.g. _process_title
        for name in dir(self):
            if name.split('_')[-1] in self.fields and callable(getattr(self, name)):
                getattr(self, name)()

So now you can do this:

class Book(ExtendedItem):
    title = Field()

    def _process_title(self):
        title = self['title'].lower()
        self.update(title=title)

In this case you can have your pipeline call item._process().
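A rough sketch of that wiring might look as follows. To keep it runnable without Scrapy, SimpleItem is a hypothetical dict-based stand-in for Scrapy's Item (it only mimics the fields attribute), and CleaningPipeline is an illustrative name rather than anything from the original proposal.

```python
class SimpleItem(dict):
    """Hypothetical stand-in for scrapy's Item: a dict with declared fields."""
    fields = ()

    def _process(self):
        # Run every callable attribute whose name ends in "_<field name>".
        for name in dir(self):
            if name.split('_')[-1] in self.fields and callable(getattr(self, name)):
                getattr(self, name)()


class Book(SimpleItem):
    fields = ("title",)

    def _process_title(self):
        # Field-specific cleaning, in the spirit of Django's clean_<field>.
        self.update(title=self["title"].lower())


class CleaningPipeline(object):
    def process_item(self, item, spider):
        # Let the item clean itself if it supports the _process hook.
        if hasattr(item, "_process"):
            item._process()
        return item


book = CleaningPipeline().process_item(Book(title="Dive Into PYTHON"), spider=None)
print(book["title"])
```

The pipeline stays generic this way: the per-field rules live on the item classes, so the pipeline needs no branching on spider or item type.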

Disclaimer

I proposed this idea on github.com a while ago. There may be better implementations (code-wise).

Source

2012-12-11 09:51:13
