Scrapy: multiple item classes with extract methods inside them

Just to be clear: I am not an experienced programmer, so please don't be mad at me... I am exploring what Scrapy can do (I have some Python programming skills).

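Scraping a website: let's imagine we can extract some information from Open Graph (og:) tags, such as 'title', 'url' and 'description', and other information from schema.org, such as 'author'. In the end we want 'title', 'url', 'description' and 'date', which should be extracted from the HTML with "normal" XPath only if they are not available from Open Graph (og:) and schema.org.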

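I created 3 item classes, OpengraphItem(Item), SchemaItem(Item) and MyItem(Item), in separate .py files. Each class has an extract function that extracts its own fields, as in the following example: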

class OpengraphItem(Item):
    title = Field()
    url = Field()
    description = Field()

    def extract(self, hxs):
        self.title = hxs.xpath('/html/head/meta[@property="og:title"]/@content').extract()
        self.url = hxs.xpath('/html/head/meta[@property="og:url"]/@content').extract()
        self.description = hxs.xpath('/html/head/meta[@property="og:description"]/@content').extract()

Then, in the spider code, the extract functions would be called like this:

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)

    my_item = MyItem()
    item_opengraph = OpengraphItem()
    item_opengraph.extract(hxs)

    item_schema = SchemaItem()
    item_schema.extract(hxs)

    my_item['date'] = hxs.xpath('/html/body//*/div[@class="reviewDate"]/span/time[@class="dtreviewed"]/@content').extract()

    my_item['title'] = item_opengraph.get('title', None)
    my_item['url'] = item_opengraph.get('url', None)
    my_item['description'] = item_opengraph.get('description', None)

    if my_item['url'] is None:
        my_item['url'] = response.url

    if my_item['title'] is None:
        my_item['title'] = hxs.xpath('/html/head/title/text()').extract()

    if my_item['description'] is None:
        my_item['description'] = hxs.xpath('/html/head/meta[@name="description"]/@content').extract()

    return my_item

Does this make sense? Is it right to create the extract method inside the item classes?

I looked at other questions: "scrapy crawler to pass multiple item classes to pipeline" - I don't know whether it is correct to have only one items.py with several different classes inside it.

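"Scrapy item extraction scope issue" and "scrapy single spider to pass multiple item classes to pipeline" - should I have an Item Pipeline? I am not familiar with these, but judging by what the Scrapy documentation says they are for, I don't think they fit this problem. And what about Item Loaders?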

I have omitted parts of the code.


2015-04-16 Inês Martins

Yes. You can put the classes in separate files or in the same file. If they are in separate files, make sure you import them correctly. – MattDMo

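It looks like you are asking many questions at once. If you asked separate questions, each specific doubt or problem would probably be clearer. –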

Is it right to create the extract method inside the item classes?

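It is very unusual. I can't say it is incorrect, since the code would still work, but usually all code related to the page structure (such as selectors) is kept in the Spider.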

Item Loaders may be useful for what you are trying to do; you should definitely give them a try.

Another thing: assigning item fields as attributes, like

def extract(self, hxs):
    self.title = hxs [...]

will not work. Scrapy items behave like dictionaries, so you should assign to, for example, self['title'].


2015-04-16 21:17:11

Thanks for your answer. I tried def extract(self, hxs): self.['title'] = hxs [...] but it gives an error: "exceptions.AttributeError: Use item['title'] = [...] to set field value ^ SyntaxError: invalid syntax" –

self['title'], without the "." –

Thanks for your help –
