2015-02-10 49 views

How to join parse() results in Scrapy

Suppose there is a product with the following JSON structure, i.e. a product with multiple links to crawl:

[
    {
        "id": "888",
        "suppliers": {
            "shop1": {
                "url": "http://www.example1.com./item1",
                "price": "19.99"
            },
            "shop2": {
                "url": "http://www.example2.com./item2",
                "price": "29.95"
            }
        }
    }
]

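I am using Scrapy to crawl these two sites and update the prices. Everything works, except that Scrapy returns the two results separately.

How can I "combine" the results from the two links, i.e. form a single object like the JSON structure above, on one line?

Here is the snippet I am currently using. Any help would be appreciated.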

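from scrapy import Spider

# ProductItem (definition not shown) is assumed to declare 'id' and 'suppliers' fields


class ProductSpider(Spider):
    name = "productspider"
    allowed_domains = ['example1.com', 'example2.com']
    start_urls = ['http://www.example1.com./item1', 'http://www.example2.com./item2']

    def parse(self, response):
        # parse() runs once per downloaded page, so a separate item is built for each URL
        item = ProductItem()
        item['id'] = '888'
        item['suppliers'] = {'shop1': '', 'shop2': ''}

        # download_slot identifies the download queue; by default it is the host name
        if response.meta['download_slot'] == 'www.example1.com':
            parse_example1_page()  # and assign it to item shop1

        if response.meta['download_slot'] == 'www.example2.com':
            parse_example2_page()  # and assign it to item shop2

        yield item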
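Do you only need to visit two URLs to form one item, or does this need to scale to an arbitrary number? Thanks. – alecxe 2015-02-10 01:09:20

I used two URLs for simplicity; it could be up to 10 or any other number. – Chung 2015-02-10 01:11:46

OK, so you know all the URLs in advance and they are all kept in 'start_urls', right? Thanks. – alecxe 2015-02-10 01:13:46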

Answer

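Your desired output is a reorganization of the data you scrape. Trying to combine the extraction and the processing would be fragile and probably hard to understand. The scraped data may even be useful in its raw form (you could combine different crawls, apply different processing, and so on). Consider splitting the task into two parts: scraping the data, and post-processing it into the format you want. You already have the scraping part; here is an example of the post-processing. I use a simple JSON format with one record per line, which has the advantage that the whole (raw) dataset never has to be loaded into memory. You can use whatever intermediate storage you like.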

import json
from collections import defaultdict

# the (fake) fetching
scrapy_data = [
    {"id": "888", "url": "blah.com/888", "shop": "shop1", "price": 99.2},
    {"id": "3", "url": "blah.com/3", "shop": "shop1", "price": 33.1},
    {"id": "888", "url": "foo.com/888", "shop": "shop2", "price": 423.0},
    {"id": "42", "url": "foo.com/42", "shop": "shop2", "price": 1.20},
]

with open('records.json', 'w') as fh:
    # pretend the data items are coming from scrapy
    for item in scrapy_data:
        json.dump(item, fh)
        fh.write("\n")


# the (real) processing
# maps product id -> {shop name -> {"url": ..., "price": ...}}
products = defaultdict(dict)

with open('records.json') as fh:
    # read one record per line so the whole file is never held in memory
    for line in fh:
        item = json.loads(line)
        pid, url, shop, price = item["id"], item["url"], item["shop"], item["price"]
        products[pid][shop] = {"url": url, "price": price}

# items() works on both Python 2 and Python 3
collated = [{"id": key, "suppliers": val} for key, val in products.items()]

print(json.dumps(collated, sort_keys=True, indent=2))

The output looks like this (the order of the products depends on dict ordering and may differ):

[
  {
    "id": "3",
    "suppliers": {
      "shop1": {
        "price": 33.1,
        "url": "blah.com/3"
      }
    }
  },
  {
    "id": "888",
    "suppliers": {
      "shop1": {
        "price": 99.2,
        "url": "blah.com/888"
      },
      "shop2": {
        "price": 423.0,
        "url": "foo.com/888"
      }
    }
  },
  {
    "id": "42",
    "suppliers": {
      "shop2": {
        "price": 1.2,
        "url": "foo.com/42"
      }
    }
  }
]
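
To feed this post-processing step from the spider in the question, each downloaded page would need to become one flat record with a "shop" key. Below is a minimal sketch of that step, assuming Python 3. The host-to-shop mapping SHOP_BY_HOST and the helpers make_record and to_json_line are hypothetical names introduced for illustration; they are not part of Scrapy.

```python
import json
from urllib.parse import urlparse

# Hypothetical mapping from host name to the shop key used in the flat records.
SHOP_BY_HOST = {
    "www.example1.com": "shop1",
    "www.example2.com": "shop2",
}


def make_record(product_id, url, price):
    """Build one flat record for a single scraped page."""
    # Strip a trailing dot so "www.example1.com." matches the mapping key.
    host = urlparse(url).hostname.rstrip(".")
    return {"id": product_id, "url": url, "shop": SHOP_BY_HOST[host], "price": price}


def to_json_line(record):
    """Serialize a record as one line of the intermediate file."""
    return json.dumps(record, sort_keys=True) + "\n"


record = make_record("888", "http://www.example1.com./item1", "19.99")
print(to_json_line(record), end="")
```

A dict like this could be yielded from parse() (recent Scrapy versions accept plain dicts as items), or its JSON line could be appended to records.json, so that the processing step above picks it up unchanged.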