How to use Scrapy to collect data from multiple pages into a single data structure

I am trying to scrape data from a website. The data is organized as multiple objects, each with its own set of data: for example, people with names, ages, and occupations.

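My problem is that this data is split across two levels of the website.
The first page is a list of names and ages, with a link to each person's profile page.
Each profile page lists that person's occupation.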

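I already have a spider written in Python that collects the data from the top level and crawls through multiple pages of pagination.
But how can I collect the data from the inner pages while keeping it linked to the appropriate object?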

Currently, my output is structured as JSON like this:

{[name='name',age='age',occupation='occupation'], 
    [name='name',age='age',occupation='occupation']} etc
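The notation above is informal rather than valid JSON. Assuming the intent is a list of objects with one entry per person, a minimal sketch of the target structure could look like this (the values are placeholders, as in the question):

```python
import json

# Placeholder records; the values are illustrative, not scraped data.
people = [
    {"name": "name", "age": "age", "occupation": "occupation"},
    {"name": "name", "age": "age", "occupation": "occupation"},
]

# Serialize the list of per-person objects as a JSON array.
output = json.dumps(people, indent=2)
print(output)
```

Each dictionary corresponds to one person, so the goal is to fill all three keys before the record is emitted.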

Can a parse function span multiple pages like this?

Source

2013-02-14 user2071236

Here is one way to handle this. You need to yield/return the item only once, when it has all of its attributes.

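from scrapy import Request

# Inside one of the spider's methods: request the listing page.
yield Request(page1,
              callback=self.page1_data)

def page1_data(self, response):
    # Extract the listing fields here, e.g. with response.xpath(...)
    # (the original answer used the older HtmlXPathSelector API).
    i = TestItem()
    i['name'] = 'name'
    i['age'] = 'age'
    url_profile_page = 'url to the profile page'

    # Pass the partially filled item to the next callback via meta.
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)


def profile_page(self, response):
    # Retrieve the item that was started on the listing page.
    old_item = response.meta['item']
    # parse other fields
    # assign them to old_item

    # The item now has all of its attributes, so yield it once.
    yield old_item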

Source

2013-02-14 09:11:23
