Creating a Scrapy item array across multiple parses

I'm using Scrapy to scrape listings. My script first parses the listing URLs with parse_node, then parses each listing with parse_listing, and for each listing parses the listing's agents with parse_agent. I want to create an array that is built up as Scrapy parses through the listings and agents, and reset for each new listing.

Here is my parsing script:

from scrapy import Request

def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    yield response.xpath('//node[@id="ListingId"]/text()').extract_first()
    yield response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    for agent in (response.xpath('//node[@id="Agents"]/text()').extract_first() or "").split('^'):
        yield Request('AGENT LINK', callback=self.parse_agent)

def parse_agent(self, response):
    yield response.xpath('//node[@id="AgentName"]/text()').extract_first()
    yield response.xpath('//node[@id="AgentEmail"]/text()').extract_first()

I would like parse_listing to result in:

{ 
'id':123, 
'title':'Amazing Listing' 
} 

and then parse_agent to append to the listing's agent array:

{
'id':123,
'title':'Amazing Listing',
'agent':[
    {
    'name':'jon doe',
    'email':'[email protected]'
    },
    {
    'name':'jane doe',
    'email':'[email protected]'
    }
]
}

How do I take the results from each level and build up an array?

Answers


This is a somewhat complicated case: you need to form a single item from multiple different URLs.

Scrapy allows you to carry data over between requests in the request's meta attribute, so you can do something like this:

from collections import defaultdict
from scrapy import Request

def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    item = defaultdict(list)
    item['id'] = response.xpath('//node[@id="ListingId"]/text()').extract_first()
    item['title'] = response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    agent_urls = (response.xpath('//node[@id="Agents"]/text()').extract_first() or "").split('^')
    # find all agent urls and start with the first one
    url = agent_urls.pop(0)
    # we want to go through agent urls one-by-one and update a single item with agent data
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})

def parse_agent(self, response):
    item = response.meta['item']  # retrieve the item generated in the previous request
    agent = dict()
    agent['name'] = response.xpath('//node[@id="AgentName"]/text()').extract_first()
    agent['email'] = response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
    item['agents'].append(agent)
    # check if we have any more agent urls left
    agent_urls = response.meta['agent_urls']
    if not agent_urls:  # we crawled all of the agents!
        yield item
        return
    # if we do - crawl the next agent and carry over our current item
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})
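
On newer Scrapy versions (1.7+), the same hand-off can also be written with Request.cb_kwargs, which delivers the carried values as callback arguments instead of packing them into meta. A minimal sketch of the same chain, assuming the same hypothetical XPaths and '^'-separated agent list as above:

from collections import defaultdict
from scrapy import Request

def parse_listing(self, response):
    item = defaultdict(list)
    item['id'] = response.xpath('//node[@id="ListingId"]/text()').extract_first()
    item['title'] = response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    agent_urls = (response.xpath('//node[@id="Agents"]/text()').extract_first() or "").split('^')
    url = agent_urls.pop(0)
    # each cb_kwargs entry arrives as a keyword argument of the callback
    yield Request(url, callback=self.parse_agent,
                  cb_kwargs={'item': item, 'agent_urls': agent_urls})

def parse_agent(self, response, item, agent_urls):
    item['agents'].append({
        'name': response.xpath('//node[@id="AgentName"]/text()').extract_first(),
        'email': response.xpath('//node[@id="AgentEmail"]/text()').extract_first(),
    })
    if not agent_urls:  # no agents left: the item is complete
        yield item
        return
    # otherwise keep walking the agent urls, carrying the item along
    yield Request(agent_urls.pop(0), callback=self.parse_agent,
                  cb_kwargs={'item': item, 'agent_urls': agent_urls})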

Alternatively, fetch the agent pages with the requests library (with lxml for parsing), create a hash holding an agents list, and append the data from each request to that list:

import requests
from lxml import html

listing = {"title": "amazing listing", "agents": []}

agentUrls = ["list", "of", "urls", "from", "scraped", "page"]

for agentUrl in agentUrls:
    agentPage = requests.get(agentUrl)
    agentTree = html.fromstring(agentPage.content)
    # lxml's xpath() returns a plain list, not a Scrapy selector, so take the first match
    names = agentTree.xpath('//node[@id="AgentName"]/text()')
    emails = agentTree.xpath('//node[@id="AgentEmail"]/text()')
    agent = {"name": names[0] if names else None, "email": emails[0] if emails else None}
    listing["agents"].append(agent)
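
Note that this second approach fetches the agent pages synchronously with the requests library, outside Scrapy's scheduler, so it gives up Scrapy's concurrency and retry handling; the meta (or cb_kwargs) hand-off in the first answer keeps the whole chain inside the crawl.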