2014-02-24

Scrapy weird output

I have a Scrapy spider that parses this link.

My spider looks like this:

from scrapy.spider import BaseSpider 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.http import request 
from scrapy.selector import HtmlXPathSelector 
from medsynergies.items import MedsynergiesItem 

class methodistspider(BaseSpider): 

    name="samplemedsynergies" 
    allowed_domains=['msi-openhire.silkroad.com/epostings/'] 
    start_urls=['https://msi-openhire.silkroad.com/epostings/index.cfm?fuseaction=app.jobinfo&jobid=1284&source=ONLINE&JobOwner=992700&company_id=16616&version=1&byBusinessUnit=NULL&bycountry=0&bystate=0&byRegion=&bylocation=NULL&keywords=&byCat=NULL&proximityCountry=&postalCode=&radiusDistance=&isKilometers=&tosearch=yes'] 

    #rules=(
    #Rule(SgmlLinkExtractor(allow=("epostings/index.cfm?fuseaction=app%2Ejobsearch&company_id",))), 
    #Rule(SgmlLinkExtractor(allow=("epostings/index.cfm?fuseaction=app.jobinfo&jobid",)),callback="parse_job",follow=True), 
    #) 

    def parse(self, response): 
        hxs=HtmlXPathSelector(response) 
        titles=hxs.select('//*[@id="jobDesciptionDiv"]') 
        items = [] 

        for titles in titles: 
            item=MedsynergiesItem() 
            item['job_id']=response.url 
            item['title']=titles.select('//*[@id="jobTitleDiv"]/text()').extract() 
            item['tracking_code']=titles.select('//*[@id="trackCodeDiv"]/text()').extract() 
            item['job_description']=titles.select('.//p/text()').extract() 
            item['responsibilities']=titles.select('.//ul/li/text()').extract() 
            item['required_skills']=titles.select('//*[@id="jobRequiredSkillsDiv"]/ul/text()').extract() 
            item['job_location']=titles.select('//*[@id="jobPositionLocationDiv"]/text()').extract() 
            item['position_type']=titles.select('//*[@id="translatedJobPostingTypeDiv"]/text()').extract() 
            items.append(item) 
        print items 
        return items 

The output I get looks like this:

> [{'job_description': [u'The Operations Solution Architect creates the 
> technical vision for Revenue Cycle Management delivery capabilities, 
> ensuring that interdependent applications and infrastructures are 
> aligned. The SA effectively translates business needs into supportable 
> solutions that deliver an excellent customer experience.', 
>      u'Responsibilities:'], 'job_id': 'https://msi-openhire.silkroad.com/epostings/index.cfm?fuseaction=app.jobinfo&jobid=1284&source=ONLINE&JobOwner=992700&company_id=16616&version=1&byBusinessUnit=NULL&bycountry=0&bystate=0&byRegion=&bylocation=NULL&keywords=&byCat=NULL&proximityCountry=&postalCode=&radiusDistance=&isKilometers=&tosearch=yes', 
> 'job_location': [u'\r\n\t\t\t\t\t\tIrving, Texas, United 
> States\r\n\t\t\t\t\t'], 'position_type': 
> [u'\r\n\t\t\t\t\t\tFull-Time/Regular\r\n\t\t\t\t\t'], 
> 'required_skills': [u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n', 
>      u'\r\n'], 'responsibilities': [u'Utilizes technical expertise to create strategic technical vision and 
> architecting solutions for Revenue Cycle Manage delivery 
> capabilities.', 
>      u'Responsible for gathering requirements, architecting the overall design, and executing the design and build 
> phases to ensure RCM solutions and related infrastructures are 
> effectively aligned.', 
>      u'Defines key milestones and deliverables related to new developments in collaboration with senior management 
> and stakeholders.', 
>      u'Collaborates with Solutions Design, ITS and Operations Implementation team to define, design, price and execute 
> new service requirements, new customer accounts, and expanded scope of 
> services.', 
>      u'Develops portfolio strategic plan to ensure alignment with Industry trends and market needs to retaining 
> MedSynergies industry leadership status.', 
>      u'Provides analysis, opportunity assessments and recommendations to optimize and profitably grow portfolio in alignment 
> with established business strategy and goals.\xa0', 
>      u'Performs risk evaluations to ensure that business strategies and evaluations are implemented with clarity and 
> consistency.', 
>      u'Serves as senior subject matter expert on content, processes, and procedures for applicable portfolio 
> offerings.', 
>      u'Tracks project milestones and deliverables. Develops and delivers progress reports presentations to stake holders 
> and senior management', 
>      u'Assists with the transfer of knowledge of technical skills. Provides coaching to less experienced employees.', 
>      u'Participates in special projects and/or completes other duties as assigned.'], 'title': 
> [u'\r\n\t\t\t\t\tSolutions Architect\r\n\t\t\t\t'], 'tracking_code': 
> [u'\r\n\t\t\t\t\t\tTracking Code\r\n\t\t\t\t\t']}] 

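So my question is: is there a better way to define my XPaths so that newline (\n) and tab (\t) characters do not appear in the output? Also, the required_skills field fails to scrape any text from the page. I would like to know what I am doing wrong.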

Thanks in advance!

Answers

2

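If you know that an XPath expression should give you a single output string value, you can wrap the expression in normalize-space(). Also, inside the "for titles in titles" loop you should use relative XPath expressions (starting with .//), not absolute ones (starting with //). An absolute expression searches the whole document regardless of which node you call select() on, so it ignores the element the loop is currently on.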

For example:

item['tracking_code']=titles.select('normalize-space(.//*[@id="trackCodeDiv"]/text())').extract() 
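
To make it concrete what normalize-space() does to a single string, here is a rough pure-Python equivalent. The helper name normalize_space is made up for illustration and is not part of Scrapy. It follows the XPath 1.0 definition (strip leading and trailing whitespace, collapse each run of space, tab, CR or LF into one space) and should be read as a sketch, not as what the selector runs internally.

```python
import re

# XPath 1.0 treats only space, tab, carriage return and line feed as whitespace.
_XPATH_WS = re.compile(r'[ \t\r\n]+')

def normalize_space(text):
    """Hypothetical helper mimicking XPath normalize-space() on one string."""
    # Collapse every whitespace run to one space, then drop the ends.
    return _XPATH_WS.sub(' ', text).strip(' ')

# Value copied from the 'title' field in the question's output.
raw_title = '\r\n\t\t\t\t\tSolutions Architect\r\n\t\t\t\t'
clean_title = normalize_space(raw_title)
print(clean_title)
```

Note that a non-breaking space (\xa0), which shows up in one of the responsibilities strings, is not whitespace by this definition and would be left in place.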

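For required_skills, the expression ul/text() selects only the text nodes that are direct children of the ul element, which are just the line breaks between the li items; that is why you get a list of '\r\n' strings. I suggest you try normalize-space(.//*[@id="jobRequiredSkillsDiv"]/ul), which works on all the text inside the ul: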

item['required_skills']=titles.select('normalize-space(.//*[@id="jobRequiredSkillsDiv"]/ul)').extract()  
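
The difference between the direct text children of the ul and its full string value can be seen with the standard library alone. The markup below is hypothetical (the real page may be structured differently), and xml.etree.ElementTree stands in for the Scrapy selector only to illustrate where the text lives.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page's structure may differ.
snippet = (
    '<div id="jobRequiredSkillsDiv"><ul>\n'
    '<li>Skill A</li>\n'
    '<li>Skill B</li>\n'
    '</ul></div>'
)
ul = ET.fromstring(snippet).find('ul')

# Roughly what ul/text() selects: text nodes that are direct children of <ul>.
# In ElementTree these are ul.text plus the tail after each child element.
direct_text = [ul.text] + [li.tail for li in ul]

# Roughly what normalize-space(ul) works on: all descendant text, concatenated.
string_value = ''.join(ul.itertext())
normalized = ' '.join(string_value.split())

print(direct_text)
print(normalized)
```

The first list contains only line breaks, while the second value contains the actual skill text.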
1

You can also clean up the values in Python:

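def clean(item): 
    # Assumes every value is a list of strings, as returned by .extract() 
    data = {} 
    for k, v in item.items():  # .items() works on Python 2 and 3 
        # Strip each fragment, join with spaces, then trim the result 
        data[k] = ' '.join([val.strip() for val in v]).strip() 
    return data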
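
One thing to watch for: in the question's spider, job_id is set from response.url and is a plain string, not a list, so joining over it would put a space between every character. Below is a possible variant, written for Python 3 and run on a plain dict instead of a real item. The name clean_item and the shortened job_id URL are assumptions made for the example; the other sample values come from the question's output.

```python
def clean_item(item):
    """Collapse each field to one stripped string; pass plain strings through."""
    data = {}
    for key, value in item.items():
        if isinstance(value, str):
            # A field such as job_id that is already a single string.
            data[key] = value.strip()
        else:
            # A list of fragments, as returned by .extract().
            data[key] = ' '.join(part.strip() for part in value).strip()
    return data

sample = {
    'job_id': 'https://example.com/job?id=1284',  # hypothetical, shortened URL
    'title': ['\r\n\t\t\t\t\tSolutions Architect\r\n\t\t\t\t'],
    'job_location': ['\r\n\t\t\t\t\t\tIrving, Texas, United States\r\n\t\t\t\t\t'],
    'required_skills': ['\r\n', '\r\n', '\r\n'],
}
cleaned = clean_item(sample)
print(cleaned)
```

This only tidies whatever was extracted. A field like required_skills still ends up empty unless the XPath itself is fixed.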