2014-05-21 87 views
0

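Unable to crawl or scrape any further

I'm trying to scrape data from https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job (a sample page), but to no avail. I don't know why Scrapy keeps reporting "Filtered offsite request" to another website, with the referer shown as None. I just want to get the job name, the position, and its link. Anyway, here is my code: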

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.http import Request 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
from craigslist_sample.items import CraigslistSampleItem 

class MySpider(CrawlSpider):
    name = "meridian"
    allowed_domains = ["careers-meridianhealth.icims.com"]
    start_urls = ["https://careers-meridianhealth.icims.com"]

    # NOTE: path_deny_base is not defined anywhere in this snippet.
    rules = (
        Rule(SgmlLinkExtractor(deny=path_deny_base, allow=(r'\d+'), restrict_xpaths=('*')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[2]/h1')

        linker = hxs.select('//div[2]/div[8]/a[1]')

        loc_Con = hxs.select('//div[2]/span/span/span[1]')
        loc_Reg = hxs.select('//div[2]/span/span/span[2]')
        loc_Loc = hxs.select('//div[2]/span/span/span[3]')
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            #item["job_id"] = id.select('text()').extract()[0].strip()
            item["title"] = map(unicode.strip, title.select('text()').extract())  # ok
            item["link"] = linker.select('@href').extract()  # ok
            item["info"] = response.url
            # Each location part falls back to "" when the XPath matches nothing.
            temp1 = loc_Con.select('text()').extract()
            temp2 = loc_Reg.select('text()').extract()
            temp3 = loc_Loc.select('text()').extract()
            temp1 = temp1[0] if temp1 else ""
            temp2 = temp2[0] if temp2 else ""
            temp3 = temp3[0] if temp3 else ""
            item["code"] = "{0}-{1}-{2}".format(temp1, temp2, temp3)
            items.append(item)
        return items

Answer

1

If you check your link extractor in the scrapy shell, you'll see that your start URL only has links to websites that are not under "careers-meridianhealth.icims.com":

~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com

In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [2]: lx = SgmlLinkExtractor(allow=('\d+'),restrict_xpaths=('*'))

In [3]: lx.extract_links(response)
Out[3]:
[Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26322652', text=u'NURSE MANAGER ASSISTANT [OPERATING ROOM]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26119218', text=u'WEB DEVELOPER [CORP COMM & MARKETING]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=30441671', text=u'HR Generalist', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=30435857', text=u'OCCUPATIONAL THERAPIST [BHCC REHABILITATION]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/1800DOCTORS.cfm', text=u'1-800-DOCTORS', fragment='', nofollow=False),
Link(url='http://kidshealth.org/PageManager.jsp?lic=184&ps=101', text=u"Kids' Health", fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/HealthInformation/MeridianTunedin2health.cfm', text=u'Meridian Tunedin2health', fragment='', nofollow=False),
Link(url='http://money.cnn.com/magazines/fortune/best-companies/2013/snapshots/39.html?iid=bc_fl_list', text=u'', fragment='', nofollow=False)]

In [4]:

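You can change your rules, add more domains to the allowed_domains attribute, or not define allowed_domains at all (in which case every domain will be crawled, which can mean fetching a LOT of pages).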
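To make the "Filtered offsite request" message less mysterious, here is a minimal standalone sketch of the kind of check an offsite filter performs. It is not Scrapy's actual implementation; the function name is_offsite is made up for illustration, and the matching rule (exact host or subdomain of an allowed domain, with an empty list allowing everything) is an assumption about the general behaviour described above.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True when the URL's host is outside every allowed domain.

    Hypothetical helper for illustration only, not part of Scrapy.
    An empty allowed_domains list is treated as "allow everything".
    """
    if not allowed_domains:
        return False
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # Accept the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return False
    return True


allowed = ["careers-meridianhealth.icims.com"]
links = [
    "https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26322652",
    "https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job",
]
for link in links:
    print(link, "-> offsite" if is_offsite(link, allowed) else "-> allowed")
```

Under this check, every link found on the start page is offsite, which matches the symptom in the question: there is nothing left for the spider to follow.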

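But if you look closely at the page source, you'll notice that it includes an iframe, and if you follow the links, you'll find https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1, which contains the individual job postings: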

~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com

In [1]: sel.xpath('.//iframe/@src') 
Out[1]: [<Selector xpath='.//iframe/@src' data=u'https://careers-meridianhealth.icims.com'>] 

In [2]: sel.xpath('.//iframe/@src').extract() 
Out[2]: [u'https://careers-meridianhealth.icims.com/?in_iframe=1'] 

In [3]: fetch('https://careers-meridianhealth.icims.com/?in_iframe=1') 
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1> from <GET https://careers-meridianhealth.icims.com/?in_iframe=1> 
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&amp;hashed=0&in_iframe=1> from <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1> 
2014-05-21 11:53:14+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&amp;hashed=0&in_iframe=1> (referer: None) 

In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

In [5]: lx = SgmlLinkExtractor() 

In [6]: lx.extract_links(response) 
Out[6]: 
[Link(url='https://careers-meridianhealth.icims.com/jobs/login?back=intro&hashed=0&in_iframe=1', text=u'submit your resume', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1', text=u'view all open job positions', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/reminder?hashed=0&in_iframe=1', text=u'Reset Password', fragment='', nofollow=False), 
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False), 
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)] 

In [7]: fetch('https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1') 
2014-05-21 11:54:24+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1> (referer: None) 

In [8]: lx.extract_links(response) 
Out[8]: 
[Link(url='https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1', text=u'', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1', text=u'LICENSED PRACTICAL NURSE', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5192/certified-nursing-assistant/job?in_iframe=1', text=u'CERTIFIED NURSING ASSISTANT', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5191/receptionist/job?in_iframe=1', text=u'RECEPTIONIST', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5190/rehabilitation-aide/job?in_iframe=1', text=u'REHABILITATION AIDE', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5188/nurse-supervisor/job?in_iframe=1', text=u'NURSE SUPERVISOR', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5164/lpn/job?in_iframe=1', text=u'LPN', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5161/speech-pathologist-per-diem/job?in_iframe=1', text=u'SPEECH PATHOLOGIST PER DIEM', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5160/social-worker-part-time/job?in_iframe=1', text=u'SOCIAL WORKER PART TIME', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5154/client-care-coordinator-nights/job?in_iframe=1', text=u'CLIENT CARE COORDINATOR NIGHTS', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5153/greeter/job?in_iframe=1', text=u'GREETER', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5152/welcome-ambassador/job?in_iframe=1', text=u'WELCOME AMBASSADOR', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5146/certified-medical-assistant-i/job?in_iframe=1', text=u'CERTIFIED MEDICAL ASSISTANT I', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5142/registered-nurse-full-time/job?in_iframe=1', text=u'REGISTERED NURSE FULL TIME', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5139/part-time-home-health-aide/job?in_iframe=1', text=u'PART TIME HOME HEALTH AIDE', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5136/rehabilitation-tech/job?in_iframe=1', text=u'REHABILITATION TECH', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5127/registered-nurse/job?in_iframe=1', text=u'REGISTERED NURSE', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5123/dietary-aide/job?in_iframe=1', text=u'DIETARY AIDE', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5121/tcu-administrator-%5Btransitional-care-unit%5D/job?in_iframe=1', text=u'TCU ADMINISTRATOR [TRANSITIONAL CARE UNIT]', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5119/mds-coordinator/job?in_iframe=1', text=u'MDS Coordinator', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/5108/per-diem-patient-service-tech/job?in_iframe=1', text=u'Per Diem PATIENT SERVICE TECH', fragment='', nofollow=False), 
Link(url='https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1', text=u'Go back to the welcome page', fragment='', nofollow=False), 
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False), 
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)] 

In [9]: 
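Outside the shell, the same iframe lookup can be sketched with the standard library alone. This is only an illustration under assumptions: the class IframeSrcParser, the function iframe_urls, and the sample_html string are all hypothetical, and the real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(html, base_url):
    """Return absolute URLs for all iframes found in the given HTML."""
    parser = IframeSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


# Made-up markup standing in for the real start page.
sample_html = '<html><body><iframe src="/?in_iframe=1"></iframe></body></html>'
print(iframe_urls(sample_html, "https://careers-meridianhealth.icims.com"))
```

The point is that an iframe is just another URL: requesting its src directly returns the embedded document, so the job data can be scraped from that response rather than from the outer page.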

You'll have to follow the pagination links to get all the other job postings.
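As a rough sketch of how the extracted links could be sorted into job postings and pagination links, the snippet below classifies URLs by their shape. The patterns are assumptions read off the shell output above: job pages look like /jobs/<id>/<slug>/job, and the pr query parameter on /jobs/search is assumed to be the page index. The function name classify is hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed shape of a job posting path: /jobs/<numeric id>/<slug>/job
JOB_PATH = re.compile(r"^/jobs/(\d+)/[^/]+/job$")


def classify(url):
    """Return ("job", id), ("page", index) or ("other", None) for a URL."""
    parts = urlparse(url)
    match = JOB_PATH.match(parts.path)
    if match:
        return ("job", match.group(1))
    query = parse_qs(parts.query)
    # Assumption: "pr" on the search page selects a results page.
    if parts.path == "/jobs/search" and "pr" in query:
        return ("page", query["pr"][0])
    return ("other", None)


base = "https://careers-meridianhealth.icims.com"
for url in [
    base + "/jobs/5196/licensed-practical-nurse/job?in_iframe=1",
    base + "/jobs/search?in_iframe=1&pr=1",
    base + "/jobs/intro?in_iframe=1",
]:
    print(classify(url), url)
```

In a spider, the same two patterns could drive two rules: one that follows pagination links without a callback, and one that sends job pages to the parsing callback.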

+0

So how would I extract the data from the job postings? As I already said, I need to crawl recursively so that I can extract more data from each page. – chano

+0

You could try setting 'start_urls = ["https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1"]' –

+0

I saw the result: it visits the links, but the problem is that the data is inside an iframe. – chano