Scrapy: scraping data contained in JavaScript

I am trying to scrape http://virtuacareers.com/new-jersey/staff-nurse/jobid3462987-registered-nurse-%28rn%29-jobs with Scrapy.

One piece of data I want from this page is a link, but when I look at my CSV file, the link is:

javascript:GetApplyClickCount('https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', 'http://virtuacareers.com/list.aspx?state=voorhees&category=staff+nurse&jobtitle=registered+nurse+(rn)&jobid=3025458&dmaid=1286&dmaname=voorhees', 'SameWindow', 'scrollbars=1, toolbar=1, resizable=1, location=1, directories=1, status=1, menubar=1, copyhistory=1, fullscreen=1', 'true', '0', '0', 'virtuacareers.com', '', '', '3025458', 'Registered Nurse (RN)','212','True','','False');

What I want to get is only:

https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622

How should I do this? This is my code for it:

linker = hxs.select('//div[@class="box jobDesc"]/a')
item["link"] = linker.select('@href').extract()

Source

2013-08-29 chano

One way is to extract the URL using a regular expression:

>>> import re 
>>> s = "javascript:GetApplyClickCount('https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', 'http://virtuacareers.com/list.aspx?state=voorhees&category=staff+nurse&jobtitle=registered+nurse+(rn)&jobid=3025458&dmaid=1286&dmaname=voorhees', 'SameWindow', 'scrollbars=1, toolbar=1, resizable=1, location=1, directories=1, status=1, menubar=1, copyhistory=1, fullscreen=1', 'true', '0', '0', 'virtuacareers.com', '', '', '3025458', 'Registered Nurse (RN)','212','True','','False');" 
>>> re.search(r"'(?P<url>https?://[^\s']+)'", s).group("url")
'https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622'

In your case, that would be:

link = linker.select('@href').extract()[0]
item["link"] = re.search(r"'(?P<url>https?://[^\s']+)'", link).group("url")
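A slightly more defensive variant, offered only as a sketch: the helper name `extract_apply_url` is hypothetical (it is not part of Scrapy), and the `href` value is a shortened stand-in for the real attribute. It returns `None` when no quoted URL is found, rather than failing on `.group`.

```python
import re

# Matches the first single-quoted http(s) URL inside a javascript: href.
QUOTED_URL = re.compile(r"'(?P<url>https?://[^\s']+)'")


def extract_apply_url(href):
    """Return the first quoted URL in href, or None if there is none."""
    match = QUOTED_URL.search(href)
    return match.group("url") if match else None


# Shortened stand-in for the href shown in the question.
href = ("javascript:GetApplyClickCount("
        "'https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', "
        "'http://virtuacareers.com/list.aspx?state=voorhees', 'SameWindow');")
print(extract_apply_url(href))
```

In the spider this could be called as `item["link"] = extract_apply_url(link)`.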

Source

2013-08-29 09:37:20 alecxe

An exception occurred. TypeError: expected string or buffer – chano

Try `item["link"] = re.search("\'(?P<url>https?://[^\s]+)\'", link[0]).group("url")`

@alecxe indeed ;) You're welcome
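The TypeError in the comments above most likely comes from passing the list returned by `extract()` straight to `re.search`, which expects a single string. A minimal sketch, with a plain list (the illustrative name `links`) standing in for the `extract()` result:

```python
import re

pattern = r"'(?P<url>https?://[^\s']+)'"

# Stand-in for linker.select('@href').extract(), which returns a list of strings.
links = ["javascript:GetApplyClickCount('https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', 'SameWindow');"]

error = None
try:
    re.search(pattern, links)  # passing the whole list raises TypeError
except TypeError as exc:
    error = str(exc)

# Index into the list, guarding against an empty result first.
url = re.search(pattern, links[0]).group("url") if links else None
print(url)
```

Checking that the list is non-empty before indexing avoids an IndexError on pages that have no matching link.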
