I am currently working on downloading tags with Scrapy. I am implementing pagination over pages like the ones below, but Scrapy sometimes does not extract the tags from the response:
https://careers-preftherapy.icims.com/jobs/search?pr=1
https://careers-preftherapy.icims.com/jobs/search?pr=2
Here is my spider.py code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["careers-preftherapy.icims.com"]
    start_urls = [
        "https://careers-preftherapy.icims.com/jobs/search"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Read the "Page 1 of N" paging cell and keep only the trailing page count.
        pageCount = hxs.select('//td[@class = "iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0].strip()[-2:].strip()
        # Yield one request per results page.
        for i in range(1, int(pageCount) + 1):
            yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d" % i, callback=self.parsePage)

    def parsePage(self, response):
        hxs = HtmlXPathSelector(response)
        # The job links alternate between "odd" and "even" table cells.
        urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_odd_id, ">>>>>>>odddddd>>>>>>>>>>>>>>>>"
        urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_even_id, ">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
        urls_list = []
        urls_list.extend(urls_list_odd_id)
        urls_list.extend(urls_list_even_id)
        for i in urls_list:
            yield Request(i.encode('utf-8'), callback=self.parseJob)

    def parseJob(self, response):
        pass
here ........... and so on
I yield a request for each page URL (suppose there are 6 pages here). When Scrapy reaches the first URL (https://careers-preftherapy.icims.com/jobs/search?pr=1) I want to collect all the href tags on that page, and when it reaches the second URL I want to collect all the href tags there in the same way.
Now, as you can see in my code, each page contains 20 href tags in total: 10 under td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"] and the remaining 10 under td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"].
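Incidentally, since both cell variants share the iCIMS_JobsTableField_1 class, the two selectors can usually be collapsed into a single contains() test. A minimal sketch of parsePage under that assumption (the rest of the spider stays the same):

    def parsePage(self, response):
        hxs = HtmlXPathSelector(response)
        # Both the odd and the even cells carry iCIMS_JobsTableField_1,
        # so one contains() test picks up all 20 links on the page.
        urls_list = hxs.select('//table[@class="iCIMS_JobsTable"]//td[contains(@class, "iCIMS_JobsTableField_1")]/a/@href').extract()
        for url in urls_list:
            yield Request(url.encode('utf-8'), callback=self.parseJob)

This does not change whether the page downloads correctly; it only removes one place where the two lists could get out of step.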
The problem is that Scrapy sometimes downloads the tags and sometimes does not, and I don't know what is happening. I mean, when we run the spider, sometimes the tags are downloaded and at other times it returns an empty list, like below:
First run:
2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[] >>>>>>>odddddd>>>>>>>>>>>>>>>>
[] >>>>>>>Evennnn>>>>>>>>>>>>>>>>
Second run:
2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[u'https://careers-preftherapy.icims.com/jobs/1836/job', u'https://careers-preftherapy.icims.com/jobs/1813/job', u'https://careers-preftherapy.icims.com/jobs/1763/job']>>>>>>>odddddd>>>>>>>>>>>>>>>>
[preftherapy.icims.com/jobs/1811/job', u'https://careers-preftherapy.icims.com/jobs/1787/job']>>>>>>>Evennnn>>>>>>>>>>>>>>>>
My question is why it sometimes downloads them and sometimes does not. Please try to reply; it would be really helpful for me.
Thanks in advance.....
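One way to narrow down the intermittent empty results is to dump the raw response whenever both selectors come back empty, so you can see whether the jobs table was actually present in the HTML Scrapy received on that run. A minimal sketch; the file-naming scheme is only illustrative:

    def parsePage(self, response):
        hxs = HtmlXPathSelector(response)
        urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
        urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
        if not urls_list_odd_id and not urls_list_even_id:
            # Save the offending page so it can be inspected offline;
            # the file name here is just an example.
            page_no = response.url.split('pr=')[-1]
            with open('empty_page_%s.html' % page_no, 'wb') as f:
                f.write(response.body)
        for url in urls_list_odd_id + urls_list_even_id:
            yield Request(url.encode('utf-8'), callback=self.parseJob)

If the saved HTML does not contain the iCIMS_JobsTable markup, the problem lies in what the server returned on that request (for example session or cookie handling) rather than in the XPath.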
Thanks very, very much for the reply. Actually I do get the XPath results when I check the URL separately with scrapy shell. The thing is, the first time I ran the spider I got the XPath results for page1 and page2, but when I ran it a second time I did not get the same results with the XPath, even though on its own it works fine..... – 2012-07-17 14:33:48
That's why I suggested that you don't run 'scrapy shell' on the URL by hand, but instead invoke the shell from inside the spider when you get no results - it will open the offending page – warvariuc 2012-07-17 14:58:00
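For reference, invoking the shell from inside a callback is normally done with inspect_response. A minimal sketch, assuming a Scrapy version where scrapy.shell.inspect_response(response, spider) is available (the exact signature has varied across versions):

    from scrapy.shell import inspect_response

    def parsePage(self, response):
        hxs = HtmlXPathSelector(response)
        urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
        if not urls_list_odd_id:
            # Drops into an interactive shell with this exact response loaded,
            # so the XPath can be tried against the page Scrapy really received.
            inspect_response(response, self)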