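Can't get through a form with Scrapy
I'm new to Scrapy and I'm trying to get some information from a real estate website. The site has a home page with a search form (GET method). In my start_requests I try to go straight to the results page (recherche.php), passing in the formdata parameter all the GET parameters I see in the address bar. I also set the cookies I had, but that didn't work either.
Here is my spider: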
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):
    name = "elyse_avenue"
    allowed_domains = ["http://www.elyseavenue.com/"]

    def start_requests(self):
        return [FormRequest(url="http://www.elyseavenue.com/recherche.php",
                            formdata={'recherche':'recherche',
                                      'compteurLigne':'2',
                                      'numLigneCourante':'0',
                                      'inseeVille_0':'',
                                      'num_rubrique':'',
                                      'rechercheOK':'recherche',
                                      'recherche_budget_max':'',
                                      'recherche_budget_min':'',
                                      'recherche_surface_max':'',
                                      'recherche_surface_min':'',
                                      'recherche_distance_km_0':'20',
                                      'recherche_reference_bien':'',
                                      'recherche_type_logement':'9',
                                      'recherche_ville_0':''
                                      },
                            cookies={'PHPSESSID':'4e1d729f68d3163bb110ad3e4cb8ffc3',
                                     '__utma':'150766562.159027263.1340725224.1340725224.1340727680.2',
                                     '__utmc':'150766562',
                                     '__utmz':'150766562.1340725224.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
                                     '__utmb':'150766562.14.10.1340727680'
                                     },
                            callback=self.parseAnnonces
                            )]

    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items

SPIDER = ElyseAvenueSpider()
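When I run the spider there are no errors, but the page that gets loaded is not the right one (it says "Please specify your search" and I don't get any results):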
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider opened
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-06-26 20:04:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-06-26 20:04:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-06-26 20:04:54+0200 [elyse_avenue] DEBUG: Crawled (200) <POST http://www.elyseavenue.com/recherche.php> (referer: None)
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Closing spider (finished)
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Dumping spider stats:
{'downloader/request_bytes': 808,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 7590,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 6, 26, 18, 4, 54, 924624),
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 6, 26, 18, 4, 54, 559230)}
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider closed (finished)
2012-06-26 20:04:54+0200 [scrapy] INFO: Dumping global stats:
{'memusage/max': 27410432, 'memusage/startup': 27410432}
Thanks for your help!
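One thing stands out in the log: the request was sent as a POST, while the search form is described as using GET. A GET form sends its fields in the URL's query string rather than in the request body. Here is a minimal sketch of building such a URL with the standard library only. The helper name `build_search_url` is made up for illustration, and the empty fields from the spider are left out for brevity, which assumes the server does not require them.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.elyseavenue.com/recherche.php"


def build_search_url(base_url, params):
    """Return base_url with params appended as a URL-encoded query string.

    Hypothetical helper, not part of Scrapy.
    """
    # Use '&' if the URL already carries a query string, '?' otherwise.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


# Non-empty fields copied from the spider's formdata.
# Assumption: the empty fields can be omitted.
params = {
    "recherche": "recherche",
    "compteurLigne": "2",
    "numLigneCourante": "0",
    "rechercheOK": "recherche",
    "recherche_distance_km_0": "20",
    "recherche_type_logement": "9",
}

url = build_search_url(SEARCH_URL, params)
print(url)
```

A URL built this way could be passed to a plain Request. As far as I know, FormRequest also accepts method='GET', in which case formdata is appended to the URL instead of being sent as a body. Whether either is enough for this particular site is untested.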
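It works! Thank you very much!! – Serphone
This is much better than my clumsy approach! – Edwardr
What if you have rules? The parse method you have will override BaseSpider's parse method. – OfLettersAndNumbers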