2012-06-26

I am new to Scrapy and I am trying to extract some information from a real-estate website. The site's home page has a search form (GET method). In my start_requests I tried to request the results page (recherche.php) directly, setting all of the GET parameters that I see in the address bar via the formdata argument. I also set the cookies I had, but that did not work either; I cannot get the form submission to work through Scrapy.

Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):
    name = "elyse_avenue"
    allowed_domains = ["http://www.elyseavenue.com/"]

    def start_requests(self):
        return [FormRequest(url="http://www.elyseavenue.com/recherche.php",
                            formdata={'recherche': 'recherche',
                                      'compteurLigne': '2',
                                      'numLigneCourante': '0',
                                      'inseeVille_0': '',
                                      'num_rubrique': '',
                                      'rechercheOK': 'recherche',
                                      'recherche_budget_max': '',
                                      'recherche_budget_min': '',
                                      'recherche_surface_max': '',
                                      'recherche_surface_min': '',
                                      'recherche_distance_km_0': '20',
                                      'recherche_reference_bien': '',
                                      'recherche_type_logement': '9',
                                      'recherche_ville_0': ''},
                            cookies={'PHPSESSID': '4e1d729f68d3163bb110ad3e4cb8ffc3',
                                     '__utma': '150766562.159027263.1340725224.1340725224.1340727680.2',
                                     '__utmc': '150766562',
                                     '__utmz': '150766562.1340725224.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
                                     '__utmb': '150766562.14.10.1340727680'},
                            callback=self.parseAnnonces)]

    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items


SPIDER = ElyseAvenueSpider()

When I run the spider it raises no errors, but the page it loads is not the right one (it says "please specify your search" and I get no results..):

2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider opened 
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2012-06-26 20:04:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2012-06-26 20:04:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2012-06-26 20:04:54+0200 [elyse_avenue] DEBUG: Crawled (200) <POST http://www.elyseavenue.com/recherche.php> (referer: None) 
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Closing spider (finished) 
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Dumping spider stats: 
    {'downloader/request_bytes': 808, 
    'downloader/request_count': 1, 
    'downloader/request_method_count/POST': 1, 
    'downloader/response_bytes': 7590, 
    'downloader/response_count': 1, 
    'downloader/response_status_count/200': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2012, 6, 26, 18, 4, 54, 924624), 
    'scheduler/memory_enqueued': 1, 
    'start_time': datetime.datetime(2012, 6, 26, 18, 4, 54, 559230)} 
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider closed (finished) 
2012-06-26 20:04:54+0200 [scrapy] INFO: Dumping global stats: 
    {'memusage/max': 27410432, 'memusage/startup': 27410432} 

Thanks for your help!

Answers


I would use FormRequest.from_response(), which does all of the work for you; filling in the fields by hand, you are still likely to miss some:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):

    name = "elyse_avenue"
    allowed_domains = ["elyseavenue.com"]  # I fixed this
    start_urls = ["http://www.elyseavenue.com/"]  # I added this

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formname='moteurRecherche',
                                        formdata={'recherche_distance_km_0': '20',
                                                  'recherche_type_logement': '9'},
                                        callback=self.parseAnnonces)

    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items
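For intuition, here is a rough stdlib sketch of what from_response does conceptually (this is an assumption about the mechanism for illustration, not Scrapy's actual implementation): it reads the form's input fields and their default values out of the page HTML, then merges in your formdata overrides, so hidden fields such as rechercheOK get submitted without you listing them. The sample form and its default values below are invented stand-ins, not the real page.

```python
# Rough illustration of FormRequest.from_response: collect the form's
# default <input> fields from the HTML, then apply user overrides.
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> elements in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            name = attrs.get("name")
            if name:
                # Missing value attribute means an empty default.
                self.fields[name] = attrs.get("value") or ""


def merge_form_data(html, overrides):
    """Return the form's default fields with user overrides applied."""
    collector = FormFieldCollector()
    collector.feed(html)
    merged = dict(collector.fields)
    merged.update(overrides)
    return merged


# Simplified stand-in for the search form (field names taken from the
# question; the hidden defaults are invented for illustration).
PAGE = """
<form name="moteurRecherche" action="recherche.php">
  <input type="hidden" name="rechercheOK" value="recherche">
  <input type="hidden" name="compteurLigne" value="2">
  <input type="text" name="recherche_distance_km_0" value="">
  <input type="text" name="recherche_type_logement" value="">
</form>
"""

data = merge_form_data(PAGE, {"recherche_distance_km_0": "20",
                              "recherche_type_logement": "9"})
```

The point is that the hidden fields ride along automatically, which is exactly what hand-built formdata dicts tend to get wrong.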
It works! Thank you very much!! – Serphone

This is much better than my hacky approach! – Edwardr

What if you have rules? The parse method you define here overrides BaseSpider's parse method. – OfLettersAndNumbers


Your log output shows the spider making a POST request to http://www.elyseavenue.com/recherche.php, but you say the form uses GET.

Sure enough, if you make a POST request to that URL and grep for "please specify your search":

➜ curl -d "" http://www.elyseavenue.com/recherche.php | grep "Merci de préciser votre recherche." 
% Total % Received % Xferd Average Speed Time Time  Time Dload Upload Total Spent Left Speed 
100 37494 0 37494 0  0 54582  0 --:--:-- --:--:-- --:--:-- 60866 
    <span class="Nbannonces">Merci de préciser votre recherche.</span> 

FormRequest is a subclass of Request, which lets you specify the request method. You should specify GET, like so:

FormRequest(url="http://www.elyseavenue.com/recherche.php",
            formdata={'recherche': 'recherche',
                      'compteurLigne': '2',
                      'numLigneCourante': '0',
                      'inseeVille_0': '',
                      'num_rubrique': '',
                      'rechercheOK': 'recherche',
                      'recherche_budget_max': '',
                      'recherche_budget_min': '',
                      'recherche_surface_max': '',
                      'recherche_surface_min': '',
                      'recherche_distance_km_0': '20',
                      'recherche_reference_bien': '',
                      'recherche_type_logement': '9',
                      'recherche_ville_0': ''},
            cookies={'PHPSESSID': '4e1d729f68d3163bb110ad3e4cb8ffc3',
                     '__utma': '150766562.159027263.1340725224.1340725224.1340727680.2',
                     '__utmc': '150766562',
                     '__utmz': '150766562.1340725224.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
                     '__utmb': '150766562.14.10.1340727680'},
            callback=self.parseAnnonces,
            method="GET")
I tried forcing method="GET" like you said, but it didn't change anything. I still get a POST request and I don't know why.. – Serphone
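If forcing the method on FormRequest does not behave as expected, one possible workaround (a sketch, on the assumption that the site simply reads its search parameters from the query string) is to urlencode the fields yourself and issue a plain Request against the resulting URL. The query-string construction is pure stdlib and is shown on its own here:

```python
# Workaround sketch: build the GET URL by hand instead of relying on
# FormRequest's method argument. Only a few representative fields from
# the question are included.
from urllib.parse import urlencode

BASE = "http://www.elyseavenue.com/recherche.php"

params = {
    "rechercheOK": "recherche",
    "recherche_distance_km_0": "20",
    "recherche_type_logement": "9",
}

# urlencode turns the dict into "key=value" pairs joined by "&".
url = BASE + "?" + urlencode(params)

# In the spider you would then yield a plain GET request, e.g.:
#   yield Request(url, callback=self.parseAnnonces)
```

Since a plain Request defaults to GET, this sidesteps FormRequest entirely while sending the same parameters.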