
Scrapy: storing items across multiple FormRequest pages? meta? (python)

So I have my scraper working with a single form request, and I can even see the terminal print out the scraped data with this single-page version:

from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from myproject.items import swimItem  # the item class defined in my items.py

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "10025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        return items

However, when I add a second FormRequest and callback, it only scrapes the second set of items. It also only prints the scrape from the second page, as if it skips the first page's scrape entirely:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "10025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50",
                      "foolOldPerl": ""},
            callback=self.swimparse2, dont_click=True)]

    def swimparse2(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        return items

My guesses: A) How can I export or pass the items from the first scrape into the second scrape, so that I end up with all the item data together, as if it had been scraped from a single page? B) Or, if the first scrape is being skipped entirely, how do I stop it from being skipped and pass those items on to the next one?

Thanks!

PS / Extra: I have already tried using:

item = response.request.meta = ["item] 
item = response.request.meta = [] 
item = response.request.meta = ["names":item, "age":item, "free":item, "team":item] 

All of these produce KeyErrors or raise other exceptions.

I've also tried modifying the FormRequest to include meta={"names": item, "age": item, "free": item, "team": item}. That doesn't raise an error, but nothing gets scraped or stored.
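
For reference, the meta pattern I was trying to copy is, as far as I understand it, to stash the data on the request with meta={...} and read it back from response.meta in the next callback. Here is a trimmed-down sketch of what I was aiming for (same fields and form data as above, only the two callbacks of the class shown; I have not gotten this working myself):

    # inside the same MySpider class as above
    def swimparse1(self, response):
        hxs = Selector(response)
        items = []
        for row in hxs.xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

        # hand the first page's items to the next callback via the request's meta dict
        return FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50", "foolOldPerl": ""},
            meta={"items": items},
            callback=self.swimparse2, dont_click=True)

    def swimparse2(self, response):
        # pick the first page's items back up and keep appending to them
        items = response.meta["items"]
        for row in Selector(response).xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        return items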

EDIT: I tried using yield, like this:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["www.website.com"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
            yield [FormRequest.from_response(
                response, formname="TTForm",
                formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                          "lowage": "", "highage": "", "sex": "W",
                          "StrkDist": "10025", "How_Many": "50",
                          "foolOldPerl": ""},
                callback=self.parse, dont_click=True)]

        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)

            yield [FormRequest.from_response(
                response, formname="TTForm",
                formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                          "lowage": "", "highage": "", "sex": "W",
                          "StrkDist": "40025", "How_Many": "50",
                          "foolOldPerl": ""},
                callback=self.parse, dont_click=True)]

Still nothing gets scraped. I know the XPaths are correct, because when I only try to scrape a single form (with return instead of yield) it works perfectly. I've read the Scrapy docs and they just aren't very helpful :(

Answer


You are missing a really simple solution: change return to yield.

Then you don't have to accumulate items in a list; just yield as many items and requests as you want from the function, and Scrapy will take care of the rest. Here is the suggested example from the Scrapy docs:

from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)
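
Applied to your TTForm spider, it would look roughly like the sketch below (untested, keeping your field names and form data, with imports as in your spider, and assuming the results page still contains the TTForm as in your second version). Note that each item and each request is yielded on its own, never wrapped in a list:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]

    def parse(self, response):
        # submit the first form (freestyle results)
        yield FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "10025", "How_Many": "50", "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)

    def swimparse1(self, response):
        # yield the freestyle items one by one instead of collecting them in a list
        for row in Selector(response).xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            yield item

        # then submit the second form -- the request itself, not a list
        yield FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W",
                      "StrkDist": "40025", "How_Many": "50", "foolOldPerl": ""},
            callback=self.swimparse2, dont_click=True)

    def swimparse2(self, response):
        # yield the items from the second page the same way
        for row in Selector(response).xpath(".//tr")[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            yield item

Everything you yield from either callback goes to the same item pipeline / feed export, so the data from both pages ends up together.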

Thanks. Unfortunately yield doesn't seem to be helping. For debugging purposes I've included an open_in_browser command in each part of the code. The code stops (no browser opens) and I get this: ERROR: Spider must return Request, BaseItem or None, got 'list'. Using yield in place of the lower two returns, in any combination, the browser opens (the code executes?) but no scraping happens. – InfinteScroll


Never mind the above comment; I did just replace the returns with yield. This time I also moved the yields further down so they execute on each pass of the 'for' loop. Still nothing gets scraped, though, and this error is printed roughly once per loop iteration: ERROR: Spider must return Request, BaseItem or None, got 'list' – InfinteScroll


I don't know how else to show you that this is the way to go :) I only ever write yields in my spiders and they always work. Make sure to print the item before yielding it to confirm it isn't None, and make sure you change **all** of the returns to yields. –
