2013-01-11

Scrapy - 2 classes - blank item fields

I am crawling every page of a website, but I have run into the following problem.

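If a page contains both the class "td-cell align-right grey" and the class "td-cell align-right grey row-border", both text() values should be written to item['price'].
But the text() is only written to item['price'] when the page contains only "td-cell align-right grey row-border".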


Code:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http.request import Request 
from Test01.items import Test01Item 
from scrapy.utils.url import urljoin_rfc 
from scrapy.utils.response import get_base_url 
import urlparse 



class ScrapyOrgSpider(BaseSpider):
    name = "oeticket"
    allowed_domains = ["oeticket.com"]
    start_urls = ["http://www.oeticket.com/de/suche/?search_string=amaretto"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []

        next_page = hxs.select("//li[@class='next-page navigation']/a/@href").extract()
        abs_page = []
        for g in next_page:
            abs_page.append("http://oeticket.com" + g)

        if not not abs_page:
            for e in abs_page:
                yield Request(e, self.parse)

        next_event = hxs.select("//li[@class='event-item vevent']/a/@href").extract()
        abs_event = []
        for it in next_event:
            abs_event.append("http://oeticket.com" + it)

        if not not abs_event:
            for s in abs_event:
                yield Request(s, self.parse)

        deeper = hxs.select("//li[@class='performance-item vevent']/a/@href").extract()
        abs_deeper = []
        for c in deeper:
            abs_deeper.append("http://oeticket.com" + c)

        if not not abs_deeper:
            for d in abs_deeper:
                yield Request(d, self.parse)

        posts = hxs.select("//ul[@class='grid_10 horizontal-list clearfix']")
        preis = hxs.select("//tbody/tr")

        for post in posts:
            item = Test01Item()
            item["when"] = post.select("li[@class='when']/p/abbr/text()").extract() + post.select("li[@class='when']/h2/text()").extract()
            items.append(item)

        for post in posts:
            item = Test01Item()
            item["what"] = post.select("li[@class='what']/h2/text()").extract()
            items.append(item)

        for post in posts:
            item = Test01Item()
            item["where"] = post.select("li[@class='where']/h2/text()").extract()
            items.append(item)

        for prei in preis:
            item = Test01Item()
            item['url'] = response.url
            item['price'] = prei.select("td[@class='ticket_price td-cell ucase black strong align-right']/text()").extract()
            item['price'] = prei.select("td[@class='ticket_price td-cell ucase black strong align-right row-border']/text()").extract()
            item["func"] = prei.select("td[@class='td-cell align-right gray']/text()").extract()
            item["func"] = prei.select("td[@class='td-cell align-right gray row-border']/text()").extract()

            items.append(item)

        for item in items:
            yield item


Result:

{"when": ["Donnerstag, 7. Feb 2013 ", "20:00"]}, 
{"what": ["Amaretto"]}, 
{"where": ["kleines theater"]}, 
{"url": "http://www.oeticket.com/de/tickets/amaretto-salzburg-kleines-theater-482435/performance.html", "price": [], "func": []}, 
{"url": "http://www.oeticket.com/de/tickets/amaretto-salzburg-kleines-theater-482435/performance.html", "price": [" 15,90 EUR "], "func": [" Erm\u00e4\u00dfigung lt. Info - ACHTUNG: Ausweiskontrolle! "]}, 


Expected result:

{"when": ["Donnerstag, 7. Feb 2013 ", "20:00"]}, 
{"what": ["Amaretto"]}, 
{"where": ["kleines theater"]}, 
{"url": "http://www.oeticket.com/de/tickets/amaretto-salzburg-kleines-theater-482435/performance.html", "price": [" 22,50 EUR "], "func": [" Normalpreis "]}, 
{"url": "http://www.oeticket.com/de/tickets/amaretto-salzburg-kleines-theater-482435/performance.html", "price": [" 15,90 EUR "], "func": [" Erm\u00e4\u00dfigung lt. Info - ACHTUNG: Ausweiskontrolle! "]}, 

How can I solve this problem with the blank item fields? Thanks!

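To help us analyze the problem, please post your complete spider code together with the URL of the site being scraped and your expected output. :) – Talvalin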

Thank you for trying to help me. I have updated the post. – TheFilipo

Answer

1

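The second assignment to item['price'] always overwrites the first one, even when it extracted nothing. You have to check whether the extracted list is empty (length 0) and only then fall back to the second selector: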

item['price'] = prei.select("td[@class='ticket_price td-cell ucase black strong align-right']/text()").extract() 
if len(item['price']) == 0: 
    item['price'] = prei.select("td[@class='ticket_price td-cell ucase black strong align-right row-border']/text()").extract()
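The same check is needed for item["func"], which is overwritten in the same way. As a rough sketch, the fallback could be pulled into a small helper. The name `first_non_empty` is made up for illustration and is not part of Scrapy, and the `rows` data below only simulates what the two `extract()` calls return for the two table rows in the question:

```python
def first_non_empty(*candidates):
    """Return the first candidate list that has at least one element, else []."""
    for values in candidates:
        if values:
            return values
    return []


# Simulated extract() results (an assumption for this demo, not real Scrapy output):
# the first row only matches the class without "row-border",
# the second row only matches the class with "row-border".
rows = [
    {"plain": [" 22,50 EUR "], "row_border": []},
    {"plain": [], "row_border": [" 15,90 EUR "]},
]

# Pick whichever selector actually produced text for each row.
prices = [first_non_empty(row["plain"], row["row_border"]) for row in rows]
print(prices)
```

In the spider you would pass the two `prei.select(...).extract()` results for price to the helper, and likewise the two for func. Note that both selectors are evaluated before the helper picks one, which should be fine for a table of this size.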