Stuck scraping a specific table with Scrapy

The table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled "June Stats".

Here is my spider:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TtscrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['Tips'] = tips
            strike = Decimal(int(won)/tips) * 100
            strike = str(round(strike,2))
            item['Strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['Profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ','')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['Yield'] = '<span style="color: #40AA40">' + yield_str + '%</span>'
            item['Site'] = 'Bet District'
            yield item

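This gives me a "list index out of range" error on the first variable (name).

However, when I rewrite my XPath selectors to start with //, e.g.:

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

the spider runs, but it scrapes the first tipster over and over.

I think this has something to do with the table not having a thead, but containing th tags in the first tr of the tbody.

Any help is much appreciated.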

---------- EDIT ----------

In response to Lars's suggestion:

I tried using what you suggested, but I still get a "list index out of range" error:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TtscrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            yield item

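Also, am I right to assume that, doing things this way, multiple for loops are needed, since not all cells have the same class?

I also tried doing things without a for loop, but in that case it again scraped only the first tipster, multiple times.

Thanks

Source

2015-06-10 preach

When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]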

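the XPath expression starts with td, so it is evaluated relative to the context node held in the variable sel (that is, the current tr element out of the set of tr elements that the for loop iterates over).

However, when you say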

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

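the XPath expression starts with //td, which selects td elements anywhere in the document. This is not relative to sel, so the result is the same on every iteration of the for loop. That is why it scrapes the first tipster over and over.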
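As a rough illustration of the difference, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The markup, the names "Alice" and "Bob", and the URLs are invented for the example. ElementTree does not accept a leading // on an element, so the document-wide lookup is mimicked by querying from the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the table described in the question.
HTML = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
</body></html>
"""
root = ET.fromstring(HTML)

# Skip the header row by slicing, purely for this illustration.
rows = root.findall(".//table/tr")[1:]

# Relative lookup: evaluated against the current row, so each row gives its own name.
relative = [row.find("td[@class='tipst']/a").text for row in rows]

# Document-wide lookup: ignores the current row, so [0] is always the first match.
document_wide = [root.findall(".//td[@class='tipst']/a")[0].text for row in rows]

print(relative)       # one name per row
print(document_wide)  # the first name, repeated
```

The second list repeats the first tipster once per row, which matches the behaviour described in the question.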

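Why does the first XPath expression fail with a "list index out of range" error? Try evaluating the XPath expression one location step at a time, printing the result of each step, and you will soon find the problem. In this case, it appears to be because the first tr child of table[1] has no td children (only th children). So xpath() selects nothing, extract() returns an empty list, and you then try to reference the first item of that empty list, which gives the "list index out of range" error.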
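The step-by-step check might look something like the sketch below. It again uses xml.etree.ElementTree in place of Scrapy, with invented markup, so treat it as an approximation of what you would see in a Scrapy shell rather than the exact output.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a header row made of th cells, followed by data rows.
HTML = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
</body></html>
"""
root = ET.fromstring(HTML)

# Evaluate one location step at a time and record how many nodes each step selects.
steps = []
for i, row in enumerate(root.findall(".//table/tr"), start=1):
    cells = row.findall("td")                      # step 1: td children of this row
    links = row.findall("td[@class='tipst']/a")    # step 2: links inside tipst cells
    steps.append((i, len(cells), len(links)))
print(steps)

# The header row selects nothing, so indexing the empty result raises IndexError.
header = root.findall(".//table/tr")[0]
try:
    header.findall("td[@class='tipst']/a")[0]
except IndexError as exc:
    error = str(exc)
print(error)
```

Row 1 reports zero td cells and zero links, and indexing its empty result is what raises the error.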

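To fix this, you could change the XPath expression in your for loop so that it loops only over those tr elements that have td children:

for sel in response.xpath('//table[1]/tr[td]'):

You could get fancier and require a td of the right class:

for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):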
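A minimal sketch of the first variant, once more with xml.etree.ElementTree standing in for Scrapy and with invented markup. ElementTree's XPath subset supports the tr[td] predicate but not the nested tr[td[@class="tipst"]] form, so only the simpler one is shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the header row has only th cells.
HTML = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
</body></html>
"""
root = ET.fromstring(HTML)

tipsters = []
# tr[td] keeps only rows that have at least one td child, so the header row is skipped.
for row in root.findall(".//table/tr[td]"):
    link = row.find("td[@class='tipst']/a")  # still relative to the row
    tipsters.append((link.text, link.get("href")))

print(tipsters)
```

With the header row filtered out, every remaining row has a matching link, so nothing in the loop body indexes an empty result.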

Source

2015-06-10 16:25:38 LarsH

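Thanks for your reply, Lars. I've added an edit after trying to implement this, but still no luck! – preach

@preach, even though we've changed the XPath expression in the for loop statement, sel still holds tr elements, not td elements. That's because XPath predicates (the things in square brackets) don't denote further location steps; they just filter the tr's you've already selected. So you need to change the XPath for 'name' to 'td[@class="tipst"]/a/text()', not just 'a/text()'. – LarsH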
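To make the point about predicates concrete, here is a short sketch using xml.etree.ElementTree as a stand-in for Scrapy selectors, with invented markup. It checks what kind of element the filtered loop yields and what a lookup that starts at a finds from there.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not the real page.
HTML = """
<html><body>
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>10</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>7</td></tr>
</table>
</body></html>
"""
root = ET.fromstring(HTML)

# The predicate [td] filters the selected tr elements; it does not step into td.
rows = root.findall(".//table/tr[td]")
tags = [row.tag for row in rows]

# 'a' is not a direct child of tr, so this relative lookup finds nothing.
direct = rows[0].find("a")

# Going through td first reaches the link.
via_td = rows[0].find("td[@class='tipst']/a")

print(tags, direct, via_td.text)
```

The selected elements are still tr, which is why the path for the name has to include the td step.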
