Scrapy/Python: processing values in yield

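I am trying to write a crawler with Scrapy/Python that reads some values from a page.

I then want the crawler to store the highest and lowest of these values in separate fields. So far I am able to read the values from the page (see the code below), but I am not sure how to calculate the lowest and highest values and store them in separate fields.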

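For example, say the crawler reads a page and returns these values:

burvale score = 75.25
richmond score = 85.04
somano score = '' (value missing)
tucson score = 90.67
cloud score = 50.00

So I would like to populate:

'highestscore': 90.67
'lowestscore': 50.00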

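How do I do that? Do I need to use an array, i.e. put all the values in an array and then pick the highest and lowest?

Also, please note that there are 2 yield statements in my code. The bottom yield supplies the URLs to crawl, and the first yield actually scrapes/collects the values from each URL supplied by the bottom yield.

Any help is much appreciated. Please provide code samples if you can.

This is my code so far. I store -1 in case of a missing value.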

from scrapy import Request, Selector, Spider


class MySpider(Spider):  # originally BaseSpider, the older name for Spider
    name = "courses"
    start_urls = ['http://www.example.com/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        hxs = Selector(response)
        # for courses in response.xpath(response.body):
        for courses in response.xpath("//meta"):
            # First yield: one item holding the values scraped from this page
            yield {
                'pagetype': courses.xpath('//meta[@name="pagetype"]/@content').extract_first(),
                'pagefeatured': courses.xpath('//meta[@name="pagefeatured"]/@content').extract_first(),
                'pagedate': courses.xpath('//meta[@name="pagedate"]/@content').extract_first(),
                'pagebanner': courses.xpath('//meta[@name="pagebanner"]/@content').extract_first(),
                'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(),
                'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(),
                'pagedescription': courses.xpath('//meta[@name="pagedescription"]/@content').extract_first(),
                'pageid': courses.xpath('//meta[@name="pageid"]/@content').extract_first(),

                # An empty string is falsy, so a missing score becomes -1
                'courseatarburvale': float(courses.xpath('//meta[@name="courseatar-burvale"]/@content').extract_first('').strip() or -1),
                'courseatarrichmond': float(courses.xpath('//meta[@name="courseatar-richmond"]/@content').extract_first('').strip() or -1),
                'courseatarsomano': float(courses.xpath('//meta[@name="courseatar-somano"]/@content').extract_first('').strip() or -1),
                'courseatartucson': float(courses.xpath('//meta[@name="courseatar-tucson"]/@content').extract_first('').strip() or -1),
                'courseatarcloud': float(courses.xpath('//meta[@name="courseatar-cloud"]/@content').extract_first('').strip() or -1),
                # 'highestscore': ??????
                # 'lowestscore': ??????
            }
        # Second yield: follow each course link and parse it with this same method
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
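
The float(... .strip() or -1) expression repeated in the spider above relies on an empty string being falsy, so a blank content attribute falls through to -1. A minimal sketch of that idiom as a standalone helper follows; the name parse_score and its missing parameter are hypothetical, not part of Scrapy or the original code:

```python
def parse_score(raw, missing=-1.0):
    """Convert a raw meta-tag value to float, or return `missing` when it is blank.

    `parse_score` and `missing` are illustrative names only.
    """
    text = (raw or '').strip()  # treat None the same as an empty string
    return float(text) if text else missing


print(parse_score(' 75.25 '))  # a normal value, surrounding whitespace ignored
print(parse_score(''))         # a missing value falls back to -1.0
```

Passing missing=None instead of the default would let later code distinguish "no score" from a real number without relying on a sentinel.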

Source

2017-07-28 Slyper

I would probably break up this part of the code:

yield { 
    'pagetype': courses.xpath('//meta[@name="pagetype"]/@content').extract_first(), 
    'pagefeatured': courses.xpath('//meta[@name="pagefeatured"]/@content').extract_first(), 
    'pagedate': courses.xpath('//meta[@name="pagedate"]/@content').extract_first(), 
    'pagebanner': courses.xpath('//meta[@name="pagebanner"]/@content').extract_first(), 
    'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(), 
    'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(), 
    'pagedescription': courses.xpath('//meta[@name="pagedescription"]/@content').extract_first(), 
    'pageid': courses.xpath('//meta[@name="pageid"]/@content').extract_first(), 

    'courseatarburvale': float(courses.xpath('//meta[@name="courseatar-burvale"]/@content').extract_first('').strip() or -1), 
    'courseatarrichmond': float(courses.xpath('//meta[@name="courseatar-richmond"]/@content').extract_first('').strip() or -1), 
    'courseatarsomano': float(courses.xpath('//meta[@name="courseatar-somano"]/@content').extract_first('').strip() or -1), 
    'courseatartucson': float(courses.xpath('//meta[@name="courseatar-tucson"]/@content').extract_first('').strip() or -1), 
    'courseatarcloud': float(courses.xpath('//meta[@name="courseatar-cloud"]/@content').extract_first('').strip() or -1), 
    # 'highestscore': ??????
    # 'lowestscore': ??????
}

into this:

item = { 
    'pagetype': courses.xpath('//meta[@name="pagetype"]/@content').extract_first(), 
    'pagefeatured': courses.xpath('//meta[@name="pagefeatured"]/@content').extract_first(), 
    'pagedate': courses.xpath('//meta[@name="pagedate"]/@content').extract_first(), 
    'pagebanner': courses.xpath('//meta[@name="pagebanner"]/@content').extract_first(), 
    'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(), 
    'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(), 
    'pagedescription': courses.xpath('//meta[@name="pagedescription"]/@content').extract_first(), 
    'pageid': courses.xpath('//meta[@name="pageid"]/@content').extract_first(), 
} 

scores = { 
    'courseatarburvale': float(courses.xpath('//meta[@name="courseatar-burvale"]/@content').extract_first('').strip() or -1), 
    'courseatarrichmond': float(courses.xpath('//meta[@name="courseatar-richmond"]/@content').extract_first('').strip() or -1), 
    'courseatarsomano': float(courses.xpath('//meta[@name="courseatar-somano"]/@content').extract_first('').strip() or -1), 
    'courseatartucson': float(courses.xpath('//meta[@name="courseatar-tucson"]/@content').extract_first('').strip() or -1), 
    'courseatarcloud': float(courses.xpath('//meta[@name="courseatar-cloud"]/@content').extract_first('').strip() or -1), 
} 

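# Keep only the real scores (-1 marks a missing value), in ascending order.
# Note: values is empty if every score is missing, and indexing it then raises IndexError.
values = sorted(x for x in scores.values() if x > 0)
scores.update({
    'highestscore': values[-1],
    'lowestscore': values[0],
})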

item.update(scores) 
yield item
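
To see the filtering step in isolation, here is a small sketch that runs the same sorted(...) logic on the example values from the question, outside of Scrapy. The function name with_extremes is hypothetical, and the sketch assumes at least one score is present:

```python
def with_extremes(scores):
    """Return a copy of `scores` plus 'highestscore' and 'lowestscore'.

    Hypothetical helper for illustration; -1 marks a missing score,
    as in the question. Assumes at least one real score exists.
    """
    values = sorted(v for v in scores.values() if v > 0)  # drop the -1 placeholders
    result = dict(scores)
    result['highestscore'] = values[-1]  # last element of the ascending list
    result['lowestscore'] = values[0]    # first element of the ascending list
    return result


example = {
    'courseatarburvale': 75.25,
    'courseatarrichmond': 85.04,
    'courseatarsomano': -1,  # missing on the page
    'courseatartucson': 90.67,
    'courseatarcloud': 50.00,
}
item = with_extremes(example)
print(item['highestscore'], item['lowestscore'])  # 90.67 50.0
```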

Source

2017-07-28 06:01:06

Thanks @Tomáš Linhart, I was working on a similar approach.... I will report back shortly... cheers – Slyper

Hi @Tomáš Linhart, when I tried your suggestion I got this error.... _IndexError: list index out of range_ Any idea? It complains about this line: 'highestatar': values[-1], – Slyper

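@Slyper Probably the 'values' list is empty, which can happen if no scores were extracted from the page (i.e. 'scores' contains only -1). So change the highest score assignment to 'highestscore': values[-1] if values else None and likewise for the lowest score.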
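
The guard suggested in this comment could look like the sketch below. As an alternative to the conditional expression it uses the built-in max and min with default=None (available since Python 3.4); the function name score_extremes is hypothetical:

```python
def score_extremes(scores):
    """Return (highest, lowest) over the real scores, or (None, None) if none exist.

    Hypothetical helper; -1 marks a missing score, as in the question.
    """
    values = [v for v in scores.values() if v > 0]
    # default=None avoids a ValueError when the list is empty (Python 3.4+)
    return max(values, default=None), min(values, default=None)


print(score_extremes({'a': 75.25, 'b': -1, 'c': 50.0}))  # (75.25, 50.0)
print(score_extremes({'a': -1, 'b': -1}))                # (None, None)
```

With this variant no sorting is needed, and a page without any scores yields None for both fields instead of raising.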
