Python Scrapy XPath?

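For a non-profit university assignment, I am trying to scrape data from www.rateyourmusic.com using the Scrapy framework in Python. I have had limited success: I can scrape the artist's name from an artist page, but I cannot work out the XPaths for the other information (date of birth, nationality). Does anyone know the correct XPaths for these elements? Here is my parse method, which at least works for the artist's name.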

def parse_dir_contents(self, response):
    item = rateyourmusicartist()

    for sel in response.xpath('//div/div/div/div/table/tbody/tr/td'):
        item['dateofbirth'] = sel.xpath('td/text()').extract()  # these two selectors aren't working
        item['nationality'] = sel.xpath('td/a/text()').extract()

    for sel in response.xpath('//div/div/div/div/div/h1'):
        item['name'] = sel.xpath('text()').extract()  # this is the one that works

    yield item

Here is the page I am scraping: http://rateyourmusic.com/artist/kanye_west


2015-10-22 user3545370

Remove the 'td/' from the two XPaths that currently aren't working. Then they should work. – gtlambert

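Thanks for the note. Unfortunately I had already tried that and it doesn't work; I added td/ to see whether it would make a difference. Would parsing in two separate loops make a difference? I assume I will have to, since the values come from different parts of the page. – user3545370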

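Your problem is that you are working from the virtual DOM (I guess you looked at the browser inspector to get the HTML structure). You have to check the real source of the page. For example, there is no tbody tag in the page source; it exists only in the virtual DOM that the browser builds.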

Here is a fragment of the real HTML you get for the artist page URL (you can see it if you open the page source).

<table class="artist_info"> 
<tr><td><div class="info_hdr">Born</div> June 8, 1977, <a class="location" href="/location/Atlanta/GA/United States">Atlanta, GA, United States</a></td></tr> 
<tr><td><div class="info_hdr">Currently</div><a class="location" href="/location/Hidden Hills/CA/United States">Hidden Hills, CA, United States</a></td></tr> 
</table>

To get the date of birth, run this XPath (the text content of the table's first row)

//table[@class='artist_info']/tr[1]/td/text()

Result

'June 8, 1977,' (the raw text node also carries surrounding whitespace)

To get the current location, run this XPath

//table[@class='artist_info']/tr[2]/td/a/text()

Result (the content of the table's second row)

'Hidden Hills, CA, United States'


2015-10-22 13:01:27

Brilliant, works like a charm, thank you. – user3545370
