2016-07-31 28 views
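How to parse text from HTML table elements

I am currently writing a small test web scraper using the Python requests and lxml libraries. I am trying to extract the text from the rows of a table on this site, using XPath to uniquely identify the table. The table itself can only be identified by its class name, and that class name is not unique, so to specify the table I have to go through the parent div element. The table in question lists the season order, filming dates, and air dates for the show Game of Thrones, and I am trying to select it with the following path: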

tree.xpath('//div[@id = "mw-content-text"]//table[@class = "wikitable"]//text()') 

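For some reason, when I print the result of this path in the shell, it returns an empty list. I expected it to show all of the text in the table, which I wanted to do to confirm that I can actually reach the content; what I ultimately need, though, is to print each row of the table.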

Is there something wrong with this XPath? If so, what is the correct way to print the table contents?

Answer

wikitable is too broad a class to distinguish the tables on a wiki page from one another.
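
To illustrate the point, here is a minimal sketch on a made-up HTML fragment (the fragment and its cell contents are hypothetical, not taken from the Wikipedia page). It suggests two things: a class-only XPath matches every table that carries the class, and an exact `@class = "wikitable"` comparison skips any table whose class attribute holds additional tokens. Whether the second effect explains the empty list in the question is an assumption; the thread does not confirm it.

```python
from lxml.html import fromstring

# Hypothetical fragment: three tables that all use the "wikitable" class.
html = """
<div id="mw-content-text">
  <table class="wikitable"><tr><td>cast</td></tr></table>
  <table class="wikitable"><tr><td>schedule</td></tr></table>
  <table class="wikitable sortable"><tr><td>ratings</td></tr></table>
</div>
"""
root = fromstring(html)

# Exact string comparison against the whole class attribute.
exact = root.xpath('//table[@class = "wikitable"]')

# Token-based match: treats the class attribute as a space-separated list.
token = root.xpath(
    '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
)

print(len(exact))  # 2: the "wikitable sortable" table is skipped
print(len(token))  # 3: every table carrying the class is matched
```

Either way, the class alone does not single out one table, so some other anchor is needed.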

I would instead rely on the preceding "Adaptation schedule" heading:

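import requests
from lxml.html import fromstring

# Download the page and parse it into an lxml element tree.
url = "https://en.wikipedia.org/wiki/Game_of_Thrones"
response = requests.get(url)
root = fromstring(response.content)

# Take the first table that follows the "Adaptation schedule" h3 heading.
table = root.xpath(".//h3[span = 'Adaptation schedule']/following-sibling::table")[0]
# Skip the header row, then print the text of every cell in each row.
for row in table.xpath(".//tr")[1:]:
    print([cell.text_content() for cell in row.xpath(".//td")])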

Prints:

['Season 1', 'March 2, 2010[52]', 'Second half of 2010', 'April 17, 2011', 'June 19, 2011', 'A Game of Thrones'] 
['Season 2', 'April 19, 2011[53]', 'Second half of 2011', 'April 1, 2012', 'June 3, 2012', 'A Clash of Kings and some early chapters from A Storm of Swords[54]'] 
['Season 3', 'April 10, 2012[55]', 'Second half of 2012', 'March 31, 2013', 'June 9, 2013', 'About the first two-thirds of A Storm of Swords[56][57]'] 
['Season 4', 'April 2, 2013[58]', 'Second half of 2013', 'April 6, 2014', 'June 15, 2014', 'The remaining one-third of A Storm of Swords and some elements from A Feast for Crows and A Dance with Dragons[59]'] 
['Season 5', 'April 8, 2014[60]', 'Second half of 2014', 'April 12, 2015', 'June 14, 2015', 'A Feast for Crows, A Dance with Dragons and original content,[61] with some late chapters from A Storm of Swords[62] and elements from The Winds of Winter[63][64]'] 
['Season 6', 'April 8, 2014[60]', 'Second half of 2015', 'April 24, 2016', 'June 26, 2016', 'Original content and outlined from The Winds of Winter,[65][66] with some late elements from A Feast for Crows and A Dance with Dragons[67]'] 
['Season 7', 'April 21, 2016[50]', 'Second half of 2016[49]', 'Mid-2017[5]', 'Mid-2017[5]', 'Original content and outlined from The Winds of Winter and A Dream of Spring[66]']
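
Because the markup of a live page can change, the same heading-anchored selection can be tried offline first. The following is only a sketch under stated assumptions: the inline HTML is a hypothetical stand-in for the page, and `rows_after_heading` is an illustrative helper name, not part of lxml. It uses an XPath variable (`$h`) for the heading text and returns an empty list when the heading is missing, instead of raising an `IndexError` on `[0]`.

```python
from lxml.html import fromstring

# Hypothetical stand-in for the page: two headings, each followed by a table.
html = """
<div>
  <h3><span>Cast</span></h3>
  <table class="wikitable">
    <tr><th>Actor</th></tr>
    <tr><td>Someone</td></tr>
  </table>
  <h3><span>Adaptation schedule</span></h3>
  <table class="wikitable">
    <tr><th>Season</th><th>Premiere</th></tr>
    <tr><td>Season 1</td><td>April 17, 2011</td></tr>
    <tr><td>Season 2</td><td>April 1, 2012</td></tr>
  </table>
</div>
"""
root = fromstring(html)


def rows_after_heading(root, heading):
    """Return the body rows of the first table following the given h3 heading."""
    # [1] keeps only the nearest following sibling table.
    tables = root.xpath(
        ".//h3[span = $h]/following-sibling::table[1]", h=heading
    )
    if not tables:
        return []  # heading not found: no rows rather than an IndexError
    rows = tables[0].xpath(".//tr")[1:]  # skip the header row
    return [[cell.text_content().strip() for cell in row.xpath("./td")]
            for row in rows]


for row in rows_after_heading(root, "Adaptation schedule"):
    print(row)
```

Passing the heading as a variable avoids quoting problems if the heading text ever contains an apostrophe.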