Getting text from a web page as an iterable object in Python 3.3

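I'm trying to get the text from a web page in Python 3.3 and then search that text for specific strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy I need to save the text after each category in the card info (card text, rarity, etc.). Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code: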

urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
    ).read()

htmlRaw = BeautifulSoup(urlStr)

htmlText = htmlRaw.get_text

for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"


2014-01-31 CrazyBurrito

Please include the full traceback you get from the error.

There are much better tools for HTML parsing and scraping than this.

@Guy So why not name some?

This is incorrect:

htmlText = htmlRaw.get_text

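Since get_text is a method of the BeautifulSoup class, you are assigning the method itself to htmlText, not its result. You could call it as htmlRaw.get_text(), or use the property variant, which does what you want here: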

htmlText = htmlRaw.text
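
Note that even then the text is a single string, so looping over it yields characters rather than lines, and strings have no next() method. Below is a minimal sketch of one way to walk the text line by line and pick up the line that follows each label. It assumes each value sits on the line right after its label, which may not match the real page's layout; the sample text and the helper name values_after_labels are made up for illustration.

```python
def values_after_labels(text, labels):
    """Return {label: next non-empty line} for each label found in text."""
    # Drop blank lines so "the next line" means the next line with content.
    lines = iter(line.strip() for line in text.splitlines() if line.strip())
    found = {}
    for line in lines:
        if line in labels:
            # next() advances the same iterator; the default avoids StopIteration
            # when a label happens to be the last line.
            found[line] = next(lines, '')
    return found


# Hypothetical sample standing in for htmlRaw.text
sample_text = """
Converted Mana Cost:
3
Types:
Enchantment
Rarity:
Rare
"""

info = values_after_labels(sample_text, {"Converted Mana Cost:", "Types:", "Rarity:"})
print(info)
```

The key point is that the for loop and next() share one iterator, so consuming the value line also skips it in the loop.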

You are also using an HTML parser simply to strip the tags, when you could use it to target the data you want:

# unique id for the html section containing the card info 
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol' 

# grab the html section with the card info 
card_data = htmlRaw.find(id=card_id) 

# collect the rows of the card info section
card_rows = card_data.find_all('div', 'row')

# pair each row's label with its value and dump the pairs into a dictionary
# (reading both from the row directly avoids a lambda that closes over the loop variable)
cards = dict(
    (row.find('div', 'label').text.strip(), row.find('div', 'value').text.strip())
    for row in card_rows
)

print(cards)

{u'Artist:': u'Scott Chou', 
u'Card Name:': u'Dark Prophecy', 
u'Card Number:': u'93', 
u'Card Text:': u'Whenever a creature you control dies, you draw a card and lose 1 life.', 
u'Community Rating:': u'Community Rating: 3.617/5\xa0\xa0(64 votes)', 
u'Converted Mana Cost:': u'3', 
u'Expansion:': u'Magic 2014 Core Set', 
u'Flavor Text:': u'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.', 
u'Mana Cost:': u'', 
u'Rarity:': u'Rare', 
u'Types:': u'Enchantment'}

Now you have a dictionary with the information you want (plus some extras), which will be much easier to work with.
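
For instance, the message from the question could be built from such a dictionary roughly as follows. This is only a sketch: the cards sample is a subset copied from the output above, and the fields list and the build_message name are assumptions for illustration.

```python
# Subset of the dictionary shown above, used here as sample data
cards = {
    'Converted Mana Cost:': '3',
    'Types:': 'Enchantment',
    'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
    'Rarity:': 'Rare',
}

# (key on the page, heading used in the message); list order controls output order
fields = [
    ('Converted Mana Cost:', 'Converted Mana Cost'),
    ('Types:', 'Type'),
    ('Card Text:', 'Rules Text'),
    ('Flavor Text:', 'Flavor Text'),
    ('Rarity:', 'Rarity'),
]


def build_message(cards, fields):
    """Format the selected fields, skipping any the card does not have."""
    parts = []
    for key, heading in fields:
        value = cards.get(key)
        if value:
            parts.append("*{}: {}* \n\n".format(heading, value))
    return "".join(parts)


message = build_message(cards, fields)
print(message)
```

Using cards.get(key) means a card without, say, flavor text simply produces no line for it instead of raising a KeyError.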


2014-01-31 01:28:53

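When I use this, I get the error "AttributeError: 'NoneType' object has no attribute 'find_all'" on the line card_rows ... – CrazyBurrito

Which version of BeautifulSoup are you using, 3 or 4? Also, what do you get if you print 'card_data'?

I think I'm using 4.4. When I print card_data, it gives None. If I print str(card_data), it prints the page's HTML. – CrazyBurrito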
