2014-01-31 19 views
0

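Get text from a web page as an iterable object in Python 3.3

I'm trying to get the text of a web page in Python 3.3 and then search that text for specific strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy I need to save the text after each category in the card info (card text, rarity, etc.). Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code: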

urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
    ).read()

htmlRaw = BeautifulSoup(urlStr)

htmlText = htmlRaw.get_text

for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"
+0

Please include the full traceback you get from the error.

+0

There are much better tools for HTML parsing and scraping than this.

+0

@Guy So why not name some?

Answer

0

This is incorrect:

htmlText = htmlRaw.get_text 

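Since get_text is a method of the BeautifulSoup class, you are assigning the method itself to htmlText, not its result. There is a property variant of it that does what you want here: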

htmlText = htmlRaw.text 
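
Note that .text (like get_text()) gives back one string, so a for loop over it yields single characters rather than lines, and strings have no next() method. Below is a minimal sketch of walking the text line by line and picking up the line that follows each label. It assumes labels and values land on separate lines; the sample text and the helper name values_after_labels are made up for illustration, and the real page's text layout may differ.

```python
# Sample text standing in for htmlRaw.text (an assumption; the real page
# content and line layout will differ).
sample_text = """
Converted Mana Cost:
3
Types:
Enchantment
Rarity:
Rare
"""


def values_after_labels(text, labels):
    """Return a dict mapping each label to the next non-empty line after it."""
    # Keep only non-empty, stripped lines and walk them with a single iterator.
    lines = iter(line.strip() for line in text.splitlines() if line.strip())
    found = {}
    for line in lines:
        if line in labels:
            # next() advances the same iterator, so the value line is consumed
            # here; the default "" avoids StopIteration on a trailing label.
            found[line] = next(lines, "")
    return found


result = values_after_labels(
    sample_text, {"Converted Mana Cost:", "Types:", "Rarity:"}
)
print(result)
```

Calling the built-in next() on the iterator, rather than on the string, is what makes "give me the following line" work.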

You are also using an HTML parser simply to strip tags, when you could use it to target the data you want:

# unique id for the html section containing the card info 
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol' 

# grab the html section with the card info 
card_data = htmlRaw.find(id=card_id) 

# create a generator to iterate over the rows 
card_rows = (row for row in card_data.find_all('div', 'row')) 

# create a generator to produce functions for retrieving the values 
card_rows_getters = (lambda x: row.find('div', x).text.strip() for row in card_rows) 

# create a generator to get the values 
card_values = ((get('label'), get('value')) for get in card_rows_getters) 

# dump them into a dictionary 
cards = dict(card_values) 

print(cards)

{'Artist:': 'Scott Chou',
'Card Name:': 'Dark Prophecy',
'Card Number:': '93',
'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
'Community Rating:': 'Community Rating: 3.617/5\xa0\xa0(64 votes)',
'Converted Mana Cost:': '3',
'Expansion:': 'Magic 2014 Core Set',
'Flavor Text:': 'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
'Mana Cost:': '',
'Rarity:': 'Rare',
'Types:': 'Enchantment'}

Now you have a dictionary of the information you want (plus some extra), which will be much easier to work with.
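
As a rough sketch of how the message from the question could be built from such a dictionary: the FIELDS list and the build_message helper are names made up here, and the sample dictionary just copies a few values from the output shown in this answer.

```python
# Sample dictionary shaped like the one built in this answer.
cards = {
    'Converted Mana Cost:': '3',
    'Types:': 'Enchantment',
    'Card Text:': 'Whenever a creature you control dies, '
                  'you draw a card and lose 1 life.',
    'Rarity:': 'Rare',
}

# (label as it appears on the page, heading used in the message), in the
# order the question's code emits them. This mapping is an assumption.
FIELDS = [
    ('Converted Mana Cost:', 'Converted Mana Cost'),
    ('Types:', 'Type'),
    ('Card Text:', 'Rules Text'),
    ('Flavor Text:', 'Flavor Text'),
    ('Rarity:', 'Rarity'),
]


def build_message(card):
    """Format the selected fields, skipping any that are missing or empty."""
    parts = []
    for label, heading in FIELDS:
        value = card.get(label)
        if value:
            parts.append("*{}: {}* \n\n".format(heading, value))
    return "".join(parts)


message = build_message(cards)
print(message)
```

Using dict.get means a card without, say, flavor text simply leaves that entry out instead of raising a KeyError.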

+0

When I use this, I get the error "AttributeError: 'NoneType' object has no attribute 'find_all'" on the card_rows ... line – CrazyBurrito

+0

Which version of BeautifulSoup are you using, 3 or 4? Also, what do you get if you print card_data?

+0

I think I'm using 4.4, and when I print card_data it gives None. If I print str(card_data), it prints the page's HTML – CrazyBurrito
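
For context on the error in these comments: find() returns None when nothing in the parsed document matches, so calling find_all on the result raises exactly that AttributeError. A minimal sketch of guarding against it follows, assuming BeautifulSoup 4 is installed. The tiny HTML snippet, the id 'card', and the helper name extract_rows are invented for illustration and are not the real page's markup.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the real page's ids and structure may differ.
html = (
    '<div id="card">'
    '<div class="row">'
    '<div class="label"> Rarity: </div><div class="value"> Rare </div>'
    '</div>'
    '</div>'
)


def extract_rows(soup, section_id):
    """Return {label: value} for a section, or {} if the id is not found."""
    section = soup.find(id=section_id)
    if section is None:
        # No element with that id: report "nothing found" instead of failing
        # later with AttributeError on None.
        return {}
    return {
        row.find('div', 'label').text.strip(): row.find('div', 'value').text.strip()
        for row in section.find_all('div', 'row')
    }


soup = BeautifulSoup(html, 'html.parser')
found = extract_rows(soup, 'card')
missing = extract_rows(soup, 'no_such_id')
print(found, missing)
```

If the lookup comes back empty on the live page, the id used in the answer may simply not match the markup that was actually downloaded.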