Handling Indian languages in BeautifulSoup

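I am trying to scrape the NDTV website for news headlines. This is the page I am using as the HTML source. I am using BeautifulSoup (bs4) to process the HTML, and I have everything working except that the code breaks when it encounters Hindi headlines on the linked page.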

My code so far is:

import urllib2 
from bs4 import BeautifulSoup 

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html" 
FileName = "NDTV_2012_01.txt" 

fptr = open(FileName, "w") 
fptr.seek(0) 

page = urllib2.urlopen(htmlUrl) 
soup = BeautifulSoup(page, from_encoding="UTF-8") 

li = soup.findAll('li') 
for link_tag in li: 
    hypref = link_tag.find('a').contents[0] 
    strhyp = str(hypref) 
    fptr.write(strhyp) 
    fptr.write("\n")

The error I get is:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

I get the same error even if I leave out the from_encoding parameter. I originally passed it as fromEncoding, but Python warned me that this spelling is deprecated.

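How do I fix this? From what I have read, I need to either skip the Hindi headlines or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!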


2013-01-19 Kitchi

What you are looking at is a NavigableString instance (which derives from Python's unicode type):

(Pdb) hypref.encode('utf-8') 
'NDTV' 
(Pdb) hypref.__class__ 
<class 'bs4.element.NavigableString'> 
(Pdb) hypref.__class__.__bases__ 
(<type 'unicode'>, <class 'bs4.element.PageElement'>)

You need to encode it to UTF-8 with:

hypref.encode('utf-8')
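
To see why str() fails while encode('utf-8') works, here is a minimal sketch. It assumes a sample headline (the Hindi word for "Hindi", not taken from the NDTV page) standing in for hypref. Under Python 2's default settings, str() on a unicode object encodes with the ASCII codec, so encoding to ASCII by hand reproduces the same error.

```python
# Hypothetical sample headline: the Hindi word for "Hindi", written with
# escapes so the snippet does not depend on the source-file encoding.
headline = u"\u0939\u093f\u0928\u094d\u0926\u0940"

# The ASCII codec cannot represent Devanagari characters, which is the
# failure shown in the traceback.
try:
    headline.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields a byte string that can
# be written to a file opened in the usual way.
utf8_bytes = headline.encode("utf-8")
```

In the question's loop, this corresponds to replacing str(hypref) with hypref.encode('utf-8').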


2013-01-19 09:32:50

strhyp = hypref.encode('utf-8')

http://joelonsoftware.com/articles/Unicode.html
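
A possible alternative, not taken from the answers here, is to open the output file with an explicit encoding so that the file object does the encoding itself. The sketch below assumes io.open from the standard library. The headlines list and both function names are hypothetical stand-ins for the NavigableString values the question's loop collects.

```python
import io
import os
import tempfile

# Hypothetical stand-ins for the link texts; in the real script these would
# be the NavigableString objects taken from link_tag.find('a').
headlines = [u"NDTV", u"\u0939\u093f\u0928\u094d\u0926\u0940"]


def write_headlines(path, titles):
    """Write one title per line, letting the file object encode to UTF-8."""
    with io.open(path, "w", encoding="utf-8") as out:
        for title in titles:
            # Unicode text is written directly; no str() call is involved.
            out.write(title + u"\n")


def read_headlines(path):
    """Read the file back as Unicode text, one title per list entry."""
    with io.open(path, encoding="utf-8") as src:
        return src.read().splitlines()


# Round trip through a temporary file to check that nothing is lost.
demo_path = os.path.join(tempfile.mkdtemp(), "headlines.txt")
write_headlines(demo_path, headlines)
round_trip = read_headlines(demo_path)
```

With this approach the text stays as Unicode inside the program and is converted to bytes only at the point of writing.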


2013-01-19 09:28:31
