How to convert a BeautifulSoup.ResultSet to a string

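So I parsed an HTML page with BeautifulSoup's .findAll and named the output result. If I type result in the Python shell and press Enter, I see normal text as expected. However, I want to post-process this result as a string object, and I noticed that str(result) returns garbage, like this sample: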

\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>

The HTML page source is UTF-8 encoded.

How should I handle this?

The code is basically this, if it matters:

import urllib
from BeautifulSoup import BeautifulSoup
# urllib.urlopen is the Python 2 call; urllib has no open() function
soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)

The Python version is 2.7.

Source

2011-10-16 theta

Please show your code. – cetver

Python 2.6.7, BeautifulSoup version 3.2.0.

This worked for me:

unicode.join(u'\n',map(unicode,result))

I'm fairly sure result is a BeautifulSoup.ResultSet object, which appears to be an extension of the standard Python list.
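The same idea as a self-contained sketch, written for Python 3, where str is already the Unicode type. FakeTag and FakeResultSet are hypothetical stand-ins invented for illustration so that the example runs without BeautifulSoup; they are not part of the library. With a real result set, presumably only the final join would be needed.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag (not a BeautifulSoup class)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


class FakeResultSet(list):
    """Hypothetical stand-in: a plain list subclass, as ResultSet appears to be."""


result = FakeResultSet([
    FakeTag('<a>\u0447\u0438\u043b\u043d\u0438\u0446\u0430</a>'),
    FakeTag('<hr />'),
])

# Convert each element to text, then join the pieces with newlines.
joined = '\n'.join(map(str, result))
```

Because the container is just a list, any list idiom (map, a comprehension, a for loop) works on it; the join only decides how the pieces are separated.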

Source

2012-03-26 01:15:41

This is not garbage, it is UTF-8 encoded text. Use Unicode instead.
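A minimal sketch of this point, using only the byte sequence quoted in the question (the variable names raw and text are illustrative, not from the original post). The backslash escapes are simply how Python displays a UTF-8 byte string; decoding it yields readable text.

```python
# The escaped bytes from the question, written as a byte string.
raw = b'\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0'

# Decoding the UTF-8 bytes gives a Unicode string of Cyrillic letters.
text = raw.decode('utf-8')

# Each of these letters takes two bytes in UTF-8.
print(len(raw))
print(len(text))

# Encoding again round-trips to the same bytes, so nothing was lost.
assert text.encode('utf-8') == raw
```

The sketch runs unchanged on Python 2.7 and Python 3, since it only relies on a byte literal and decode.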

Source

2011-10-16 06:43:10

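It's a commonly used term for describing character decoding/encoding problems, certainly not literal garbage. – theta

But there is no problem. This is UTF-8 encoded text; you just don't recognize it. –

Use this: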

unicodedata.normalize('NFKC', p.decode('utf-8')).encode('ascii','ignore')

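Unicode has multiple normalization forms. The output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
For Python's Unicode implementation, see this document (including for normalization).

Source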

2011-10-16 06:43:27

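'soup.originalEncoding' returns 'utf-8'. 'result', which is a BS.ResultSet object, does not support this attribute. I certainly don't want to decode 'utf-8' and encode to ASCII, because I would lose all the foreign (non-English) characters. I want to get a 'utf-8' string object from this BS.ResultSet object. – theta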
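The loss described in this comment can be seen in a small sketch. The sample string is an assumption made up for illustration: the Cyrillic word from the question followed by a few ASCII letters.

```python
import unicodedata

# Illustrative sample: the Cyrillic word from the question plus some ASCII.
text = u'\u0447\u0438\u043b\u043d\u0438\u0446\u0430 abc'

# Normalize, then encode to ASCII while silently dropping whatever does not fit.
ascii_only = unicodedata.normalize('NFKC', text).encode('ascii', 'ignore')

# Encoding to UTF-8 instead keeps every character.
utf8_bytes = text.encode('utf-8')

# Only the ASCII part survives the first conversion.
assert ascii_only == b' abc'
# The UTF-8 bytes decode back to the full original text.
assert utf8_bytes.decode('utf-8') == text
```

Normalization changes how characters are composed, not which script they belong to, so it cannot make Cyrillic text fit into ASCII.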

Have you tried the link given in @Ignacio's answer? –

import urllib
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns multiple parsed results
result = soup.findAll(something)
# then iterate over the results
for line in result:
    # get the str value of each item; replace 'charset' with utf-8 or whichever charset you need
    print line.__str__('charset')


BTW: the BeautifulSoup version is beautifulsoup-3.2.1.

Source

2013-08-22 15:30:39 ChangePicture
