How to read the HTML body from a website using Python 3.x

I want to connect to a specific website link and receive the HTTP response. I have the following Python code:

import urllib.request 
import os,sys,re,datetime 

fp = urllib.request.urlopen("http://www.python.org") 
mybytes = fp.read() 

mystr = mybytes.decode(encoding=sys.stdout.encoding) 
fp.close()

When I pass the response to BeautifulSoup(str(mystr), 'html.parser') to get clean text out of the HTML, I get the following error:

return codecs.charmap_encode(input,self.errors,encoding_table)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 1139: character maps to <undefined>.

How can I solve this problem?

Full code:

import urllib.request 
import os,sys,re,datetime 
fp = urllib.request.urlopen("http://www.python.org") 
mybytes = fp.read() 

mystr = mybytes.decode(encoding=sys.stdout.encoding) 
fp.close() 


from bs4 import BeautifulSoup 
soup = BeautifulSoup(str(mystr), 'html.parser') 
mystr = soup; 
print(mystr.get_text())

Source

2015-08-14 Fawzi Belal

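BeautifulSoup is perfectly happy to consume the file-like object returned by urlopen directly. (If you use the requests library, you can also avoid these complications.)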

from urllib.request import urlopen 
from bs4 import BeautifulSoup 

with urlopen("...") as website: 
    soup = BeautifulSoup(website, 'html.parser')

print(soup.prettify())

Source

2015-08-14 19:59:30

No, it doesn't work, so please try it yourself before answering! 'charmap' codec can't encode character '\xdc' in position 18136: character maps to <undefined>

It works perfectly on my computer. Did you use "http://www.python.org" instead of "..."?

I did, and it still fails: return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 7846: character maps to <undefined>
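
The traceback quoted in these comments ends in codecs.charmap_encode, which is an encoding step. That suggests the failure happens when print() writes the text to a console whose code page has no slot for the character, rather than during the download or the parsing. Below is a minimal sketch that reproduces this offline and shows one possible workaround. The code page cp1252 and the helper name printable are assumptions made for illustration; the thread does not say which code page the asker's console uses.

```python
def printable(text, encoding):
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# U+25BC is the character named in the traceback.
sample = "Downloads \u25bc"

# cp1252 is an assumed console code page, not one stated in the thread.
console_encoding = "cp1252"

try:
    sample.encode(console_encoding)
    reproduced = False
except UnicodeEncodeError:
    reproduced = True  # the same kind of failure print() hits on such a console

# Round-trip through the console encoding so every character is printable.
safe = printable(sample, console_encoding)
print(safe)
```

With the real page, the last step would look like print(printable(soup.get_text(), sys.stdout.encoding)). On Python 3.7 or later, calling sys.stdout.reconfigure(encoding="utf-8") before printing is another option, provided the terminal can actually display UTF-8.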

：）

import requests
from bs4 import BeautifulSoup

# response.text is the body already decoded to str by requests
response = requests.get("http://www.python.org")

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.get_text())

Source

2015-08-14 20:00:13

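It still doesn't work, I get the same problem. Please check before answering. return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 1139: character maps to <undefined>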

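I did check my code, and it runs on my computer. I have now checked your code as well, and it also runs on my computer. That probably means the problem is not in your code. You may want to check that all your software is up to date and see whether that helps.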
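
One detail in the question's code may be worth a second look regardless of machine: it decodes the downloaded bytes with sys.stdout.encoding, which is the console's encoding and not necessarily the page's. Below is a minimal sketch, assuming the server declares a charset in its Content-Type header, that decodes with that charset instead. The helper names charset_from_content_type and decode_body are hypothetical, and the bytes are a stand-in so the example runs without a network connection.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`."""
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset() or default


def decode_body(body, content_type):
    """Decode response bytes using the charset the server declared."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset, errors="replace")


# Stand-ins for fp.read() and fp.headers["Content-Type"]; no network needed.
body = "caf\u00e9 \u25bc".encode("utf-8")
text = decode_body(body, "text/html; charset=UTF-8")

# Falls back to the default when the header names no charset.
fallback = charset_from_content_type("text/html")
```

With urllib, fp.headers.get_content_charset() returns the same value directly from the response object, so the first helper is only needed when all you have is the header string.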
