Working with UTF-8 pages

I'm just playing around with urllib2 and a page that uses UTF-8:

http://www.columbia.edu/~fdc/utf8/

I fetch just the first 700 bytes (the top section of the page):

>>> import urllib2 
>>> from urllib2 import HTTPError, URLError 
>>> import BaseHTTPServer 
>>> opener = urllib2.OpenerDirector() 
>>> opener.add_handler(urllib2.HTTPHandler()) 
>>> opener.add_handler(urllib2.HTTPDefaultErrorHandler()) 
>>> response = opener.open('http://www.columbia.edu/~fdc/utf8/') 
>>> content = response.read(700)

At this point I would expect the string in the content variable, being UTF-8 encoded, to display fine.

However:

>>> content 
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n<BASE href="http://kermit.columbia.edu">\n<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>UTF-8 Sampler</title>\n</head>\n<body bgcolor="#ffffff" text="#000000">\n<h1><tt>UTF-8 SAMPLER</tt></h1>\n\n<big><big>&nbsp;&nbsp;\xc2\xa5&nbsp;\xc2\xb7&nbsp;\xc2\xa3&nbsp;\xc2\xb7&nbsp;\xe2\x82\xac&nbsp;\xc2\xb7&nbsp;$&nbsp;\xc2\xb7&nbsp;\xc2\xa2&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa1&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa2&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa3&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa4&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa5&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa6&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa7&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa8&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa9&nbsp;\xc2\xb7&nbsp;\xe2\x82\xaa&nbsp;\xc2\xb7&nbsp;\xe2\x82\xab&nbsp;\xc2\xb7&nbsp;\xe2\x82\xad&nbsp;\xc2\xb7&nbsp;\xe2\x82\xae&nbsp;\xc2\xb7&nbsp;\xe2\x82\xaf&nbsp;\xc2\xb7&nbsp;&#8377</big></big>\n\n\n\n<p>\n<blockquote>\nFrank da Cruz<br>\n<a hre'

It looks HTML-escaped, so:

>>> import HTMLParser 
>>> h = HTMLParser.HTMLParser() 
>>> h.unescape(content) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 390, in unescape 
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub 
    return _compile(pattern, flags).sub(repl, string, count) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

So I don't get it. I even tried calling .encode('utf-8') along the way, but got a similar error.

What is the best way to display UTF-8 content from a website?

2013-01-21 Wizzard

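Why do you want only the first 700 characters? Why not parse the whole document and extract the data from it? There are also lxml, BeautifulSoup and so on.

I'm only fetching the first 700 so that I can see the UTF-8 characters. I don't want to test the whole page, just the first sequence. – Wizzard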
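
One thing to keep in mind when reading a fixed number of bytes: the cut can land in the middle of a multi-byte UTF-8 sequence (it happens not to in the 700 bytes shown above). A minimal sketch, assuming Python 3 and a made-up byte string standing in for the page, of how an incremental decoder copes with that:

```python
import codecs

# Stand-in for a page body: UTF-8 bytes for "¥ · €" (hypothetical sample data).
page = b'\xc2\xa5 \xc2\xb7 \xe2\x82\xac'

# Cutting at byte 8 splits the three-byte sequence for the euro sign.
head, tail = page[:8], page[8:]

try:
    head.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False   # strict decoding rejects the incomplete sequence

# An incremental decoder buffers the incomplete bytes until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
first = decoder.decode(head)
rest = decoder.decode(tail, final=True)
print(repr(first), repr(rest))
```

If only the head is ever needed, decoding with errors='ignore' is another option, at the cost of silently dropping the partial character.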

You need to decode the page from UTF-8 to Unicode; there *are* UTF-8 sequences in there (next to the non-breaking-space HTML entities):

>>> print h.unescape(content.decode('utf8')) 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
<html> 
<head> 
<BASE href="http://kermit.columbia.edu"> 
<META http-equiv="Content-Type" content="text/html; charset=utf-8"> 
<title>UTF-8 Sampler</title> 
</head> 
<body bgcolor="#ffffff" text="#000000"> 
<h1><tt>UTF-8 SAMPLER</tt></h1> 

<big><big>  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · &#8377</big></big> 



<p> 
<blockquote> 
Frank da Cruz<br> 
<a hre

You have encoding and decoding confused; the content is already UTF-8 encoded, so it needs to be decoded, not encoded.
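
To make the direction of the two operations concrete, here is a small sketch, assuming Python 3 (where the fetched data is a bytes object) and a short hand-written byte string rather than the real download. In Python 2 the traceback in the question most likely comes from the byte string being mixed with the unicode replacement characters, which triggers an implicit ASCII decode. The last step below provokes the same codec error explicitly.

```python
# Stand-in for the fetched bytes: UTF-8 for "¥ · €" (not a real download).
raw = b'\xc2\xa5 \xc2\xb7 \xe2\x82\xac'

# decode(): bytes -> text. This is the step the question was missing.
text = raw.decode('utf-8')
print(len(raw), len(text))   # byte count vs. character count: 9 5

# encode(): text -> bytes. The round trip gives back the original bytes.
round_trip = text.encode('utf-8')

# Treating the bytes as ASCII fails on the first non-ASCII byte, the same
# failure the traceback in the question reports for byte 0xc2.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
    print(ascii_error)
```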

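Note that &#8377 is an error in the page itself: the closing ; has been left off. An HTML5 parser or a browser would probably assume the ; and decode it anyway: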

>>> print h.unescape('&#8377;') 
₹

You'd have to repair those entities with a regular expression first:

>>> import re 
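>>> # numeric references lacking the closing ';' (hex digits run a-f; skip ones already terminated)
>>> brokenrefs = re.compile(r'(&#x?[a-f0-9]+)\b(?!;)', re.I) 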
>>> print h.unescape(brokenrefs.sub(r'\1;', content.decode('utf8'))) 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
<html> 
<head> 
<BASE href="http://kermit.columbia.edu"> 
<META http-equiv="Content-Type" content="text/html; charset=utf-8"> 
<title>UTF-8 Sampler</title> 
</head> 
<body bgcolor="#ffffff" text="#000000"> 
<h1><tt>UTF-8 SAMPLER</tt></h1> 

<big><big>  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · ₹</big></big> 



<p> 
<blockquote> 
Frank da Cruz<br> 
<a hre
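
The repair step can also be tried on its own. The following is a sketch, assuming Python 3, where html.unescape takes the place of HTMLParser.unescape; the names BROKEN_REF and repair_refs and the sample string are made up for illustration. The pattern only touches numeric references that are not already followed by a semicolon, so well-formed ones are left alone.

```python
import html
import re

# Numeric character reference (decimal or hex) that is not already
# followed by ';'. The \b keeps the match from ending mid-number.
BROKEN_REF = re.compile(r'(&#(?:x[0-9a-f]+|[0-9]+))\b(?!;)', re.I)


def repair_refs(text):
    """Append the missing ';' to unterminated numeric references."""
    return BROKEN_REF.sub(r'\1;', text)


# Hypothetical sample: one broken reference and two well-formed ones.
sample = '&#8377</big> and &#x20b9; and &#8364;'
repaired = repair_refs(sample)
print(repaired)
print(html.unescape(repaired))

# In Python 3, html.unescape follows the HTML5 rules and accepts the
# unterminated numeric form directly, so the repair may be unnecessary there.
lenient = html.unescape('&#8377')
```

On Python 2.7, where HTMLParser.unescape insists on the semicolon, the repair step is still the part that matters.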

2013-01-21 11:46:06

decode expects 'utf-8' rather than 'utf8', right? >> print h.unescape(content.decode('utf8')) – spazm

It doesn't matter; both are aliases for the 'utf_8' codec. See the [list of standard encodings](http://docs.python.org/2/library/codecs.html#standard-encodings).
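
The alias claim is easy to check from the interpreter. A quick sketch, assuming Python 3 (the canonical name reported by the registry may be spelled differently on other versions, but all spellings should resolve to the same codec):

```python
import codecs

# Several spellings people commonly pass to decode()/encode().
names = ['utf8', 'utf-8', 'utf_8', 'UTF-8']

# Ask the codec registry what each spelling resolves to.
canonical = {name: codecs.lookup(name).name for name in names}
print(canonical)

# All spellings decode the same bytes to the same text.
decoded = {b'\xe2\x82\xac'.decode(name) for name in names}
```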

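You have misread your output. There is no HTML encoding going on here: when you simply type content at the REPL, it shows you the repr() version of the text.

Doing print content gives you what you expect: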

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
<html> 
<head> 
<BASE href="http://kermit.columbia.edu"> 
<META http-equiv="Content-Type" content="text/html; charset=utf-8"> 
<title>UTF-8 Sampler</title> 
</head> 
<body bgcolor="#ffffff" text="#000000"> 
<h1><tt>UTF-8 SAMPLER</tt></h1> 

<big><big>&nbsp;&nbsp;¥&nbsp;·&nbsp;£&nbsp;·&nbsp;€&nbsp;·&nbsp;$&nbsp;·&nbsp;¢&nbsp;·&nbsp;₡&nbsp;·&nbsp;₢&nbsp;·&nbsp;₣&nbsp;·&nbsp;₤&nbsp;·&nbsp;₥&nbsp;·&nbsp;₦&nbsp;·&nbsp;₧&nbsp;·&nbsp;₨&nbsp;·&nbsp;₩&nbsp;·&nbsp;₪&nbsp;·&nbsp;₫&nbsp;·&nbsp;₭&nbsp;·&nbsp;₮&nbsp;·&nbsp;₯&nbsp;·&nbsp;&#8377</big></big> 



<p> 
<blockquote> 
Frank da Cruz<br> 
<a hre
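
The difference between the two displays can be reproduced with a few bytes. This is a sketch assuming Python 3, where the bytes have to be decoded before printing; the sample bytes are hand-written, not fetched. Note that the &nbsp; entity is unaffected either way, since it is plain ASCII text in the page.

```python
# Hand-written sample: UTF-8 bytes for "¥", an &nbsp; entity, then "·".
raw = b'\xc2\xa5&nbsp;\xc2\xb7'

# What the interactive prompt echoes for a bare expression: the repr(),
# with every non-ASCII byte written as a \x escape.
shown_by_repl = repr(raw)
print(shown_by_repl)

# What print shows once the bytes are decoded to text.
text = raw.decode('utf-8')
print(text)
```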

2013-01-21 11:47:24

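That only works because your terminal uses UTF-8 encoding. The content *is* UTF-8 encoded, and the OP should decode from there (and then let Python's print statement detect the terminal's output encoding).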
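
A rough sketch of what relying on the terminal encoding involves, assuming Python 3; the fallback to 'utf-8' when no encoding is reported is my own choice for the example, not something the comment prescribes:

```python
import sys

# Decoded text: "¥ €".
text = b'\xc2\xa5 \xe2\x82\xac'.decode('utf-8')

# Encoding Python believes the terminal uses; it can be None when the
# output is piped, hence the fallback (an assumption for this sketch).
target = sys.stdout.encoding or 'utf-8'

# errors='replace' substitutes '?' for characters the target cannot
# represent instead of raising UnicodeEncodeError.
encoded = text.encode(target, errors='replace')
print(target, len(encoded))

# The same text forced through an ASCII-only terminal encoding.
ascii_fallback = text.encode('ascii', errors='replace')
```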

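Moreover, past those first 700 bytes there are more HTML entities for characters, *and* there is an error in the HTML entity in the text between the '' elements, where the semicolon was dropped.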
