Python sys.stdin raises a UnicodeDecodeError

I am trying to write a (very) basic web crawler with cURL and Python's BeautifulSoup library, because this is much easier to understand than GNU awk and a pile of regular expressions.

Currently, I am trying to pipe the contents of a web page into the program with curl (i.e. curl http://www.example.com/ | ./parse-html.py).

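For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I have looked at this answer and this answer about invalid start bytes, but could not work out from them how to solve the problem).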

Specifically, I tried using a.encode('utf-8').split() from the first answer. The second answer only explains the problem (Python found an invalid start byte) and does not give a solution.

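I have tried redirecting curl's output to a file (i.e. curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, but this results in the same UnicodeDecodeError.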

I also checked that locale charmap outputs UTF-8, which, as far as I know, means my system encodes characters as UTF-8 (which makes this UnicodeDecodeError particularly puzzling).

At this point I am stuck. The exact line that causes the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I have tried rewriting it as a for loop, but I get the same issue.

What exactly is causing the UnicodeDecodeError, and how should I fix it?

EDIT: Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixed the problem.

2016-01-20 5donuts

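The problem occurs during reading, not encoding: the input is not encoded as UTF-8 but in some other encoding. In a UTF-8 shell, you can easily reproduce the problem with: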

$ echo 2¥ | iconv -t iso8859-1 | python3 -c 'import sys;sys.stdin.readline()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte
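
The same failure can be shown without a shell pipeline. The following is only a minimal sketch: the helper name try_decode is made up for illustration, and the sample bytes mirror the iconv example above (the yen sign is the single byte 0xA5 in ISO-8859-1, which is not a valid start byte in UTF-8).

```python
# "2\u00a5\n" is the text "2¥" plus a newline; in ISO-8859-1 the yen sign is byte 0xA5.
raw = "2\u00a5\n".encode("iso8859-1")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or the error message on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + str(exc)


print(repr(raw))                            # the raw bytes as Python sees them
print(try_decode(raw, "utf-8"))             # fails: 0xa5 is an invalid start byte
print(repr(try_decode(raw, "iso8859-1")))   # succeeds with the right encoding
```

Calling .encode('utf-8') afterwards cannot help, because the error is raised while Python decodes the incoming bytes, before any of your own code sees the text.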

You can read the file as binary (sys.stdin.buffer.read(), or with open(..., 'rb') as f: f.read()), which gives you a bytes object, then examine it and guess the encoding. The actual algorithm for doing this is documented in the HTML standard.
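
As a rough sketch of the "read as binary, then guess" idea (this is not the algorithm from the HTML standard, only a simple try-in-order fallback), the function below reads bytes from a binary stream and tries a list of candidate encodings. The function name and the candidate list are assumptions chosen for illustration. In the real script you would pass sys.stdin.buffer instead of the io.BytesIO stand-ins.

```python
import io


def read_and_decode(stream, candidates=("utf-8", "iso8859-1")):
    """Read all bytes from a binary stream and decode them with the first
    candidate encoding that works. Returns (text, encoding_used)."""
    data = stream.read()
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed (iso8859-1 accepts any bytes, so
    # with the default list this is a safety net): replace undecodable bytes.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# io.BytesIO stands in for sys.stdin.buffer here.
text, used = read_and_decode(io.BytesIO(b"2\xa5\n"))
print(repr(text), used)

text, used = read_and_decode(io.BytesIO("2\u00a5\n".encode("utf-8")))
print(repr(text), used)
```

Note that iso8859-1 maps every possible byte to a character, so it never fails; a successful decode with it does not prove that the guess was right.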

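However, in many cases the encoding is not specified in the file itself but in the HTTP Content-Type header. Unfortunately, your curl invocation does not capture this header. Instead of using curl and Python, you can simply use Python alone, since it can already download URLs. Stealing the encoding detection algorithm from youtube-dl, we get something like this: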

import re
import urllib.request


def guess_encoding(content_type, webpage_bytes):
    # Prefer the charset declared in the HTTP Content-Type header.
    # The header may be missing, so fall back to an empty string.
    m = re.match(
        r'[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+\s*;\s*charset="?([a-zA-Z0-9_-]+)"?',
        content_type or '')
    if m:
        encoding = m.group(1)
    else:
        # Otherwise look for a <meta ... charset=...> tag near the top of the page.
        m = re.search(br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)[ /\'">]',
                      webpage_bytes[:1024])
        if m:
            encoding = m.group(1).decode('ascii')
        elif webpage_bytes.startswith(b'\xff\xfe'):
            # UTF-16 little-endian byte order mark
            encoding = 'utf-16'
        else:
            # Nothing declared: assume UTF-8
            encoding = 'utf-8'

    return encoding


def download_html(url):
    with urllib.request.urlopen(url) as urlh:
        content = urlh.read()  # raw bytes, not yet decoded
        encoding = guess_encoding(urlh.getheader('Content-Type'), content)
        return content.decode(encoding)

print(download_html('https://phihag.de/2016/iso8859.php'))
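
As an alternative to the regular expression for the Content-Type header, the standard library can parse the charset parameter itself. This is only a sketch: the function name charset_from_content_type and the 'utf-8' default are assumptions for illustration. It relies on email.message.Message.get_content_charset(), which returns the charset in lower case. The headers attribute of the object returned by urllib.request.urlopen is a subclass of Message, so urlh.headers.get_content_charset() should work there directly.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.
    Returns `default` if the header is missing or declares no charset."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns `default`
    # (its failobj argument) when there is no charset parameter.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

This covers only the header; a <meta> tag or byte order mark inside the page still needs a check on the bytes themselves.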

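There are also a number of libraries (though not in the standard library) that support this out of the box, namely requests. I also recommend that you read up on the basics of what encodings are.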

2016-01-20 02:42:53 phihag
