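'utf8' codec can't decode byte 0x96 in Python

I am trying to check whether a certain word is present on the pages of many sites. The script runs fine for, say, 15 sites and then stops.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15344: invalid start byte

I did a Stack Overflow search and found many questions about this, but I can't seem to understand what went wrong in my case.

I would like to either solve it or, if a site raises an error, skip that site. Please advise how to do this, as I am new to Python and the code below took me a day to write. By the way, the site the script halted on was http://www.homestead.com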

import urllib
import re

filetocheck = open("bloglistforcommenting", "r")
resultfile = open("finalfile", "w")

for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    # This line raises UnicodeDecodeError when the page is not valid UTF-8
    page = htmlfile.read().decode('utf8')
    match = re.search("Enter your name", page)
    if match:
        print "match found : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename

print "Finished Operations"

Following Mark's comment, I changed the code to use BeautifulSoup:

htmlfile = urllib.urlopen("http://www.homestead.com") 
page = BeautifulSoup((''.join(htmlfile))) 
print page.prettify()

Now I am getting this error:

page = BeautifulSoup((''.join(htmlfile))) 
TypeError: 'module' object is not callable

I am trying the Quick Start example from http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start. If I copy and paste that example, the code works fine.
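For readers hitting the same TypeError: Python raises it when a module object is called as if it were a class or function. The import line of the failing snippet is not shown, so the cause here is an assumption, but a plain `import BeautifulSoup` (instead of the `from BeautifulSoup import BeautifulSoup` used in the working code) would produce exactly this message. A minimal sketch in Python 3, using the standard `json` module as a stand-in:

```python
import json  # stand-in module; any module object behaves the same way


def call_module_by_mistake():
    """Call a module object as if it were callable and return the error text."""
    try:
        json("{}")  # wrong: `json` is the module, not a function inside it
    except TypeError as exc:
        return str(exc)
    return None


message = call_module_by_mistake()
print(message)

# Calling a name that lives inside the module works as expected.
parsed = json.loads("{}")
```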

I finally got it working. Thanks, everyone, for the help. Here is the final code.

import urllib
import re
from BeautifulSoup import BeautifulSoup

filetocheck = open("listfile", "r")

resultfile = open("finalfile", "w")
error = "for errors"

for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    page = BeautifulSoup((''.join(htmlfile)))
    pagetwo = str(page)
    match = re.search("Enter YourName", pagetwo)
    if match:
        print "match found : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename

print "Finished Operations"


2011-10-24 Vishal Khialani

Many web pages are encoded incorrectly. For parsing HTML, try BeautifulSoup, as it can handle many kinds of incorrect HTML found in the wild.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Emphasis mine.


2011-10-24 09:29:44

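I would rather skip the site. Can I do something like decode('utf8', somecodeforerrortoskip)?

user976847: There are a lot of other advantages to using BeautifulSoup. I think you should give it a try.

I'll take a look at it, thanks.

The site 'http://www.homestead.com' doesn't claim to be sending you UTF-8; the response actually claims to be ISO-8859-1: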

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

You must use the correct encoding for the page you actually receive, not just guess at random.
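One way to find out what a page declares, rather than guessing, is to look for a `charset=` declaration in the first bytes of the document. The sketch below is Python 3 (the question's scripts are Python 2), and the function name, the regular expression, and the 2048-byte scan window are illustrative assumptions, not part of any library:

```python
import re

# Deliberately loose pattern: finds charset=... in the raw bytes of a page.
META_CHARSET = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
                          re.IGNORECASE)


def sniff_meta_charset(raw, scan_bytes=2048):
    """Return the lower-cased charset declared near the top of `raw`, or None."""
    match = META_CHARSET.search(raw[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=ISO-8859-1"></head></html>')
declared = sniff_meta_charset(sample)
print(declared)
```

The declared value is only a claim made by the page, so the decode step should still be prepared for it to be wrong.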


2011-10-24 09:35:25 Duncan

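The thing is, I have a huge list of sites, and this is just the first of many errors. What is the best way to skip a site if I hit a decoding error?

'charset=ISO-8859-1' is the web equivalent of "the check is in the mail".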

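The byte at position 15344 is 0x96. Presumably position 15343 holds either a single-byte encoding of a character or the last byte of a multi-byte encoding, which makes 15344 the start of a character. 0x96 is 10010110 in binary, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.

Hence the stream is either not UTF-8 or else is corrupted.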
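The bit-pattern argument can be checked directly. The following is a minimal Python 3 sketch; the helper `utf8_byte_role` is a hypothetical name introduced for illustration, and the short byte string stands in for the real page:

```python
def utf8_byte_role(byte):
    """Classify one byte value (0-255) by the role it can play in UTF-8."""
    if byte < 0x80:
        return "single-byte character (ASCII)"
    if (byte & 0xC0) == 0x80:      # bit pattern 10xxxxxx
        return "continuation byte"
    if 0xC2 <= byte <= 0xF4:       # bytes that may start a multi-byte sequence
        return "lead byte of a multi-byte character"
    return "never valid in UTF-8"


print(format(0x96, "08b"), utf8_byte_role(0x96))

# A continuation byte where a character should start triggers the same error.
error_position = None
error_reason = None
try:
    b"abc\x96def".decode("utf-8")
except UnicodeDecodeError as exc:
    error_position = exc.start
    error_reason = exc.reason
print(error_position, error_reason)
```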

Examining the URI you linked to, we find the header:

Content-Type: text/html

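Since no encoding is declared there, we should use the HTTP default, which is ISO-8859-1 (also known as "Latin 1").

Examining the content, we find the line: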

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

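This is a fallback mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told that the character encoding is ISO-8859-1.

Hence, there is no reason to expect that reading it as UTF-8 will work.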
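The header check described above can be sketched in Python 3 with the standard library's `email.message.Message`, which knows how to parse a Content-Type value. The function name and the `HTTP_DEFAULT_CHARSET` constant are assumptions made for this example; the default mirrors the one this answer cites:

```python
from email.message import Message

HTTP_DEFAULT_CHARSET = "iso-8859-1"  # the default this answer refers to


def charset_from_content_type(header_value, default=HTTP_DEFAULT_CHARSET):
    """Return the charset named in a Content-Type header, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=UTF-8"))
```

With a bare `text/html` header, as this site sends, the function falls back to the default, which matches the reasoning above but not necessarily the bytes the server actually sent.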

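For extra fun, though, consider that in ISO-8859-1 the byte 0x96 encodes U+0096, the control character "START OF GUARDED AREA", and we find that ISO-8859-1 is not correct either. It seems the people who created the page made an error similar to yours.

From context, it would seem that they actually used Windows-1252, since in that encoding 0x96 encodes U+2013 (EN DASH, which looks like –).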
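The difference between the two encodings for this one byte is easy to confirm. A short Python 3 sketch using only the built-in codecs and `unicodedata`:

```python
import unicodedata

raw = b"\x96"

# ISO-8859-1 maps every byte to the code point with the same number.
as_latin1 = raw.decode("iso-8859-1")
# Windows-1252 assigns printable characters to most of the 0x80-0x9F range.
as_cp1252 = raw.decode("windows-1252")

latin1_category = unicodedata.category(as_latin1)  # "Cc" means control character
cp1252_name = unicodedata.name(as_cp1252)

print(hex(ord(as_latin1)), latin1_category)
print(hex(ord(as_cp1252)), cp1252_name)
```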

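So, to parse this particular page, you want to decode it as Windows-1252. More generally, you want to examine the headers when picking a character encoding. In this case the header may well be incorrect (or perhaps not; more than a few "ISO-8859-1" codecs are actually Windows-1252), but you will be right more often than by guessing.

You still need something to catch failures like this one by reading with a fallback. The decode method takes a second parameter called errors. The default is 'strict', but you can also use 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate here), or 'backslashreplace' (not appropriate here), and you can register your own fallback handler with codecs.register_error().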
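The options in this answer can be sketched as follows in Python 3. The function `decode_page`, the handler name `"cp1252-fallback"`, and the sample bytes are all assumptions made for illustration. Returning `(None, None)` gives the calling loop a simple way to skip a site, which is what the question asked for:

```python
import codecs


def decode_page(raw, encodings=("utf-8", "windows-1252")):
    """Try each encoding strictly; return (text, encoding) or (None, None)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None  # caller can treat this as "skip this site"


def cp1252_fallback(exc):
    """Custom error handler: reinterpret undecodable bytes as Windows-1252."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("windows-1252", errors="replace"), exc.end
    raise exc


codecs.register_error("cp1252-fallback", cp1252_fallback)

raw = b"Enter your name \x96 now"
text, used = decode_page(raw)
ignored = raw.decode("utf-8", errors="ignore")     # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")   # inserts U+FFFD
mixed = raw.decode("utf-8", errors="cp1252-fallback")
print(used, repr(text))
```

Strict decoding with an ordered list of candidates keeps the failure visible, whereas 'ignore' and 'replace' silently change the text being searched.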


2011-10-24 09:58:35

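To fix Windows-1252 content embedded in UTF-8, you can use [`bs4.UnicodeDammit.detwingle()`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#inconsistent-encodings) – jfs

An in-depth answer that explains what the error (almost certainly) is. Unfortunately, this is impossible to understand without working at the byte level, which many people are not prepared to do. Thanks for going the extra mile :-) – Forbesmyester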
