lxml.html5parser: not working with Arabic/Persian HTML5

I'm using lxml's html5parser. It works fine with ASCII characters, but when I download an HTML file that contains Persian and Russian characters, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 418: ordinal not in range(128)

This is the response text: http://paste.ubuntu.com/23552349/

And this is my code (as you can see, I only strip out all characters that are not valid in XML):

f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data) 
resp = f.text 
if resp == "": 
    return [] 
resp = resp.encode("utf-8")
resp = ''.join(c for c in resp if valid_xml_char_ordinal(c)) 
doc = html5parser.fragment_fromstring(resp.encode("utf-8"), guess_charset=False, create_parent='div')

If I delete the line resp = resp.encode("utf-8"), this error appears instead:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters


2016-11-29 Morteza Ezzabady

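To the lazy idiots with the mouse cursor ready on -1: this is not a duplicate :))

I don't understand why you encode the response to UTF-8. What happens if you just pass the response to fragment_fromstring as unicode?

@StephaneMartin If you don't, you get the useChardet error!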

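When using html5parser directly, I also run into some strange inconsistencies (TypeError: __init__() got an unexpected keyword argument 'useChardet' and the like).

If you already have lxml installed, using the BeautifulSoup wrapper is a pleasure.

First install BeautifulSoup (pip install beautifulsoup4). Then: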

import requests 
from bs4 import BeautifulSoup 

# (initialize headers, cookies and data) 

f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data) 
resp = f.text 
if not resp: 
    return [] 
doc = BeautifulSoup(resp, 'lxml')

You can then use BeautifulSoup's clean API to manipulate the HTML tree. Under the hood, it still uses lxml for parsing.

See the BeautifulSoup API reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


2016-11-29 10:33:55

https://beautiful-soup-4.readthedocs.io/

resp = ''.join(c for c in resp if valid_xml_char_ordinal(c))

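This attempt to filter out bad characters doesn't work, because the control characters in your input are actually encoded as numeric character references rather than as raw characters: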

<td class="artistFlux">السيف النشيد الدولة الإسلامية التي من شأن&#16</td>

Specifically, &#16 (obscured here by the right-to-left text). Control characters such as U+0010 (decimal 16) are invalid in HTML5, even as character references.
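To make the point concrete, here is a minimal sketch of why the per-character filter misses the problem. The question never shows valid_xml_char_ordinal, so the definition below is an assumption (the usual XML 1.0 character ranges), and strip_raw_invalid_chars is a hypothetical name for the join expression from the question.

```python
def valid_xml_char_ordinal(c):
    """Assumed definition: True if c is a character XML 1.0 allows."""
    cp = ord(c)
    return (
        0x20 <= cp <= 0xD7FF
        or cp in (0x9, 0xA, 0xD)
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_raw_invalid_chars(text):
    """Same filter as in the question: drop characters XML does not allow."""
    return "".join(c for c in text if valid_xml_char_ordinal(c))


# A raw U+0010 character is removed by the filter...
raw_control = "abc\x10def"
# ...but the reference "&#16" is just '&', '#', '1', '6', all valid characters.
char_reference = "abc&#16</td>"

print(repr(strip_raw_invalid_chars(raw_control)))
print(repr(strip_raw_invalid_chars(char_reference)))
```

The filter only ever sees the four harmless characters that spell out the reference, so the reference survives until the parser expands it.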

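It would be best if you could fix the upstream script that produces this cruft, but if you have to remove such bad character references from the input, you could add another filter that removes them with a regular expression like &#(3[01]|2[0-9]|1[124-9]|[0-8])(?=[^0-9]).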
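A rough sketch of such a filter, using only Python's re module. The names BAD_CHAR_REF and strip_bad_char_refs and the sample markup are made up for illustration, and the pattern is the one from this answer with the bracket typo removed.

```python
import re

# Decimal references to control characters 0-8, 11, 12 and 14-31.
# Tab (9), LF (10) and CR (13) are deliberately not matched.
# The lookahead stops "&#16" from matching inside a longer number like "&#160".
BAD_CHAR_REF = re.compile(r"&#(3[01]|2[0-9]|1[124-9]|[0-8])(?=[^0-9])")


def strip_bad_char_refs(markup):
    """Remove decimal character references that point at control characters."""
    return BAD_CHAR_REF.sub("", markup)


sample = '<td class="artistFlux">text&#16</td><td>a&#10;b &#160;c</td>'
cleaned = strip_bad_char_refs(sample)
print(cleaned)
```

Note the limits of this sketch: it leaves a trailing semicolon behind if the reference was written as &#16; and it does not touch hexadecimal references such as &#x10;. Whether that matters depends on what the upstream script actually emits.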

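By the way, you don't need the encode and decode round trip. You can read the raw bytes of the response from f.content and feed them straight to html5parser, which avoids decoding the response to text and then re-encoding it to bytes. You may also need fragments_fromstring (plural), since your input has two top-level elements.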


2016-11-29 22:45:14 bobince
