lxml returns an empty result when parsing HTML, while BeautifulSoup returns a reasonable parse

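I understand that lxml is conventionally said to be stricter than BeautifulSoup. What I don't understand is the following: lxml returns an empty result when parsing this HTML, while BeautifulSoup returns a reasonable parse.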

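(Basically, I am requesting web pages from a Chinese site and want to select some spans. Similar pages work without errors, but for some links lxml simply fails to parse.)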

In [1]: headers = {'User-Agent': ''} 

In [2]: url = 'http://basic.10jqka.com.cn/600219/company.html' 

In [3]: headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0'} 

In [6]: import lxml.html 

In [7]: res = requests.get(url, headers=headers) 

In [8]: tree = lxml.html.fromstring(res.content) 
--------------------------------------------------------------------------- 
ParserError        Traceback (most recent call last) 
<ipython-input-8-b512dc78ed68> in <module>() 
----> 1 tree = lxml.html.fromstring(res.content) 

/home/jgu/repos/.venv36/lib/python3.6/site-packages/lxml/html/__init__.py in fromstring(html, base_url, parser, **kw) 
    874  else: 
    875   is_full_html = _looks_like_full_html_unicode(html) 
--> 876  doc = document_fromstring(html, parser=parser, base_url=base_url, **kw) 
    877  if is_full_html: 
    878   return doc 

/home/jgu/repos/.venv36/lib/python3.6/site-packages/lxml/html/__init__.py in document_fromstring(html, parser, ensure_head_body, **kw) 
    763  if value is None: 
    764   raise etree.ParserError(
--> 765    'Document is empty') 
    766  if ensure_head_body and value.find('head') is None: 
    767   value.insert(0, Element('head')) 

ParserError: Document is empty 

In [12]: from bs4 import BeautifulSoup 

In [13]: soup = BeautifulSoup(res.content, 'html.parser') 

In [14]: soup.title 
Out[14]: <title>南山鋁業(600219) 公司資料_F10_同花順金融服務網</title> 

In [15]: sel_query = (
    ...:  '#detail > div.bd > table > tbody > tr:nth-of-type(1) > ' 
    ...:  'td:nth-of-type(2) > span' 
    ...:) 

In [16]: soup.select(sel_query) 
Out[16]: [<span>山東南山鋁業股份有限公司</span>] 

In [17]: soup.select(sel_query)[0].text 
Out[17]: '山東南山鋁業股份有限公司'

As I said above, links such as http://basic.10jqka.com.cn/600000/company.html do not work.

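So I can fall back to bs4 when the parse result is empty, but I would like to understand why lxml simply cannot build a reasonable DOM tree from the source. Thanks.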
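The fallback described above could be sketched roughly as follows. The helper name parse_html and its (kind, document) return shape are hypothetical and only for illustration; the sketch assumes lxml and bs4 are installed, and it does not explain why lxml fails on the page in question.

```python
import lxml.html
from lxml import etree
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse with lxml first; fall back to BeautifulSoup if lxml gives up.

    Returns a (kind, document) pair so the caller knows which API to use.
    This helper is hypothetical, not part of either library.
    """
    try:
        return "lxml", lxml.html.fromstring(html)
    except (etree.ParserError, etree.XMLSyntaxError):
        # lxml.html.fromstring raises ParserError("Document is empty")
        # when the parser produces no root element.
        return "bs4", BeautifulSoup(html, "html.parser")


if __name__ == "__main__":
    sample = "<html><head><title>t</title></head><body><p>x</p></body></html>"
    kind, doc = parse_html(sample)
    print(kind, doc.findtext(".//title"))
```

Because the two libraries expose different query APIs, the caller still has to branch on the returned kind (for example, soup.select(...) for bs4 versus XPath for lxml).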

Asked 2017-07-03 by Junchao Gu

The fromstring function in lxml.html expects a string-typed argument, whereas response.content returns bytes. Using response.text would be correct.

As for BeautifulSoup, its constructor accepts a string or a file-like object.
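A minimal sketch of this suggestion: decode the raw bytes to text yourself and hand the resulting string to lxml. The helper name parse_decoded is hypothetical, and the "gbk" encoding in the sample is an assumption chosen only to illustrate a non-UTF-8 Chinese page; the thread does not confirm the real page's encoding, nor that this resolves the failure.

```python
import lxml.html


def parse_decoded(raw_bytes, encoding):
    """Decode raw bytes with an explicit encoding, then parse the text.

    Hypothetical helper: `encoding` must be supplied by the caller,
    for example from the HTTP response headers.
    """
    # Undecodable bytes are replaced rather than raising, so one bad
    # byte sequence does not abort the whole parse.
    text = raw_bytes.decode(encoding, errors="replace")
    return lxml.html.fromstring(text)


if __name__ == "__main__":
    # Stand-in for res.content; the encoding is assumed, not verified.
    raw = (
        "<html><head><title>南山铝业</title></head>"
        "<body><span>x</span></body></html>"
    ).encode("gbk")
    tree = parse_decoded(raw, "gbk")
    print(tree.findtext(".//title"))
```

With requests, res.text performs a comparable decoding step using the encoding stored in res.encoding, so passing res.text to lxml.html.fromstring is the shorter form of the same idea.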

Answered 2017-07-03 09:54:15 by stamaimer

I am using Python 2. I have edited my question. –

Do you mean that some pages work with both parsers, while some pages only work with 'bs'? – stamaimer

Yes. For many links lxml just works, but it fails on some examples and returns nothing. –
