xml.etree.ElementTree.ParseError: not well-formed (invalid token) because of a "<" sign in a script

I am trying to parse a web page so that I can save some of its data to an Excel or CSV file.

import urllib.request 
import xml.etree.ElementTree as ET 

url = "http://rusdrama.com/afisha" 
response = urllib.request.urlopen(url) 
content = response.read() 
root = ET.fromstring(content)

When I parse the page with ElementTree's fromstring method, I get the following error:

Traceback (most recent call last): 
    File "D:/PythonProjects/PythonMisc/theater_reader.py", line 7, in <module> 
    root = ET.fromstring(content) 
    File "D:\Python\Python35\lib\xml\etree\ElementTree.py", line 1333, in XML 
    parser.feed(text) 
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 49, column 14

The relevant part of the received page looks like this:

<script> 
    jQuery(document).ready(function(){ 
    jQuery(window).scroll(function() { 
     var scroll = jQuery(window).scrollTop(); 
     if (scroll >= 100) { 
      jQuery(".t3-header").addClass("solid"); 
     } 
     if (scroll <= 100) { 
      jQuery(".t3-header").removeClass("solid"); 
     } 
    }); 
    }) 
</script>

And line 49 specifically:

if (scroll <= 100) {

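So the problem is the opening angle bracket, which seems to be treated as the start of a tag. I have seen several similar questions, but I could not work out how to handle this case.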
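The failure can be reproduced without any network access. The following is a minimal sketch, assuming a made-up inline snippet (the name `snippet` and its contents are illustrative, not the real page) in which a script contains a bare "<":

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded page: a script with a bare "<".
snippet = """<html>
<script>
if (scroll <= 100) {
}
</script>
</html>"""

try:
    ET.fromstring(snippet)
    error_position = None
except ET.ParseError as exc:
    # exc.position is a (line, column) tuple for the offending token.
    error_position = exc.position
    print("ParseError:", exc)
```

The reported line is the one holding the `<=` comparison, which matches the behaviour seen in the traceback above.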

Source

2016-11-16 Aleks Lee

You are opening this with an XML parser. XML requires '<', '>' and '&' to be escaped. – njzk2

You probably want to use an HTML parser. – njzk2

Thanks! I hadn't thought of using something other than an XML parser :) –
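To illustrate the escaping point from the comments above, here is a small sketch of what a well-formed XML document carrying the same script text could look like. It assumes you generate the XML yourself; the `<root>` wrapper and the variable names are hypothetical. For a third-party HTML page this is usually not practical, which is why an HTML parser is the better fit.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

js = "if (scroll <= 100) { }"

# Option 1: escape the special characters before embedding the text.
escaped_doc = "<root><script>" + escape(js) + "</script></root>"
escaped_text = ET.fromstring(escaped_doc).find("script").text

# Option 2: wrap the text in a CDATA section so it is not read as markup.
cdata_doc = "<root><script><![CDATA[" + js + "]]></script></root>"
cdata_text = ET.fromstring(cdata_doc).find("script").text

print(escaped_text)
print(cdata_text)
```

In both cases the parser hands back the original script text unchanged.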

You are trying to parse HTML with an XML parser. Use the right tool for the job instead, an HTML parser: BeautifulSoup and lxml.html are the most popular.

Demo:

>>> from bs4 import BeautifulSoup 
>>> import urllib.request 
>>> 
>>> url = "http://rusdrama.com/afisha" 
>>> response = urllib.request.urlopen(url) 
>>> 
>>> soup = BeautifulSoup(response, "html.parser") 
>>> print(soup.title.get_text()) 
Афиша Харьковского академического русского драматического театра Пушкина
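If installing a third-party package is not an option, the standard library's html.parser module may be enough for simple extraction. This is a rough sketch only, not part of the demo above: the `TitleParser` class and the inline `page` string are hypothetical, and the page is a made-up stand-in that includes a script with "<=".

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Script contents also arrive here as plain data, not as tags.
        if self._in_title:
            self.title += data


# Hypothetical page: the script contains the "<=" that broke the XML parser.
page = """<html><head><title>Afisha</title>
<script>
if (scroll <= 100) { }
</script></head><body></body></html>"""

parser = TitleParser()
parser.feed(page)
parser.close()
print(parser.title)
```

The HTML parser treats the contents of the script element as raw text, so the comparison operator does not trip it up.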

Source

2016-11-16 20:36:43 alecxe

Thanks! That helped me. –
