Validating (X)HTML in Python

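How do I validate that a document conforms to some version of HTML (preferably one that I can specify)? I'd like to know where the failures occur, as in a web-based validator, except in a native Python app.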


2008-08-30 cdleary

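Note that validating is different from tidying! Some of the answers posted here are about automatically correcting HTML, rather than just verifying whether the HTML is valid. – Flimm 2017-05-26 12:00:49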

XHTML is easy: use lxml.
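As a rough illustration of why XHTML is the easy case: because XHTML is XML, any XML parser can report the line and column where parsing fails. The sketch below deliberately uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. It only checks well-formedness, not conformance to a particular XHTML DTD, and the function name check_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xhtml_text):
    """Return None if the text parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(xhtml_text)
    except ET.ParseError as exc:
        line, column = exc.position  # position of the first parse failure
        return (line, column, str(exc))
    return None


good = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<head><title>t</title></head><body><p>ok</p></body></html>"
)
bad = "<html><body><p>unclosed</body></html>"  # <p> is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

Named entities such as &nbsp; would be rejected by this stdlib parser because it does not load the XHTML DTD, which is one reason a fuller library is preferable for real validation.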

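HTML is harder, since traditionally there has not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute an external application such as nsgmls or OpenJade, and then parse its output.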
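A minimal sketch of the "parse its output" half of that approach. It assumes the external tool prints one message per line in the form program:file:line:column:severity: text. That layout, the sample messages, and the name parse_sgml_messages are assumptions for illustration, so check what your installed tool actually prints before relying on the pattern.

```python
import re

# Assumed message layout: program:file:line:column:severity: text
MESSAGE = re.compile(
    r"^o?nsgmls:(?P<file>.*?):(?P<line>\d+):(?P<column>\d+):"
    r"(?P<severity>[A-Z]):\s*(?P<text>.*)$"
)


def parse_sgml_messages(output):
    """Turn validator output into (line, column, severity, text) tuples."""
    found = []
    for raw in output.splitlines():
        match = MESSAGE.match(raw.strip())
        if match:
            found.append((
                int(match.group("line")),
                int(match.group("column")),
                match.group("severity"),
                match.group("text"),
            ))
    return found


# Hypothetical output; in practice this text would come from running the
# external program (for example with subprocess.run) and reading its stderr.
SAMPLE_OUTPUT = """\
nsgmls:page.html:3:6:E: first made-up error message
nsgmls:page.html:7:12:W: second made-up warning message
"""

for line, column, severity, text in parse_sgml_messages(SAMPLE_OUTPUT):
    print("%s at line %d, column %d: %s" % (severity, line, column, text))
```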


2008-08-30 01:20:52

Note: the lxml link is broken. Maybe use [this one](http://lxml.de/)? – 2014-07-10 18:17:57

I think that HTML tidy will do what you want. There is a Python binding for it.


2008-08-30 01:48:07 Neall

Try tidylib. You can get some really basic bindings as part of the elementtidy module (which builds element trees from HTML documents). http://effbot.org/downloads/#elementtidy

>>> import _elementtidy 
>>> xhtml, log = _elementtidy.fixup("<html></html>") 
>>> print log 
line 1 column 1 - Warning: missing <!DOCTYPE> declaration 
line 1 column 7 - Warning: discarding unexpected </html> 
line 1 column 14 - Warning: inserting missing 'title' element

Parsing the log should give you pretty much everything you need.
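One possible way to parse that log, sketched with a regular expression that matches the "line N column M - Level: message" shape shown in the session above. The function name parse_tidy_log is made up for this example, and other Tidy versions may format lines differently.

```python
import re

# Matches lines such as:
#   line 1 column 1 - Warning: missing <!DOCTYPE> declaration
LOG_LINE = re.compile(r"line (\d+) column (\d+) - (\w+): (.*)")


def parse_tidy_log(log):
    """Return a list of (line, column, level, message) tuples."""
    entries = []
    for raw in log.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            line, column, level, message = match.groups()
            entries.append((int(line), int(column), level, message))
    return entries


sample_log = """\
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element
"""

for entry in parse_tidy_log(sample_log):
    print(entry)
```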


2008-08-30 01:55:50

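You can decide to install the HTML validator locally and create a client to request the validation.

Here is a program I made to validate a list of URLs in a txt file. I only check the HEAD response to get the validation status, but if you do a GET you will get the full results. Look at the validator's API; there are plenty of options.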

import httplib2
import time

h = httplib2.Http(".cache")

# read the URLs to check, one per line
with open("urllistfile.txt", "r") as f:
    urllist = f.readlines()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    resp = {}
    url = url.strip()
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['x-w3c-validator-status'] == "Abort":
            print(url, "FAIL")
        else:
            print(url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings'])
    except Exception:
        # skip URLs for which the request failed or the headers are missing
        pass
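The script above appends the page URL to the query string unescaped, which can break when the URL itself contains characters such as & or ?. A small sketch of building the request URL with urllib.parse.urlencode instead. The endpoint and the doctype and uri parameters are taken from the script, while the helper name build_check_url and the example page URL are made up.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

VALIDATOR = "http://qa-dev.w3.org/wmvs/HEAD/check"  # endpoint used above


def build_check_url(page_url, doctype="HTML5", base=VALIDATOR):
    """Return the validator URL with the page URL safely percent-encoded."""
    return base + "?" + urlencode({"doctype": doctype, "uri": page_url})


check_url = build_check_url("http://example.com/a b?x=1&y=2")
print(check_url)

# Decoding the query string gives back the original page URL unchanged.
print(parse_qs(urlsplit(check_url).query)["uri"][0])
```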


2009-03-14 22:42:12 karlcow

Unfortunately, `html5lib` [does not validate](http://stackoverflow.com/a/29992363/593047). – 2017-01-25 01:15:39

PyTidyLib is a nice Python binding for HTML Tidy. Their example:

from tidylib import tidy_document
document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
    options={'numeric-entities':1})
print(document)
print(errors)

It is also compatible with both the legacy HTML Tidy and the new tidy-html5.


2009-08-14 18:04:38

Package in Debian: python-tidylib – sumid 2012-10-22 22:26:15

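I think the most elegant way is to invoke the W3C Validation Service at

http://validator.w3.org/

programmatically. Few people know that you do not have to screen-scrape the result page in order to get the results, because the service returns non-standard HTTP header parameters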

X-W3C-Validator-Recursion: 1 
X-W3C-Validator-Status: Invalid (or Valid) 
X-W3C-Validator-Errors: 6 
X-W3C-Validator-Warnings: 0

that indicate the validity and the number of errors and warnings.

For instance, the command line

curl -I "http://validator.w3.org/check?uri=http%3A%2F%2Fwww.stalsoft.com"

HTTP/1.1 200 OK 
Date: Wed, 09 May 2012 15:23:58 GMT 
Server: Apache/2.2.9 (Debian) mod_python/3.3.1 Python/2.5.2 
Content-Language: en 
X-W3C-Validator-Recursion: 1 
X-W3C-Validator-Status: Invalid 
X-W3C-Validator-Errors: 6 
X-W3C-Validator-Warnings: 0 
Content-Type: text/html; charset=UTF-8 
Vary: Accept-Encoding 
Connection: close

Thus, you can elegantly invoke the W3C Validation Service and extract the results from the HTTP headers:

# Programmatic XHTML Validations in Python
# Martin Hepp and Alex Stolz
# [email protected]/[email protected]

import urllib.parse
import urllib.request

URL = "http://validator.w3.org/check?uri=%s"
SITE_URL = "http://www.heppnetz.de"

# pattern for HEAD request adapted from
# http://stackoverflow.com/questions/4421170/python-head-request-with-urllib2

request = urllib.request.Request(URL % urllib.parse.quote(SITE_URL), method='HEAD')
response = urllib.request.urlopen(request)

# the validator reports its verdict in non-standard response headers
valid = response.headers.get('X-W3C-Validator-Status') == "Valid"
errors = int(response.headers.get('X-W3C-Validator-Errors'))
warnings = int(response.headers.get('X-W3C-Validator-Warnings'))

print("Valid markup: %s (Errors: %i, Warnings: %i) " % (valid, errors, warnings))
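For testing this logic without network access, the header handling can be separated from the request. The sketch below parses raw header text like the curl -I output shown earlier. The function name summarize_validator_headers is made up, and treating a missing count header as 0 is an assumption of this example rather than documented service behaviour.

```python
def summarize_validator_headers(raw_headers):
    """Parse raw HTTP header text and summarize the validator's verdict."""
    headers = {}
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the status line "HTTP/1.1 200 OK" has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return {
        "valid": headers.get("x-w3c-validator-status") == "Valid",
        "errors": int(headers.get("x-w3c-validator-errors", 0)),
        "warnings": int(headers.get("x-w3c-validator-warnings", 0)),
    }


sample_headers = """\
HTTP/1.1 200 OK
Date: Wed, 09 May 2012 15:23:58 GMT
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
Content-Type: text/html; charset=UTF-8
"""

print(summarize_validator_headers(sample_headers))
```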


2012-05-09 15:53:46

The W3C Validator also has a full web service API, and there is a Python binding for it: https://bitbucket.org/nmb10/py_w3c – 2012-05-09 16:22:30

In my case, the Python W3C/HTML validation packages found with pip search w3c did not work (as of September 2016).

I solved this with the python requests library:

$ pip install requests 

$ python 
Python 2.7.12 (default, Jun 29 2016, 12:46:54) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin 
Type "help", "copyright", "credits" or "license" for more information. 

>>> import requests
>>> r = requests.post('https://validator.w3.org/nu/',
...     data=open('index.html', 'rb').read(),
...     params={'out': 'json'},
...     headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
...     'Content-Type': 'text/html; charset=UTF-8'})

>>> r.text
u'{"messages":[{"type":"info", ...

>>> r.json()
{u'messages': [{u'lastColumn': 59, ...

For more documentation, see the W3C Validator API.
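Since the question asks where the failures occur, here is a hedged sketch of pulling locations out of the parsed JSON. The 'messages', 'type' and 'lastColumn' keys appear in the output above. The 'lastLine' and 'message' keys and the "error" type value are assumed from the checker's JSON format, and the payload below is invented for illustration.

```python
def error_locations(report):
    """Extract (line, column, message) for every entry of type 'error'."""
    locations = []
    for entry in report.get("messages", []):
        if entry.get("type") == "error":
            locations.append((
                entry.get("lastLine"),
                entry.get("lastColumn"),
                entry.get("message"),
            ))
    return locations


# Hypothetical payload shaped like the r.json() result; values are made up.
sample_report = {
    "messages": [
        {"type": "info", "lastLine": 1, "lastColumn": 59,
         "message": "An informational note."},
        {"type": "error", "lastLine": 4, "lastColumn": 12,
         "message": "A made-up error message."},
    ]
}

for line, column, message in error_locations(sample_report):
    print("line %s, column %s: %s" % (line, column, message))
```

With a real response, error_locations(r.json()) would be the equivalent call.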


2016-09-05 19:30:06 r3x

Here is a very basic HTML validator based on lxml's HTMLParser. It does not require any internet connection.

_html_parser = None
def validate_html(html):
    global _html_parser
    from lxml import etree
    from io import StringIO
    # build the strict (non-recovering) parser once and reuse it
    if not _html_parser:
        _html_parser = etree.HTMLParser(recover=False)
    # raises etree.XMLSyntaxError if the markup cannot be parsed
    return etree.parse(StringIO(html), _html_parser)

Note that this does not check for closing tags, so, for example, the following will pass:

validate_html("<a href='example.com'>foo</a>")

But the following will not:

validate_html("<a href='example.com'>foo</a")
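If missing closing tags matter, a separate check can be layered on top. The sketch below is not lxml-based: it swaps in the standard library's html.parser to track open elements and report the position of any that are never closed. The names TagBalanceChecker and find_unbalanced_tags are made up. The check is also stricter than HTML itself, which allows some end tags (such as </p> or </li>) to be omitted.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, (line, column)) for each open element
        self.problems = []  # (line, column, message)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
        else:
            line, column = self.getpos()
            self.problems.append((line, column, "unexpected </%s>" % tag))


def find_unbalanced_tags(html):
    """Return (line, column, message) tuples for unbalanced tags."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack was opened but never closed.
    for tag, (line, column) in checker.stack:
        checker.problems.append((line, column, "<%s> is never closed" % tag))
    return checker.problems


print(find_unbalanced_tags("<a href='example.com'>foo</a>"))
print(find_unbalanced_tags("<a href='example.com'>foo"))
```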


2016-10-24 23:11:17 speedplane
