Sorry, I don't understand what you are trying to do.
But, for example, you can easily collect all unique headers in a dictionary:
from urllib.request import urlopen
import re

address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen(address)
# get page content
response = str(webPage.read(), encoding='utf-8')
# keep only the content of <h*> tags (DOTALL lets a header span several lines)
p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL)
headers = re.findall(p, response)
# headers dict: tag name -> list of unique header texts
my_headers = {}
for (tag, value) in headers:
    if tag not in my_headers:
        my_headers[tag] = []
    # remove all tags inside (re.sub returns a new string, so assign it back)
    value = re.sub('<[^>]*>', '', value)
    # replace a few special chars
    value = value.replace('&lt;', '<')
    value = value.replace('&gt;', '>')
    if value not in my_headers[tag]:
        my_headers[tag].append(value)
# output
print(my_headers)
Output:
{'h2': ['The HTML <head> Element', 'Omitting <html> and <body>?', 'Omitting <head>', 'The HTML <title> Element', 'The HTML <style> Element', 'The HTML <link> Element', 'The HTML <meta> Element', 'The HTML <script> Element', 'The HTML <base> Element', 'HTML head Elements', 'Your Suggestion:', 'Thank You For Helping Us!'], 'h4': ['Top 10 Tutorials', 'Top 10 References', 'Top 10 Examples', 'Web Certificates'], 'h1': ['HTML <span class="color_h1">Head</span>'], 'h3': ['Example', 'W3SCHOOLS EXAMS', 'COLOR PICKER', 'SHARE THIS PAGE', 'LEARN MORE:', 'HTML/CSS', 'JavaScript', 'HTML Graphics', 'Server Side', 'Web Building', 'XML Tutorials', 'HTML', 'CSS', 'XML', 'Charsets']}
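The regex above only matches header tags written without attributes (for example `<h2>`, not `<h2 class="x">`). If that turns out to be too strict, one possible alternative is the standard library's `html.parser`. The following is only a sketch of that different technique, not the approach used in the answer; `HeaderCollector`, `collect_headers`, and `sample` are hypothetical names introduced for illustration, and the sample HTML is made up so the example runs without network access.

```python
from html.parser import HTMLParser
import re

# Matches exactly h1..h6 (HTMLParser passes tag names already lowercased).
HEADER_TAG = re.compile(r'h[1-6]')


class HeaderCollector(HTMLParser):
    """Collect the unique text of <h1>..<h6> elements, grouped by tag name."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before handle_data.
        super().__init__(convert_charrefs=True)
        self.headers = {}     # tag name -> list of unique header texts
        self._current = None  # header tag currently open, if any
        self._chunks = []     # text pieces seen inside the open header

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; nested tags inside a header are skipped.
        if self._current is None and HEADER_TAG.fullmatch(tag):
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse whitespace so a header split across lines becomes one line.
            text = ' '.join(''.join(self._chunks).split())
            bucket = self.headers.setdefault(tag, [])
            if text and text not in bucket:
                bucket.append(text)
            self._current = None


def collect_headers(html_text):
    """Return a dict mapping header tag names to lists of unique header texts."""
    parser = HeaderCollector()
    parser.feed(html_text)
    parser.close()
    return parser.headers


# Made-up sample: attributes, nested markup, a multi-line header, a duplicate.
sample = """
<h1 class="title">HTML <span class="color_h1">Head</span></h1>
<h2>The HTML &lt;head&gt;
    Element</h2>
<h2>The HTML &lt;head&gt; Element</h2>
<H3>Example</H3>
"""
print(collect_headers(sample))
# {'h1': ['HTML Head'], 'h2': ['The HTML <head> Element'], 'h3': ['Example']}
```

To try it on the downloaded page, `collect_headers(response)` should work with the `response` string from the code above.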
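One problem with what you have currently written is that the opening/closing header tags may be on different lines. Are we assuming the HTML is always valid? –
In my case, it doesn't matter whether the HTML is valid. – Cameron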