How do I omit the <h> tags from this code?

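This code takes a website and adds all of its header information to a list. How can I modify it so that, when the program prints, each item in the list appears on its own line and the header tags are removed?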

from urllib.request import urlopen 
address = "http://www.w3schools.com/html/html_head.asp" 
webPage = urlopen (address) 

encoding = "utf-8" 

list = [] 

for line in webPage: 
    findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>') 
    line = str(line, encoding) 
    for startHeader in findHeader:   
     endHeader = '</'+startHeader[1:] 
     if (startHeader in line) and (endHeader in line): 
      content = line.split(startHeader)[1].split(endHeader)[0] 
      list.append(line) 
      print (list) 

webPage.close()


2015-12-15 Cameron

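One problem with what you've written so far is that the opening and closing header tags may be on different lines. Are we assuming the HTML is always valid?

In my case, it doesn't matter whether the HTML is valid. – Cameron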

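If you don't mind using a third-party package, try BeautifulSoup to convert the HTML to plain text. Once you have your list, you can remove print (list) from the loop and do this instead: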

for e in list: 
    # .rstrip() to remove trailing '\r\n' 
    print(BeautifulSoup(e.rstrip(), "html.parser").text)

But don't forget to import BeautifulSoup first:

from bs4 import BeautifulSoup

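I'm assuming you have bs4 installed before running this example (pip3 install beautifulsoup4).

Alternatively, you can use regular expressions to strip the HTML tags. But that is likely to be more verbose and error-prone than using an HTML parser such as bs4.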
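For comparison, here is a minimal sketch of the regular-expression route. The sample lines and the strip_tags helper are made up for illustration, and the pattern is deliberately naive: it treats anything between '<' and the next '>' as a tag, so it can misfire on comments, scripts, or attribute values that contain '>'.

```python
import html
import re

# Hypothetical sample lines standing in for the entries collected in `list`.
lines = [
    '<h1>HTML <span class="color_h1">Head</span></h1>\r\n',
    '<h2>The HTML &lt;head&gt; Element</h2>\r\n',
]

# Naive pattern: any run of characters from '<' up to the next '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(fragment):
    """Remove anything that looks like a tag, then decode HTML entities."""
    text = TAG_RE.sub('', fragment)
    return html.unescape(text).strip()


stripped = [strip_tags(line) for line in lines]
for text in stripped:
    print(text)
```

Tags are removed before entities are decoded, so an escaped &lt;head&gt; in the text survives as a literal <head> instead of being stripped as if it were a tag.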


2015-12-15 17:25:36 vrs

Sorry, I don't understand what you're trying to do.

But, for example, you can easily collect all the unique headers in a dictionary:

from urllib.request import urlopen 
import re 

address = "http://www.w3schools.com/html/html_head.asp" 
webPage = urlopen(address) 

# get page content 
response = str(webPage.read(), encoding='utf-8') 

# leave only <h*> tags content 
p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL) 
headers = re.findall(p, response) 

# headers dict 
my_headers = {} 

for (tag, value) in headers: 
    if tag not in my_headers: 
        my_headers[tag] = [] 

    # remove all tags inside (re.sub returns a new string, so keep the result) 
    value = re.sub('<[^>]*>', '', value) 

    # replace few special chars 
    value = value.replace('&lt;', '<') 
    value = value.replace('&gt;', '>') 

    if value not in my_headers[tag]: 
        my_headers[tag].append(value) 

# output 
print(my_headers)

Output:

{'h2': ['The HTML <head> Element', 'Omitting <html> and <body>?', 'Omitting <head>', 'The HTML <title> Element', 'The HTML <style> Element', 'The HTML <link> Element', 'The HTML <meta> Element', 'The HTML <script> Element', 'The HTML <base> Element', 'HTML head Elements', 'Your Suggestion:', 'Thank You For Helping Us!'], 'h4': ['Top 10 Tutorials', 'Top 10 References', 'Top 10 Examples', 'Web Certificates'], 'h1': ['HTML Head'], 'h3': ['Example', 'W3SCHOOLS EXAMS', 'COLOR PICKER', 'SHARE THIS PAGE', 'LEARN MORE:', 'HTML/CSS', 'JavaScript', 'HTML Graphics', 'Server Side', 'Web Building', 'XML Tutorials', 'HTML', 'CSS', 'XML', 'Charsets']}


2015-12-15 17:42:40 mrDinkelman

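You asked for results without the header tags. You already have those values in the content variable, but you never add content to the result list. You add line instead, which is the entire original line.

Next, you asked for each item to be printed on a new line. To do that, first remove the print statement inside the loop; it prints the entire list every time a result is added. Then, at the bottom of the program, outside all the loops, add this new code: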

for item in list: 
    print(item)
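Putting both changes together, the loop might look like the sketch below. To keep it runnable without a network connection, it reads from a small made-up list of lines (sample_lines) instead of the page, and it names the result list results rather than list so the built-in isn't shadowed; both names are my own, not from your code.

```python
# Hypothetical sample input standing in for the decoded lines read from the page.
sample_lines = [
    '<h1>HTML Head</h1>\r\n',
    '<p>Not a header</p>\r\n',
    '<h2>The HTML &lt;title&gt; Element</h2>\r\n',
]

find_header = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
results = []

for line in sample_lines:
    for start_header in find_header:
        end_header = '</' + start_header[1:]
        if start_header in line and end_header in line:
            content = line.split(start_header)[1].split(end_header)[0]
            # Append the extracted text, not the whole original line.
            results.append(content)

# Print once, after all the loops have finished, one item per line.
for item in results:
    print(item)
```

Note that entities such as &lt; are left untouched here; this only addresses the two things you asked about.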

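However, your technique for identifying HTML headers is not very robust. It expects each pair of opening and closing tags to sit on a single line. It also expects that a line never holds more than one header of any given type. And it expects every opening tag to have a matching closing tag. You can't rely on any of these things, even in valid HTML.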
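To make the first two limitations concrete, here is a small sketch that wraps the same line-based matching in a function (headers_line_by_line is a name I made up) and feeds it two hand-written inputs: a header split across two lines, and two headers of the same type on one line.

```python
def headers_line_by_line(lines):
    """The line-based matching from the question, wrapped in a function."""
    tags = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    found = []
    for line in lines:
        for start in tags:
            end = '</' + start[1:]
            if start in line and end in line:
                found.append(line.split(start)[1].split(end)[0])
    return found


# A header whose opening and closing tags are on different lines.
split_header = ['<h2>Split\n', 'across lines</h2>\n']

# Two headers of the same type on a single line.
two_on_one_line = ['<h3>First</h3><h3>Second</h3>\n']

print(headers_line_by_line(split_header))     # the header is missed entirely
print(headers_line_by_line(two_on_one_line))  # only the first header is returned
```

Neither input is exotic, and both are valid HTML, which is why a real parser is the safer choice.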

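vrs's answer is on the right track in suggesting Beautiful Soup, but instead of using it only to remove the tags from the results, you can actually use it to find the results as well. Take a look at the following code: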

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

address = "http://www.w3schools.com/html/html_head.asp" 
webPage = urlopen(address) 

# The list of tag names we want to find 
# Just the names, not the angle brackets  
findHeader = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6') 

soup = BeautifulSoup(webPage, 'html.parser') 
headers = soup.find_all(findHeader) 
for header in headers: 
    print(header.get_text())

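The find_all method accepts a list of tag names and returns a Tag object for each result, in document order. We store that list in headers and print the text of each one. The get_text method returns only the text portion of a tag, omitting not only the surrounding header tags but also any embedded tags. (The page you're scraping has some embedded span tags, for example.)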
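If installing a third-party package isn't an option, roughly the same idea can be sketched with the standard library's html.parser module. This is not Beautiful Soup and is far less forgiving of messy markup; the HeaderTextParser class and the sample string are my own, written only to illustrate collecting header text while skipping embedded tags.

```python
from html.parser import HTMLParser


class HeaderTextParser(HTMLParser):
    """Collects the text inside h1..h6 elements, ignoring nested tags."""

    HEADER_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        # convert_charrefs defaults to True, so entities arrive already decoded.
        super().__init__()
        self.headers = []      # finished header texts, in document order
        self._open_tag = None  # name of the header currently being read
        self._parts = []       # text pieces of the current header

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and tag in self.HEADER_TAGS:
            self._open_tag = tag
            self._parts = []

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.headers.append(''.join(self._parts).strip())
            self._open_tag = None

    def handle_data(self, data):
        # Text inside nested tags (such as span) is still collected.
        if self._open_tag is not None:
            self._parts.append(data)


# Hypothetical sample markup: a nested span and a header that spans two lines.
sample = (
    '<h1>HTML <span class="color_h1">Head</span></h1>\n'
    '<p>Body text</p>\n'
    '<h2>The HTML &lt;title&gt;\nElement</h2>\n'
)

parser = HeaderTextParser()
parser.feed(sample)
parser.close()

for text in parser.headers:
    print(text)
```

Because the parser works on the whole document rather than line by line, the header that spans two lines is picked up, line break included.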


2015-12-15 18:24:34
