How do I parse an HTML file and get the text between tags using Python?

-1

Possible duplicate:
Parsing HTML in Python

I have searched the internet a lot for how to get the text between tags using Python. Can you explain how to do it?

2011-08-16 vigneshmoha

Erm, http://docs.python.org/library/htmlparser.html? –

Or http://www.crummy.com/software/BeautifulSoup/documentation.html or http://lxml.de/ – agf

Or http://stackoverflow.com/questions/6870446/whats-the-most-forgiving-html-parser-in-python or http://stackoverflow.com/questions/5120129/python-html-parsing or http://stackoverflow.com/questions/4895102/python-html-parsing or http://stackoverflow.com/questions/2505041/best-library-to-parse-html-with-python-3-and-example – agf

-1

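The htmlparser linked in the comments above is probably the more robust approach. However, if you have a simple case where you only need the content between specific tags, you can use regular expressions: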

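import re

# Double quotes on the outside so the single-quoted attribute values need no escaping
html = "<html><body><div id='blah-content'>Blah</div><div id='content-i-want'>good stuff</div></body></html>"
# Capture whatever sits between the target div's opening tag and the next </div>
m = re.match(r".*<div.*id='content-i-want'.*>(.*?)</div>", html)
if m:
    print(m.group(1))  # Should print 'good stuff'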


2011-08-16 15:22:26 arunkumar

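I disagree with using regular expressions to parse HTML. Your code only works for the simplest examples. If the div has any other attributes (such as a class), it will fail. If the text in the div contains a '>', it will fail. For anything beyond an unrealistically simple example, regular expressions are not enough. See also http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – murgatroid99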
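To make the '>' objection concrete, here is a small sketch that runs the answer's pattern against two inputs. The helper name `extract_with_regex` and the second test string are made up for illustration and do not come from the thread.

```python
import re

# Same pattern as in the answer, written with double quotes so that the
# single quotes inside it need no escaping.
PATTERN = r".*<div.*id='content-i-want'.*>(.*?)</div>"


def extract_with_regex(html):
    """Hypothetical helper: return the captured group, or None if no match."""
    m = re.match(PATTERN, html)
    return m.group(1) if m else None


# The simple case from the answer works.
print(repr(extract_with_regex("<div id='content-i-want'>good stuff</div>")))

# A '>' inside the text breaks it: the greedy '.*>' swallows everything up
# to that last '>', so only the tail of the text is captured.
print(repr(extract_with_regex("<div id='content-i-want'>1 > 0</div>")))
```

The first call returns the full text, while the second returns only the part after the embedded '>' rather than '1 > 0'.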

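Yes, an HTML parser library is the way to go. But there may be situations where you are reading data from a fixed HTML format, or where you have nothing but the built-in Python libraries. In that case the code above, which I have corrected, should work. And yes, it is not as robust as an HTML parser, which is why I said so in the first line of my answer. – arunkumar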
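For the "only the built-in Python libraries" case, the standard library's `html.parser` module (the Python 3 name of the htmlparser module linked in the first comment) can do the same job without a regex. The following is only a minimal sketch: the class name `DivTextExtractor`, the helper `text_of_div`, and the sample HTML are assumptions made for illustration, and it only tracks `<div>` nesting.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 = outside the target div, >0 = nesting level inside it
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1                      # nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1                       # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True                 # left the target div

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_of_div(html, target_id):
    """Hypothetical helper: return the stripped text of the matching div."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


html = '<html><body><div id="a" class="c1">1 > 0</div><div id="b">skip me</div></body></html>'
print(text_of_div(html, "a"))
```

Unlike the regex, this does not care about extra attributes such as `class` or about a '>' inside the text, because the parser tracks tags and text separately.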

Here is an example of parsing HTML with BeautifulSoup:

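from bs4 import BeautifulSoup  # BeautifulSoup 4; `from BeautifulSoup import ...` is the old Python 2-only version 3

soup = BeautifulSoup("""<html><body>
         <div id="a" class="c1">
          We want to get this
         </div>
         <div id="b">
          We don't want to get this
         </div></body></html>""", "html.parser")
# find() returns the first matching tag; strip() drops the surrounding whitespace
print(soup.find('div', id='a').text.strip())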

This outputs

We want to get this


2011-08-16 15:37:08 murgatroid99
