Processing an HTML file in Python

I don't know much about HTML... How do I pull just the text out of a page? For example, if the HTML page reads:

<meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers"> 
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>

I just want to extract this:

How can I make money at home online? No gimmicks please? - Yahoo! Answers

I am using this function, built on the re module:

import re

def striphtml(data):
    p = re.compile(r'<.*?>')  # matches anything that looks like a tag
    return p.sub(' ', data)

But it still doesn't do what I want it to do.

The function above is called like this:

for lines in filehandle.readlines(): 

     #k = str(section[6].strip()) 
     myFile.write(lines) 

     lines = striphtml(lines) 
     content.append(lines)


2012-01-09 Fraz

Possible duplicate of [Parsing HTML in Python](http://stackoverflow.com/questions/717541/parsing-html-in-python), [Processing HTML files with Python](http://stackoverflow.com/q/7694637) – Sathya 2012-01-09 02:45:43

Check this question: http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python – mgibsonbr 2012-01-09 02:47:15

Don't use regular expressions to parse HTML/XML. Try http://www.crummy.com/software/BeautifulSoup/ instead.

from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer imported from the version 3 module "BeautifulSoup"
soup = BeautifulSoup('Your resource<title>hi</title>', 'html.parser')
soup.title.string  # Your title string.
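To illustrate the warning about regular expressions, here is a small check of what the `striphtml` function from the question returns. It uses only the standard library. The shortened input strings and the `split_tag` sample are made up for illustration and are not taken from the real page.

```python
import re

def striphtml(data):
    # Same pattern as in the question: replace anything that looks like a tag.
    return re.sub(r'<.*?>', ' ', data)

meta_line = '<meta name="title" content="How can I make money at home online?">'
title_line = '<title>How can I make money at home online?</title>'

# The meta line is one single tag, so it collapses to a lone space.
print(repr(striphtml(meta_line)))

# The title text survives, but so would the text of every other element,
# so the regex cannot single out the title.
print(repr(striphtml(title_line)))

# '.' does not match a newline, so a tag split across lines is left in place.
split_tag = '<a\nhref="x">link</a>'
print(repr(striphtml(split_tag)))
```

The regex has no notion of which element a piece of text belongs to, which is why a parser is the safer tool when only the title is wanted.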


2012-01-09 02:47:46

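I usually use http://lxml.de/ for HTML parsing! It is very easy to use, and getting at tags is simple because you can query the document with XPath, which keeps things quick.

Here is an example from a script I wrote that reads an XML feed and counts the words: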

https://gist.github.com/1425228

You can also find more examples in the documentation: http://lxml.de/lxmlhtml.html
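The linked gist is not reproduced here. As a rough idea of what "read an XML feed and count the words" can look like, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, so it runs without extra packages, and ElementTree only supports a limited subset of XPath. The feed content, the `count_words` helper and the `.//item/title` path are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical feed, inlined so the example runs without network access.
FEED = """<rss><channel>
  <item><title>Parsing HTML in Python</title></item>
  <item><title>Processing HTML files with Python</title></item>
</channel></rss>"""

def count_words(xml_text, path=".//item/title"):
    """Count lower-cased words in the text of every element matching path."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.findall(path):
        # node.text is None for empty elements, hence the fallback.
        counts.update((node.text or "").lower().split())
    return counts

counts = count_words(FEED)
print(counts.most_common(3))
```

With lxml installed, the same idea should carry over to `lxml.etree`, whose `xpath()` method accepts full XPath 1.0 expressions.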


2012-01-09 02:56:31

Use an HTML parser for this. One option is BeautifulSoup.

To get the text content of the page:

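from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer imported from the version 3 module "BeautifulSoup"

soup = BeautifulSoup(your_html, 'html.parser')  # your_html: the page source as a string
text_nodes = soup.find_all(string=True)  # every text node in the document
result = ' '.join(text_nodes)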
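If installing BeautifulSoup is not an option, the same "collect every text node" idea can be sketched with the standard library's `html.parser`. This swaps in a different parser from the one the answer recommends. The `TextCollector` class, the `extract_text` helper and the sample `page` string are hypothetical names and data for illustration, loosely based on the snippet in the question.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = """<meta name="title" content="ignored attribute">
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>
<script>var x = 1;</script>"""
print(extract_text(page))
```

Attribute values such as the `content` of the `<meta>` tag are not text nodes, so they are left out, which matches what the question asks for.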


2012-01-09 02:58:21 soulcheck
