Parsing an HTML file with Python: a starting point

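I have HTML files in the following format and would like to parse them with Python. However, I know nothing about using the xml module. Your suggestions are very welcome.

Note: Sorry for my ignorance again. The question is not specific. However, since I am frustrated with this kind of parsing script, I did want a concrete answer, like the ones the answerers described (thank you all), as a starting point. I hope you understand.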

<html> 
    <head> 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <title>Weibo Landscape: Historical Archive of 800 Verified Accounts</title> 
    </head> 
    <body> 
<div><br> 
related 1-th-weibo:<br> 
mid:3365546399651413<br> 
score:-5.76427445942 <br> 
uid:1893278624 <br> 
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> <br> 
time:Thu Oct 06 17:10:59 +0800 2011 <br> 
content: Zuccotti Park。 <br> 
<br></div> 
<div><br> 
related 2-th-weibo:<br> 
mid:3366839418074456<br> 
score:-5.80535767804 <br> 
uid:1813080181 <br> 
link:<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> <br> 
time:Mon Oct 10 06:48:53 +0800 2011 <br> 
content:rt the tweet <br> 
rtMid:3366833975690765 <br> 
rtUid:1893801487 <br> 
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br> 
<br></div> 

    </body> 
    </html>

Possible duplicate:
Extracting text from HTML file using Python

2012-05-02 Frank Wang

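There are many questions about parsing HTML with Python. Please take a few minutes to search. In the question linked above, see the example that uses 'HTMLParser'.

Sure. I did search, and that is not what I want. I want the result to be more structured, rather than converted to plain text.

That was just one example. There are several Q&As about HTML parsing: http://stackoverflow.com/search?q=python%20html%20parse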
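
To illustrate what a structured result from 'HTMLParser' might look like, here is a minimal sketch for Python 3 that uses only the standard library `html.parser` module. The `RecordParser` class and the trimmed `SAMPLE` string are hypothetical names introduced for this example. It assumes each field sits on its own line in `key:value` form inside a `<div>`, as in the question.

```python
from html.parser import HTMLParser

# Trimmed copy of the second record from the question (assumed layout).
SAMPLE = """<div><br>
related 2-th-weibo:<br>
mid:3366839418074456<br>
link:<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> <br>
time:Mon Oct 10 06:48:53 +0800 2011 <br>
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br>
<br></div>"""


class RecordParser(HTMLParser):
    """Collect one dict per <div>, splitting each text line on its first colon."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished records, one per <div>
        self.current = None  # record being built, or None outside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.current = {}
        elif tag == "a" and self.current is not None:
            # Store the URL under "link" instead of the anchor text.
            self.current["link"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.current is not None:
            self.records.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is None:
            return
        key, sep, value = data.strip().partition(":")
        # Text without a colon (blank lines, the anchor text "source") is skipped.
        if sep:
            self.current[key] = value.strip()


parser = RecordParser()
parser.feed(SAMPLE)
parser.close()
for record in parser.records:
    print(record)
```

Each record comes out as a dictionary, so fields can be looked up by name (for example `record["mid"]`) instead of being searched for in flat text.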

I worked through this as an exercise. It should put you on the right track, if it is still useful.

# -*- coding: utf-8 -*-
# Written for Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` package).

from BeautifulSoup import BeautifulSoup


html = '''<html> 
    <head> 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <title>Weibo Landscape: Historical Archive of 800 Verified Accounts</title> 
    </head> 
    <body> 
<div><br> 
related 1-th-weibo:<br> 
mid:3365546399651413<br> 
score:-5.76427445942 <br> 
uid:1893278624 <br> 
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> <br> 
time:Thu Oct 06 17:10:59 +0800 2011 <br> 
content: Zuccotti Park。 <br> 
<br></div> 
<div><br> 
related 2-th-weibo:<br> 
mid:3366839418074456<br> 
score:-5.80535767804 <br> 
uid:1813080181 <br> 
link:<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> <br> 
time:Mon Oct 10 06:48:53 +0800 2011 <br> 
content:rt the tweet <br> 
rtMid:3366833975690765 <br> 
rtUid:1893801487 <br> 
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br> 
<br></div> 

    </body> 
    </html>''' 

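data = []
soup = BeautifulSoup(html)
divs = soup.findAll('div')
for div in divs:
    # Serialize the div, drop the <br /> tags, and split into one line per field.
    div_string = str(div)
    div_string = div_string.replace('<br />', '')
    div_list = div_string.split('\n')
    # Discard the first and last lines, which hold only the <div> tags.
    div_list = div_list[1:-1]
    record = []
    for item in div_list:
        # Split on the first colon only, so values may themselves contain colons.
        record.append(tuple(item.split(':', 1)))
    data.append(record)

for record in data:
    for field in record:
        print(field)
    print('--------------')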

With your sample data, you will get the output below. Further processing to massage it into whatever structure you want should be easy.

('related 1-th-weibo', '') 
('mid', '3365546399651413') 
('score', '-5.76427445942 ') 
('uid', '1893278624 ') 
('link', '<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> ') 
('time', 'Thu Oct 06 17:10:59 +0800 2011 ') 
('content', ' Zuccotti Park\xe3\x80\x82 ') 
-------------- 
('related 2-th-weibo', '') 
('mid', '3366839418074456') 
('score', '-5.80535767804 ') 
('uid', '1813080181 ') 
('link', '<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> ') 
('time', 'Mon Oct 10 06:48:53 +0800 2011 ') 
('content', 'rt the tweet ') 
('rtMid', '3366833975690765 ') 
('rtUid', '1893801487 ') 
('rtContent', '#ows#here is the content and the link http://t.cn/aFLBgr ')
--------------
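
As one possible form of that further processing, the sketch below turns a record of `(key, value)` tuples into a dictionary with typed values. It is written for Python 3 and uses only the standard library. The `to_dict` helper is a hypothetical name, and the code assumes every tuple has exactly two elements, as in the output above.

```python
import re
from datetime import datetime

# One record in the shape printed above: a list of (key, value) tuples.
record = [
    ('related 2-th-weibo', ''),
    ('mid', '3366839418074456'),
    ('score', '-5.80535767804 '),
    ('uid', '1813080181 '),
    ('link', '<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> '),
    ('time', 'Mon Oct 10 06:48:53 +0800 2011 '),
    ('content', 'rt the tweet '),
]


def to_dict(record):
    """Turn a list of (key, value) tuples into a dict with typed values."""
    result = {}
    for key, value in record:
        value = value.strip()
        if key.startswith('related'):
            continue  # heading line such as "related 2-th-weibo", no value
        if key == 'score':
            value = float(value)
        elif key == 'time':
            # Example input: "Mon Oct 10 06:48:53 +0800 2011"
            value = datetime.strptime(value, '%a %b %d %H:%M:%S %z %Y')
        elif key == 'link':
            # Keep only the URL from the serialized <a> tag.
            match = re.search(r'href="([^"]*)"', value)
            value = match.group(1) if match else value
        result[key] = value
    return result


weibo = to_dict(record)
print(weibo['score'], weibo['time'].isoformat(), weibo['link'])
```

The identifiers `mid` and `uid` are left as strings here; whether to convert them to integers depends on how they will be used.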

2012-05-02 21:39:33 gauden

This is a great answer. Thank you.

I suggest you take a look at the Python library BeautifulSoup. It can help you navigate and search HTML data.
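
To give a rough idea of what navigating and searching look like, here is a small sketch. It assumes Python 3 and the current third-party `beautifulsoup4` package (imported as `bs4`), which is not what the thread's 2012 code used. The trimmed `html` string and the `results` list are made up for this example.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Trimmed copy of the first record from the question.
html = """<body>
<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
uid:1893278624 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> <br>
<br></div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

results = []
# Searching: find_all returns every <div>, one per record.
for div in soup.find_all("div"):
    # Navigating: the first <a> inside the div carries the URL.
    anchor = div.find("a")
    url = anchor["href"] if anchor is not None else None
    # stripped_strings yields each text node with surrounding whitespace removed.
    fields = [text for text in div.stripped_strings if ":" in text]
    results.append((url, fields))

for url, fields in results:
    print(url, fields)
```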

2012-05-02 07:56:37 HAL
