2014-02-24 71 views

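Extracting nested data in BS4

I have the following HTML, and I want to extract the Runtime and Views values. I have been able to navigate to the main element with id="videoUser", but I don't know how to get the relevant text from there.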

vid_data = (soup('td', {'id':'videoUser'}))[0] 

<td id="videoUser"> 
<div class="item" style="padding-left: 0;"> 
<span>Added by</span> 
<a href="/user/glanceweb">glanceweb</a> 
<a class="hint" hint="Send private message" href="#" onclick="return openPm('glanceweb')" overicon="iconMailOver"> 
<div class="icon iconMail di" style="margin-bottom:-1px"></div> 
</a> 
<span class="hint" hint="2013-04-01 01:07:00 UTC">10 months ago</span> 
</div> 
<div class="item"><span>Runtime:</span> 02:39</div> 
<div class="item"><span>Views:</span> 284,397</div> 
</td> 

Does anyone know how to do this in BS4?

Answer


If you are looking for all of the text rendered by the HTML above, this should do it:

from bs4 import BeautifulSoup

# html is assumed to hold the markup shown in the question
soup = BeautifulSoup(html, 'html.parser')
# The three <div class="item"> blocks, in document order
items = soup.find_all('div', {'class': 'item'})
div = items[0]
spans = div.find_all('span')
# "Added by" + user name + relative date
user = str(spans[0].string) + ' ' + str(div.find_all('a')[0].string) + ' ' + str(spans[1].string)
runtime = items[1].get_text()
views = items[2].get_text()

user will then hold:

Added by glanceweb 10 months ago 

runtime will hold:

Runtime: 02:39 

and views will hold:

Views: 284,397
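
If only the values are needed (02:39 and 284,397, without the "Runtime:" and "Views:" labels), one possible approach is to start from the td the question already located and read the text node that follows each label span. The sketch below is an illustration, not part of the original answer: the helper name extract_fields and the HTML constant are hypothetical, the markup is a trimmed copy of the snippet in the question, and it assumes the bs4 package is installed and that every label span ends with a colon.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: real pages look like this).
HTML = """
<td id="videoUser">
<div class="item" style="padding-left: 0;">
<span>Added by</span>
<a href="/user/glanceweb">glanceweb</a>
<span class="hint" hint="2013-04-01 01:07:00 UTC">10 months ago</span>
</div>
<div class="item"><span>Runtime:</span> 02:39</div>
<div class="item"><span>Views:</span> 284,397</div>
</td>
"""


def extract_fields(html):
    """Hypothetical helper: map each 'Label:' span in td#videoUser to its value."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", id="videoUser")
    fields = {}
    for item in cell.find_all("div", class_="item"):
        label = item.find("span")
        if label is None:
            continue
        name = label.get_text(strip=True)
        if not name.endswith(":"):
            continue  # skips the "Added by" block, whose first span has no colon
        # The value is the bare text node that directly follows the label span.
        value = label.next_sibling
        fields[name.rstrip(":")] = str(value).strip() if value else ""
    return fields


fields = extract_fields(HTML)
runtime = fields["Runtime"]
views = int(fields["Views"].replace(",", ""))  # drop the thousands separator
print(runtime, views)
```

Scoping the search to the td keeps other div.item elements elsewhere on the page from shifting the results, which the index-based lookups above would be sensitive to.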