2014-01-15 33 views
..... 
<div class="day"><div class="content">Idag<span id='updatedby'>, by <b>Karl</b> (100)  </span></div></div> 

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 

<div class="day"><div class="content">2014-01-14<span id='updatedby'>, by<b>Person</b> (50)</span></div></div> 

""""**DO NOT PRINT THIS**"""" 
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text4 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
""""**DO NOT PRINT THIS**"""" 
.... 

From this HTML I want to extract all the text between the first div with class="day" and the next div with class="day", using BeautifulSoup in Python 3.3.

The output should be:

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /><div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 

My current code looks like this:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('text.html')) 
contain = [] 
contain = soup.find_all('div',{'class':'day'}) 
del contain[2::] 
print (contain) 

And the output I get with this code is:

[<div class="day"><div class="content">Idag<span id="updatedby">, by<b>Karl</b> (100)</span></div></div>, <div class="day"><div class="content">2014-01-14<span id="updatedby">, by <b>Person</b> (50)</span></div></div>] 

Answer

You can do it like this:

from bs4 import BeautifulSoup 

data = ''' 
<div class="day"><div class="content">Idag<span id='updatedby'>, by <b>Karl</b> (100)  </span></div></div> 
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon- Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text1 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text2 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text3 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
<div class="day"><div class="content">2014-01-14<span id='updatedby'>, by<b>Person</b> (50)</span></div></div> 
<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img src="img/ikon-Hemsida.gif" class="type" alt="Hemsida" /><div class="text"> Sample text4 </div></a><br /> <div class="sbar"><img src="img/comment.gif" class="comment" alt="Kommentarer" /> <a href="?p=komment&id=xxxxx">18 comments</a></div></div> 
''' 
soup = BeautifulSoup(data) 

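result = []
first_day = soup.find('div', {'class': 'day'})
# Walk the siblings that follow the first "day" div.
for tag in first_day.next_siblings:
    # Text nodes (such as newlines) have no .get(), so only real tags are tested.
    if hasattr(tag, 'get') and 'day' in tag.get('class', []):
        break  # reached the next "day" div
    result.append(tag)
for e in result:
    print(e)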

Result:

<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon- Hemsida.gif"/><div class="text"> Sample text1 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&amp;id=xxxxx">18 comments</a></div></div> 


<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-Hemsida.gif"/><div class="text"> Sample text2 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&amp;id=xxxxx">18 comments</a></div></div> 


<div class="link"><a href="out.php?id=XXXXXX" target="_blank"><img alt="Hemsida" class="type" src="img/ikon-Hemsida.gif"/><div class="text"> Sample text3 </div></a><br/> <div class="sbar"><img alt="Kommentarer" class="comment" src="img/comment.gif"/> <a href="?p=komment&amp;id=xxxxx">18 comments</a></div></div> 

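This code assumes you are dealing with a flat group of sibling nodes (no nesting). It starts at the first class="day" div, then walks through the following siblings and appends each one to the result list until it reaches the next class="day" div, at which point it breaks out of the loop.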
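If only the tags are wanted (without the whitespace text nodes that sit between them), the same sibling walk can be wrapped in a small function. The following is a minimal sketch, not part of the original answer: the helper name tags_between_days and the shortened sample markup are made up for illustration, and it assumes the built-in "html.parser" is an acceptable parser for the page.

```python
from bs4 import BeautifulSoup, Tag


def tags_between_days(html, marker_class="day"):
    """Return the tags between the first marker div and the next one.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    first = soup.find("div", class_=marker_class)
    if first is None:
        return []  # no marker div in the document
    collected = []
    for sibling in first.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace and other text nodes
        if marker_class in sibling.get("class", []):
            break  # reached the next marker div
        collected.append(sibling)
    return collected


# Shortened stand-in for the HTML in the question (assumed structure).
sample = """
<div class="day">Idag</div>
<div class="link"><div class="text"> Sample text1 </div></div>
<div class="link"><div class="text"> Sample text2 </div></div>
<div class="day">2014-01-14</div>
<div class="link"><div class="text"> Sample text4 </div></div>
"""

links = tags_between_days(sample)
texts = [link.find("div", class_="text").get_text(strip=True) for link in links]
print(texts)
```

Because the loop iterates over next_siblings instead of calling next_sibling by hand, it also ends cleanly when there is no second class="day" div.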

Exactly what I was looking for, thank you very much. – Someone

@Someone Glad I could help. If this solved your problem, please consider "accepting" this answer by clicking the checkmark below the upvote/downvote buttons. – senshin