2014-08-28 58 views

Parsing bulleted lists in the correct order with BeautifulSoup

I'm trying to parse a website whose structure looks very similar to this:

<div class="InternaTesto"> 
<p class="MarginTop0">Paragraph 1</p><br> 
<p>Paragraph 2</p><br> 
<p><strong>Paragraph 3</strong></p><br> 
<ul> 
    <li style="margin: 0px; text-indent: 0px;"><em>List item 1</em></li> 
    <li style="margin: 0px; text-indent: 0px;"><em>List item 2</em></li> 
    <li style="margin: 0px; text-indent: 0px;"><em>List item 3</em></li> 
    ... Some Other Items ... 
</ul> 
<p><strong>Paragraph 4</strong></p><br> 
<ul> 
    <li style="margin: 0px; text-indent: 0px;"><em>List item 1</em></li> 
    <li style="margin: 0px; text-indent: 0px;"><em>List item 2</em></li> 
    <li style="margin: 0px; text-indent: 0px;"><em>List item 3</em></li> 
    ... Some Other Items ... 
</ul> 
... Some Other paragraphs ... 
</div> 

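I'm trying to extract the list items and place them under the correct paragraph. Right now I can find the list items, but they don't come out in the correct order. Here is the code I'm using: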

textOfTheArticle=[] 

for p in rawArticleData.find('div', attrs={'class':'InternaTesto'}).find_all("p"): 
    textOfTheArticle.append(p.get_text()) 
    print(p.get_text() + "\n") 

Is there a way to create a sub-list for each paragraph, or a separate list of all the <li> items?

Answer


You can find all the paragraphs and, for each one, take its third next sibling (index 2 in the sibling list):

from bs4 import BeautifulSoup 

data = """ 
Your html here 
""" 

# Name the parser explicitly so the result does not depend on what is installed 
soup = BeautifulSoup(data, "html.parser") 
for p in soup.find('div', attrs={'class':'InternaTesto'}).find_all("p"): 
    # Siblings after each <p> are: <br>, a whitespace string, then the next tag 
    print(p.text, [li.text for li in list(p.next_siblings)[2].find_all('li')]) 

This prints:

Paragraph 1 [] 
Paragraph 2 [] 
Paragraph 3 ['List item 1', 'List item 2', 'List item 3'] 
Paragraph 4 ['List item 1', 'List item 2', 'List item 3'] 
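
To see why the third sibling is the one that matters, it can help to list what actually follows a paragraph in the parsed tree. The snippet below is only a sketch on a trimmed-down copy of the markup from the question (the `html` and `kinds` names are made up for illustration), and it assumes the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modelled on the question's markup (illustrative only)
html = """<div class="InternaTesto">
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li><em>List item 1</em></li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Tags have a name; plain strings (such as the newline after <br>) do not
kinds = [getattr(s, "name", None) or "text" for s in p.next_siblings]
print(kinds)
```

The <ul> sits at index 2 only because a <br> and a whitespace string come first, so the fixed index stops working as soon as the markup between a paragraph and its list changes.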

A more reliable approach is to iterate over each paragraph's next siblings until we hit the next paragraph tag:

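soup = BeautifulSoup(data, "html.parser") 
for p in soup.find('div', attrs={'class':'InternaTesto'}).find_all("p"): 
    print(p.text) 
    for sibling in p.next_siblings: 
        # Any <ul> found before the next <p> belongs to the current paragraph 
        if sibling.name == 'ul': 
            print([li.text for li in sibling.find_all('li')]) 
        # The next <p> starts a new section, so stop here 
        if sibling.name == 'p': 
            break 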
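
Since the question asks for sub-lists rather than printed output, the same sibling walk could collect the items into a data structure instead. The following is a minimal sketch, not a definitive implementation: the function name `group_items_by_paragraph` and the `sample` markup are hypothetical, and it assumes every <p> and <ul> is a direct child of the `InternaTesto` div, as in the question.

```python
from bs4 import BeautifulSoup


def group_items_by_paragraph(html):
    """Return a list of (paragraph_text, [list_item_texts]) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "InternaTesto"})
    grouped = []
    for p in container.find_all("p"):
        items = []
        for sibling in p.next_siblings:
            # Plain strings have no tag name, so fall back to None
            name = getattr(sibling, "name", None)
            if name == "p":
                break  # the next paragraph starts a new group
            if name == "ul":
                items.extend(li.get_text(strip=True)
                             for li in sibling.find_all("li"))
        grouped.append((p.get_text(strip=True), items))
    return grouped


# Hypothetical sample modelled on the question's markup
sample = """<div class="InternaTesto">
<p class="MarginTop0">Paragraph 1</p><br>
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li><em>List item 1</em></li>
    <li><em>List item 2</em></li>
</ul>
<p><strong>Paragraph 4</strong></p><br>
<ul>
    <li><em>List item 3</em></li>
</ul>
</div>"""

result = group_items_by_paragraph(sample)
for paragraph, items in result:
    print(paragraph, items)
```

A list of tuples keeps the paragraphs in document order and still works if two paragraphs happen to share the same text, which a dict keyed by paragraph text would not.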

Hope that helps.