Question... BeautifulSoup parsing

<h2 class="sectionTitle">BACKGROUND</h2> 
Mr. Paul J. Fribourg has bla bla</span> 
<div style="margin-top:8px;"> 
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a> 
</div>

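I want to extract the text "Mr. Paul ... bla bla". Some web pages have a <p> in front of "Mr. Paul", so I can use findNext('p'). However, some web pages do not have it, as in the example above.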

This is the code I have for when there is a <p>:

background = bs2.find(text=re.compile("BACKGROUND")) 
bb= background.findNext('p').contents

But how can I extract the information when there is no <p>?

Source

2011-08-27 Willy

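It is hard to tell from the example you have given us, but it looks to me like you can simply take the next node after the h2. In this example, Lewis Carroll has a <p> (paragraph) tag and your friend Paul has only the closing </span> tag: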

>>> from BeautifulSoup import BeautifulSoup 
>>> 
>>> html = ''' 
... <h2 class="sectionTitle">BACKGROUND</h2> 
... <p>Mr. Lewis Carroll has bla bla</p> 
... <div style="margin-top:8px;"> 
...  <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a> 
... </div> 
... <h2 class="sectionTitle">BACKGROUND</h2> 
... Mr. Paul J. Fribourg has bla bla</span> 
... <div style="margin-top:8px;"> 
...  <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a> 
... </div> 
... ''' 
>>> 
>>> soup = BeautifulSoup(html) 
>>> headings = soup.findAll('h2', text='BACKGROUND') 
>>> for section in headings: 
...  p = section.findNext('p') 
...  if p: 
...   print '> ', p.string 
...  else: 
...   print '> ', section.parent.next.next.strip() 
... 
> Mr. Lewis Carroll has bla bla 
> Mr. Paul J. Fribourg has bla bla

Following the comments:

>>> from BeautifulSoup import BeautifulSoup 
>>> from urllib2 import urlopen 
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP') 
>>> soup = BeautifulSoup(html) 
>>> headings = soup.findAll('h2', text='BACKGROUND') 
>>> for section in headings: 
...  paragraph = section.findNext('p') 
...  if paragraph and paragraph.string: 
...   print '> ', paragraph.string 
...  else: 
...   print '> ', section.parent.next.next.strip() 
... 
> Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
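
The sessions above depend on Python 2 and BeautifulSoup 3. As a rough, library-free sketch of the same idea (take whatever text comes first after a BACKGROUND heading, whether or not a <p> wraps it), the following uses only the standard library's html.parser on Python 3. The names BackgroundExtractor, extract_background and SAMPLE_HTML are made up for this illustration, and the sample markup is the one from the answer, not the real page.

```python
from html.parser import HTMLParser

# Sample markup copied from the answer: one section with <p>, one without.
SAMPLE_HTML = '''
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
 <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
 <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
'''


class BackgroundExtractor(HTMLParser):
    """Collect the first non-blank text after each BACKGROUND heading."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False      # True while inside an <h2> element
        self.waiting = False    # True after a BACKGROUND heading was seen
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return              # skip whitespace-only text between tags
        if self.in_h2:
            # Only a heading whose text is exactly BACKGROUND arms the extractor.
            self.waiting = (text == "BACKGROUND")
        elif self.waiting:
            # First real text after the heading, regardless of the enclosing tag.
            self.results.append(text)
            self.waiting = False


def extract_background(html):
    parser = BackgroundExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


for line in extract_background(SAMPLE_HTML):
    print(">", line)
```

Because the extractor never looks at the name of the tag around the text, the stray </span> in the second section does not matter; the trade-off is that it returns only the first text node, so a paragraph containing inline tags would be cut short.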

Source

2011-08-28 00:37:30 Johnsyweb

Thanks for the answer! Actually, there is no <p> before Mr. Paul.. so if I run your code, it shows "Read Full Background".... Would you mind letting me know how to solve this? – Willy

@Willy: My original answer was based on an edit to your question in which someone had apparently added '<p>' tags. I have edited my answer accordingly. – Johnsyweb

Oh, thank you so much! It works fine.. but it does not work on my original site.. :(( I want to cry.. – Willy

"Some web pages have a <p> in front of Mr. Paul, so I can use FindNext('p'). However, some web pages do not have it, as in the example above."

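You have not given us enough information to tell how your string can be identified. Some possibilities:

A fixed node structure, e.g. getChildren()[1].getChildren()[0].text
If it is always preceded by the magic string 'BACKGROUND', as in your code, then your approach of finding the next node looks good - just do not build in the assumption that the tag's name is 'p'
A regular expression (e.g. "(Mr.|Ms.) ...")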
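
The second and third options can be combined in a small sketch: anchor on the magic string BACKGROUND, allow an optional opening <p>, and capture the text up to the next tag. The pattern, BACKGROUND_RE and background_texts are illustrative names, not something from the question, and the pattern assumes markup shaped like the snippets in this thread. Regular expressions over HTML are fragile, so treat this as a fallback rather than a general parser.

```python
import re

# Heading text, closing </h2>, an optional opening <p ...>, then the text
# that runs up to the next "<". Assumed shape; adjust for the real page.
BACKGROUND_RE = re.compile(r"BACKGROUND\s*</h2>\s*(?:<p\b[^>]*>)?\s*([^<]+)")


def background_texts(html):
    """Return the stripped text found right after each BACKGROUND heading."""
    stripped = (match.strip() for match in BACKGROUND_RE.findall(html))
    return [text for text in stripped if text]   # drop whitespace-only hits


with_p = '<h2 class="sectionTitle">BACKGROUND</h2>\n<p>Mr. Lewis Carroll has bla bla</p>'
without_p = '<h2 class="sectionTitle">BACKGROUND</h2>\nMr. Paul J. Fribourg has bla bla</span>'

print(background_texts(with_p))
print(background_texts(without_p))
```

Both inputs give the sentence itself, because the <p> is optional in the pattern rather than required.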

Can you show us an example of the HTML for a page where there is no <p> in front of the name?

Source

2011-08-28 00:09:09 smci

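Thanks for the helpful comment! I think your second point is right.. the string BACKGROUND may be the magic string.. but I have been trying to work out how to extract the text that comes after that word.. I don't know.. it is not working.. – Willy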
