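It's hard to tell from the example you gave us, but it looks to me like you can
get the next node after the h2. In this example, Lewis Carroll has a paragraph
tag and your friend Paul only has a closing span tag: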
>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
... <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
... <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ', p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
> Mr. Lewis Carroll has bla bla
> Mr. Paul J. Fribourg has bla bla
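If you would rather not depend on BeautifulSoup, the same "take whatever text comes right after the heading" idea can be sketched with only the Python 3 standard library. This is a rough illustration rather than a drop-in replacement: the names BackgroundExtractor and extract_backgrounds are made up for this example, and it assumes the text you want is the first non-blank text following each matching h2.

```python
from html.parser import HTMLParser


class BackgroundExtractor(HTMLParser):
    """Collect the first non-blank text that follows each matching <h2>.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self, heading_text="BACKGROUND"):
        super().__init__()
        self.heading_text = heading_text
        self.in_h2 = False     # True while we are inside an <h2> element
        self.h2_text = []      # text pieces of the current <h2>
        self.waiting = False   # True after a matching heading was closed
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.h2_text = []

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
            # Start waiting for text only if the heading matched.
            self.waiting = "".join(self.h2_text).strip() == self.heading_text

    def handle_data(self, data):
        if self.in_h2:
            self.h2_text.append(data)
        elif self.waiting and data.strip():
            # First non-blank text after the heading, whether or not it is
            # wrapped in a <p>. Only this first chunk is kept.
            self.results.append(data.strip())
            self.waiting = False


def extract_backgrounds(html, heading_text="BACKGROUND"):
    parser = BackgroundExtractor(heading_text)
    parser.feed(html)
    parser.close()
    return parser.results
```

Called on the html string from the session above, extract_backgrounds(html) should return one entry per BACKGROUND heading. Note that it has the same weakness as the loop above: if nothing but the "Read Full Background" link follows a heading, that link text is what you will get.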
Following the comments:
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
> Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
You will, of course, want to check the copyright notice, et cetera...
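Thanks for the kind answer! Actually, there is no .. before Mr. Paul, so if I run your code it shows "Read Full Background"... Would you mind letting me know how to solve this problem? – Willy
@Willy: My original answer was based on what was apparently an edit to your question, in which someone had added '' tags. I have edited my answer accordingly. – Johnsyweb
Oh, thank you so much! It works very well.. but on my original site it doesn't work.. :(( I want to cry.. – Willy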