2011-10-06 40 views

Python script to scrape a web page using BeautifulSoup: I want to scrape the contents of the following page with BeautifulSoup.

<div data-referrer="pagelet_123" id="pagelet_123"> 
<div id="1" class="p1"> 
<div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection"> 
<div class="clearfix uiHeaderTop"> 
<div> 
<h4 class="uiHeaderTitle">info - 1</h4> 
</div></div></div><div class="phs"> 
<table class="uicontenttable"> 
<tbody> 
<tr> 
<th class="label">Other</th> 
<td class="data"><div id="ua94ty_3" class="uiCollapsedList uiCollapsedListHidden uiCollapsedListNoSeparate pagesListData"> 
<span class="visible"> 
<a href="http://abc.com/Federer">info-2</a>, 
<a href="http://abc.com/pages/Ian-Wright-Out-of-Bounds/117602014955747">info-3</a>, 
<a href="http://abc.com/JuniperNetworks">info-4</a>, 
<a href="http://abc.com/pages/Join-Diaspora/118635234836351">info-5</a> 
</span> 
</div> 
</td> 
<td class="rightCol"> 
</td> 
</tr> 
</tbody> 
</table> 
</div> 
</div> 
</div> 
<div data-referrer="pagelet_ent" id="pagelet_ent"> 
<div id="2" class="section2"> 
<div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection"> 
<div class="clearfix uiHeaderTop"> 
<div> 
<h4 class="uiHeaderTitle">info-6</h4> 
</div></div></div> 
<div class="phs"><table class="uiInfoTable mtm profileInfoTable"> 
<tbody> 
<tr> 
<th class="label">info - 7</th><td class="data"> 
<div class="mediaRowWrapper "> 
<ul class="uiList uiListHorizontal clearfix pbl mediaRow"> 
<li class="uiListItem uiListHorizontalItemBorder uiListHorizontalItem"> 
<a href="URL - 1"> 
<div class="mediaPortrait"> 
<div style="height: 75px; width: 75px;" class="fbProfileScalableThumb photo"> 
<img width="87.00090480941" style="margin: -6px 0 0 -6px;" title="Hans Zimmer" alt="" src="http://profile.ak.fbcdn.net/hprofile-ak-snc4/203614_7170054127_6578457_s.jpg" class="img"></div><div class="mediaPageName">info - 8</div></div></a></li><li class="pls uiListItem uiListHorizontalItemBorder uiListHorizontalItem"> 

<a href="URL - 2"> 
<div class="mediaPortrait"><div style="height: 75px; width: 75px;" class="fbProfileScalableThumb photo"><img width="87.00090480941" style="margin: -6px 0 0 -6px;" title="test" alt="" src="http://external.ak.fbcdn.net/safe_image.php?d=AQCVRllyopjA_z5F&amp;w=100&amp;h=300&amp;url=http%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F5%2F59%2F-2.jpg&amp;fallback=hub_music&amp;prefix=s" class="img"></div><div class="mediaPageName">test</div></div></a> 
</div> 
<div class="mediaPageName">info - 8 
</div> 
</div> 
</a> 

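This page contains multiple nested divs and tables. I need help using BeautifulSoup to parse out only info - 1, info-2 ... info-6, plus URL - 1 and URL - 2.

I read the BeautifulSoup documentation, but it did not help much. Please also suggest some BeautifulSoup reference material or books on parsing complex web pages.

Thanks for your help!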

sat

Answers


Doesn't their documentation suit your needs?

http://www.crummy.com/software/BeautifulSoup/documentation.html

It looks to me like you want something like this:

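from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(theXMLAsAString)
# text= matches text nodes; a bare regex as the first argument would match tag names instead
results = soup.findAll(text=re.compile(r'info\s*-\s*[1-6]'))
for r in results:
    # r.parent is the enclosing tag; get() returns None when that tag has no href
    myurl = r.parent.get('href')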

This code has not been tested, but it shows the general idea of how to use BeautifulSoup.
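
For what it's worth, here is a slightly fuller sketch of the same idea. It assumes the newer bs4 package (beautifulsoup4) rather than BeautifulSoup 3, and it runs against a trimmed-down stand-in for the markup in the question. SAMPLE_HTML, INFO_PATTERN and extract are names made up for illustration. It collects the "info - N" text nodes with a regex and reads the URLs from the href of the links inside the mediaRow list, since those links do not contain any "info - 1..6" text themselves.

```python
import re

from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

# Hypothetical, shortened version of the page from the question.
SAMPLE_HTML = """
<div id="pagelet_123">
  <h4 class="uiHeaderTitle">info - 1</h4>
  <span class="visible">
    <a href="http://abc.com/Federer">info-2</a>,
    <a href="http://abc.com/JuniperNetworks">info-4</a>
  </span>
</div>
<div id="pagelet_ent">
  <h4 class="uiHeaderTitle">info-6</h4>
  <ul class="mediaRow">
    <li><a href="URL - 1"><div class="mediaPageName">info - 8</div></a></li>
    <li><a href="URL - 2"><div class="mediaPageName">test</div></a></li>
  </ul>
</div>
"""

# Tolerates both "info - 1" and "info-2" spellings; only digits 1 to 6 match.
INFO_PATTERN = re.compile(r"info\s*-\s*[1-6]\b")


def extract(html):
    """Return (info_texts, media_urls) found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # string= filters text nodes (not tag names) by regex search.
    infos = [s.strip() for s in soup.find_all(string=INFO_PATTERN)]
    # Every <a> with an href inside a <ul> carrying the mediaRow class.
    urls = [
        a["href"]
        for ul in soup.find_all("ul", class_="mediaRow")
        for a in ul.find_all("a", href=True)
    ]
    return infos, urls


if __name__ == "__main__":
    infos, urls = extract(SAMPLE_HTML)
    print(infos)
    print(urls)
```

On the real page you would pass the downloaded HTML to extract instead of SAMPLE_HTML. The class names and the regex may need adjusting if the actual markup differs from the excerpt above.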
