2012-04-28 78 views
Beautiful Soup: extracting the a href from a Google search result

A Google search gives me the following HTML for the first result:

<h3 class="r"><a href="http://rads.stackoverflow.com/amzn/click/0470284889" class="l vst" onmousedown="return rwt(this,'','','','1','AFQjCNEv1W9YC2jcSKYdEo2kNqBMJ-Utmg','k89K9hF4cVNpxQYHtEKiUQ','0CCoQFjAA',null,event)"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3> 

I want to extract the link http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889 from this, but when I use Beautiful Soup to pull out the href with

soup.find("h3").find("a").get("href")

I get the following string instead:

/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A

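I know the link is in there, and I could get it by stripping the leading /url?q= and everything from the first & onward, but I would like to know if there is a cleaner solution.

Thanks!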

Answers

You can use a combination of urlparse.urlparse and urlparse.parse_qs, e.g.:

>>> import urlparse 
>>> url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe' 
>>> data = urlparse.parse_qs(
...  urlparse.urlparse(url).query 
...) 
>>> data 
{'ei': ['P2ycT6OoNuasiAL2ncV5'], 
'q': ['http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'], 
'sa': ['U'], 
'usg': ['AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'], 
'ved': ['0CBIQFjAA']} 
>>> data['q'][0] 
'http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889' 
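
The session above uses the Python 2 `urlparse` module. On Python 3 the same functions live in `urllib.parse`, so the approach might be sketched as below. The helper name `extract_target` is hypothetical (not part of any library), and the sketch assumes the real link always sits in the `q` query parameter, as it does in the href shown in the question.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the first value of the `q` query parameter, or None if absent.

    `extract_target` is an illustrative helper, not a library function.
    """
    # Split the href into components and keep only the query string.
    query = urlparse(href).query
    # parse_qs maps each parameter name to a list of its values.
    values = parse_qs(query).get("q")
    return values[0] if values else None


href = (
    "/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business"
    "/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA"
)
print(extract_target(href))
```

Returning None when there is no `q` parameter lets the caller fall back to the raw href for links that are not wrapped in a redirect.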

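Thanks, this is exactly what I was looking for! Just wondering, why does BeautifulSoup() parse the JavaScript into something different from what my web browser displays? Does this mean I have to use the html5lib parser to get the correct result? – ejang 2012-04-28 22:45:43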

@ejang: Sorry, but I don't know how BeautifulSoup does that :( You can post a new question if you like; it would be an interesting one :) – mouad 2012-04-28 22:57:22