2013-08-20 60 views
0

How do I extract the URL from this HTML tag? I am trying to get every URL whose tag has id='revSAR' from the HTML below, using a Python regular expression:

<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 136 customer reviews 
</a> 

I tried the code below, but it does not work (it prints no matches):

regex = b'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>' 
pattern=re.compile(regex) 
rev_url=re.findall(pattern,txt) 
print ('reviews url: ' + str(rev_url)) 
+0

An example of parsing out 'a' links with Beautiful Soup: https://groups.google.com/forum/?fromgroups#!topic/beautifulsoup/8TbctreqvSI – Paul

+0

Or http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Paul

Answers

0

You don't need to match the unnecessary parts such as id=... and href=...; try this:

regex = 'http://.*\'\s+'
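To see what this pattern actually returns, here is a minimal sketch. The shortened example.com URL is a hypothetical stand-in for the Amazon link in the question. Note that the match carries the closing quote and trailing whitespace, which then have to be stripped, and that the pattern never looks at the id attribute.

```python
import re

# Hypothetical, shortened stand-in for the tag in the question.
html = ("<a id='revSAR' href='http://example.com/product-reviews/B000EDKP8U' "
        "class='txtsmall noTextDecoration'>")

# Same pattern as above, written as a raw string.
matches = re.findall(r"http://.*'\s+", html)
print(matches)  # each match still ends with the closing quote and a space

# Strip the trailing whitespace and quote to get the bare URL.
cleaned = [m.rstrip().rstrip("'") for m in matches]
print(cleaned)
```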

+0

Since the Amazon product review page contains multiple URLs, I only want to extract the URL from the tag that starts with this ID: <a id='revSAR' –

1

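You could try something like

import re
# `input` holds the HTML text; the unpacking assumes it contains exactly one href
(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', input)
print(url)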

However, I would personally rather use an HTML parsing library such as BeautifulSoup for a task like this.
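As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup, so that it runs without installing anything. The class name RevSarLinkParser and the example.com URLs are made up for this example.

```python
from html.parser import HTMLParser


class RevSarLinkParser(HTMLParser):
    """Collect href values of <a> tags whose id is 'revSAR' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag != "a":
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == "revSAR" and attr_map.get("href"):
            self.urls.append(attr_map["href"])


# Hypothetical sample: one unrelated link and one link with the wanted id.
html = """
<a href='http://example.com/other'>Some other link</a>
<a id='revSAR' href='http://example.com/product-reviews/B000EDKP8U' class='txtsmall noTextDecoration'>
    See all 136 customer reviews
</a>
"""

parser = RevSarLinkParser()
parser.feed(html)
print(parser.urls)
```

The parser handles single quotes, double quotes and attribute order itself, which is the main reason to prefer it over a hand-written pattern.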

+0

Will BeautifulSoup work on Windows? How do I install and set it up under Python 3.3 and get it working? –

+0

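I am not on Windows, so I have never done it, but this post seems to have tips for installing BeautifulSoup on Windows: [How to install Beautiful Soup 4 with Python 2.7 on Windows](http://stackoverflow.com/questions/12228102/how-to-install-beautiful-soup-4-with-python-2-7-on-windows) –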

0

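First, why doesn't your regex work? In your HTML the attributes are enclosed in single quotes, whereas your regex expects double quotes. You only need to care about the href attribute. Try something like href=['"](.+?)['"] as the regex; it is even better if you use the ignore-case switch.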
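The quote mismatch can be checked with a short sketch like the one below. The example.com URL is a hypothetical, shortened stand-in for the Amazon link. Even with the quotes fixed, the (.+?)</a> part of the original pattern would not cross the line breaks around the link text unless re.DOTALL were set.

```python
import re

# Hypothetical, shortened version of the tag from the question (single quotes).
html = (
    "<a id='revSAR' href='http://example.com/product-reviews/B000EDKP8U' "
    "class='txtsmall noTextDecoration'>\n"
    "    See all 136 customer reviews\n"
    "</a>"
)

# The pattern from the question expects double quotes, so it finds nothing.
original = re.compile(
    r'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
)
original_matches = original.findall(html)
print(original_matches)

# Accept either quote style around href and ignore case.
relaxed = re.compile(r"""href=['"](.+?)['"]""", re.IGNORECASE)
relaxed_matches = relaxed.findall(html)
print(relaxed_matches)
```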

But then again, using a regex to parse HTML is a very bad decision. Please go through this

0

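Description

This regular expression will:

  • find anchor tags
  • require the anchor tag to have an id attribute with the value revSAR
  • capture the href attribute value, not including any surrounding quotes if they exist
  • capture the inner text, with surrounding white space trimmed
  • allow the attributes to appear in any order
  • allow attributes to have double quotes, single quotes or no quotes
  • avoid many of the edge cases that frequently trip up regular expressions when pattern matching HTML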

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>
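The expression above can be tried from Python roughly as follows. This is only a sketch: it assumes the pattern is written without any embedded spaces, that re.DOTALL is set so the inner text can span several lines, and it uses a hypothetical sample with shortened example.com URLs rather than the real page.

```python
import re

# The expression, split across raw string literals for readability.
PATTERN = re.compile(
    r"""<a(?=\s|>)"""
    r"""(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))"""
    r"""(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))"""
    r"""(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>""",
    re.DOTALL,  # assumption: lets the inner text cross line breaks
)

# Hypothetical sample: the first tag only mentions id="revSAR" inside a script
# attribute, so it should be skipped; the second tag really has the id.
html = """
<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; ' href='http://example.com/decoy'>
  You shouldn't find me
</a>
<a class='txtsmall noTextDecoration' id='revSAR' href='http://example.com/product-reviews/B000EDKP8U'>
    See all 136 customer reviews
</a>
"""

# Group 3 is the href value, group 4 is the trimmed inner text.
results = [(m.group(3), m.group(4)) for m in PATTERN.finditer(html)]
for url, text in results:
    print(url, "|", text)
```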


Example

Live demo

Sample text

Note that the first couple of anchor tags here contain some very difficult edge cases.

<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
  You shouldn't find me 
</a> 



<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 111 customer reviews 
</a> 


<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 136 customer reviews 
</a> 

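Matches

Group 0 gets the entire anchor tag
Group 1 gets the quote surrounding the id attribute value, which is used later to find the correct closing quote
Group 2 gets the quote surrounding the href attribute value, which is used later to find the correct closing quote
Group 3 gets the href attribute value, not including any quotes
Group 4 gets the inner text, not including any surrounding whitespace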

[0][0] = <a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 111 customer reviews 
</a> 
[0][1] = ' 
[0][2] = ' 
[0][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending 
[0][4] = See all 111 customer reviews 


[1][0] = <a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 136 customer reviews 
</a> 
[1][1] = ' 
[1][2] = ' 
[1][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending 
[1][4] = See all 136 customer reviews