How do I extract the URL from this HTML tag?

I am trying to get all URLs with id='revSAR' from the HTML tag below, using a Python regular expression:

<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 136 customer reviews 
</a>

I tried the code below, but it does not work (it prints no matches):

regex = b'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>' 
pattern=re.compile(regex) 
rev_url=re.findall(pattern,txt) 
print ('reviews url: ' + str(rev_url))


2013-08-20 Vijay Kumar

An example of parsing links with Beautiful Soup: https://groups.google.com/forum/?fromgroups#!topic/beautifulsoup/8TbctreqvSI – Paul

Or http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Paul

You don't need to match the unnecessary parts such as id=... and href=...; try this:

regex = 'http://.*\'\s+'
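A quick way to check what this pattern actually returns is to run it against the tag from the question. The sketch below assumes the tag is held in a string named `html` (a name chosen here for illustration). Note that the match ends with the closing quote and the whitespace after it, so both are stripped before use.

```python
import re

URL = ("http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/"
       "B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1"
       "&sortBy=byRankDescending")

# The anchor tag from the question, rebuilt as a Python string.
html = ("<a id='revSAR' href='" + URL + "' class='txtsmall noTextDecoration'> \n"
        "    See all 136 customer reviews \n</a>")

# Same pattern as above, written as a raw string.
match = re.search(r"http://.*'\s+", html)
raw = match.group(0)

# The raw match still carries the closing quote and trailing whitespace.
url = raw.rstrip().rstrip("'")
print(url)
```

This only works when the URL is the first `http://` on the line and the page holds a single link of interest; it does not look at the id attribute at all.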


2013-08-20 05:50:03 MrROY

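Since the Amazon product review page contains multiple URLs, I only want to extract the URL from the tag that starts with this id: <a id='revSAR'

You could try something like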

# findall returns (quote, url) tuples; this unpacking assumes exactly one href in the input
(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', input) 
print(url)

However, I would personally rather use an HTML parsing library such as BeautifulSoup for a task like this.
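To illustrate the parser-based approach without installing anything, here is a minimal sketch that uses the standard library's `html.parser` module in place of BeautifulSoup. The class name `RevSarLinkParser` and the second, decoy link are made up for this example; only the first tag comes from the question.

```python
from html.parser import HTMLParser

URL = ("http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/"
       "B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1"
       "&sortBy=byRankDescending")


class RevSarLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose id attribute is 'revSAR'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag != "a":
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == "revSAR" and attr_map.get("href"):
            self.urls.append(attr_map["href"])


html = (
    "<a id='revSAR' href='" + URL + "' class='txtsmall noTextDecoration'> \n"
    "    See all 136 customer reviews \n</a>\n"
    # Hypothetical decoy link, added only to show that other anchors are ignored.
    "<a id='other' href='http://www.example.com/'>Something else</a>"
)

parser = RevSarLinkParser()
parser.feed(html)
print(parser.urls)
```

Because the parser hands over attributes as name/value pairs, the quote style and the attribute order in the page no longer matter.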


2013-08-20 05:55:42

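Will BeautifulSoup work on Windows? How do I install and set it up under Python 3.3 and get it working?

I'm not on Windows, so I've never done it, but this post seems to have tips on installing BeautifulSoup on Windows: [How to install Beautiful Soup 4 with Python 2.7 on Windows](http://stackoverflow.com/questions/12228102/how-to-install-beautiful-soup-4-with-python-2-7-on-windows)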

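First, why doesn't your regex work? In your HTML the attribute values are enclosed in single quotes, whereas your regex expects double quotes. You only need to care about the href attribute. Try something like href=['"](.+?)['"] as the regex; it would be even better to use the ignore-case flag.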
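The quote mismatch is easy to confirm with a small experiment. This is only a sketch: it uses a `str` pattern and a `str` input named `txt` (the question's pattern was a bytes literal, and the type of `txt` is not shown there), and compares the original pattern with the one suggested above.

```python
import re

URL = ("http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/"
       "B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1"
       "&sortBy=byRankDescending")

txt = ("<a id='revSAR' href='" + URL + "' class='txtsmall noTextDecoration'> \n"
       "    See all 136 customer reviews \n</a>")

# Pattern from the question: it requires double quotes, which the HTML never uses.
original = re.compile(
    r'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
)
print(original.findall(txt))  # prints []

# Suggested pattern: accept either quote style around the href value.
suggested = re.compile(r"""href=['"](.+?)['"]""", re.IGNORECASE)
print(suggested.findall(txt))
```

Keep in mind that the suggested pattern returns every href in the input, so on a full page it does not by itself limit the result to the tag with id='revSAR'.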

That said, parsing HTML with regular expressions is a very bad decision. Please go through this.


2013-08-20 06:02:36 Jithin

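Description

This regular expression will:

find the anchor tags
require the anchor tag to have an id attribute with the value revSAR
capture the href attribute value, not including any surrounding quotes if they exist
capture the inner text and trim the white space
allow the attributes to appear in any order
allow attributes to have double quotes, single quotes, or no quotes
avoid many of the edge cases that frequently trip up regular expressions when pattern matching HTML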

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>


Example


Sample text

Note that the first couple of anchor tags here contain some very difficult edge cases.

<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
  You shouldn't find me 
</a> 



<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 111 customer reviews 
</a> 


<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 136 customer reviews 
</a>

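Matches

Group 0 gets the entire anchor tag
Group 1 gets the quote surrounding the id attribute value, which is used later to find the correct closing quote
Group 2 gets the quote surrounding the href attribute value, which is used later to find the correct closing quote
Group 3 gets the href attribute value, not including any quotes
Group 4 gets the inner text, not including any surrounding whitespace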

[0][0] = <a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 111 customer reviews 
</a> 
[0][1] = ' 
[0][2] = ' 
[0][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending 
[0][4] = See all 111 customer reviews 


[1][0] = <a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> 
    See all 136 customer reviews 
</a> 
[1][1] = ' 
[1][2] = ' 
[1][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending 
[1][4] = See all 136 customer reviews
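
For anyone who wants to reproduce these matches in Python, here is a sketch using the `re` module. The expression is the same one as above, only split into pieces for readability; the flags `re.IGNORECASE` and `re.DOTALL` are my assumption, since the flags used in the original demo are not shown. The names `SKIP`, `PATTERN`, `HOVER_1`, `HOVER_2` and `sample` are made up for this example.

```python
import re

URL = ("http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/"
       "B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1"
       "&sortBy=byRankDescending")

# Lazily steps over attribute names and whole quoted or unquoted attribute values.
SKIP = r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?"""

PATTERN = re.compile(
    "".join([
        r"<a(?=\s|>)",
        r"(?=", SKIP, r"""\sid=(['"]?)revSAR\1(?:\s|>))""",    # group 1: quote around id
        r"(?=", SKIP, r"""\shref=(['"]?)(.*?)\2(?:\s|>))""",   # groups 2 and 3: quote, href
        r"""(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>""",
        r"\s*(.*?)\s*<\/a>",                                   # group 4: inner text
    ]),
    re.IGNORECASE | re.DOTALL,  # assumed flags
)

# The two onmouseover values from the sample text (the first hides a fake id).
HOVER_1 = """' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; '"""
HOVER_2 = """' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; '"""
TAIL = " href='" + URL + "' class='txtsmall noTextDecoration'> \n"

sample = (
    "<a onmouseover=" + HOVER_1 + TAIL + "  You shouldn't find me \n</a> \n\n"
    "<a onmouseover=" + HOVER_2 + " id='revSAR'" + TAIL
    + "    See all 111 customer reviews \n</a> \n\n"
    "<a id='revSAR'" + TAIL + "    See all 136 customer reviews \n</a>"
)

matches = list(PATTERN.finditer(sample))
for m in matches:
    print(m.group(3))  # href value
    print(m.group(4))  # trimmed inner text
```

The first anchor is skipped because its only id="revSAR" sits inside the quoted onmouseover value, which the expression steps over as a single unit.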


2013-08-20 14:30:36
