Regex to extract all links and the corresponding link text

-1

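I'm brand new to regex, and I'm trying to solve the following two problems:

Write a regex that extracts all links and the corresponding link text from an HTML page. For example, if you were to parse: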
```
text1 <a href="http://example.com">hello, world</a> text2 
```

and get the result

http://example.com <tab> hello, world

Do the same thing, but also handle cases where <...> is nested:

text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3

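So far I'm still on the first problem, and I've tried several approaches. I think my best answer so far is the regex (?<=a href=\")(.*)(?=</a>), which gives me: http://example.com">hello, world

That seems pretty good to me, but I don't know how I should approach the second part. Any help or insight would be much appreciated.

Source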

2016-12-15 Zach Ellis

Regex does not handle nesting well. You should consider a real HTML parser.

http://stackoverflow.com/a/1732454/6779307

So how should I answer this question, then? Just say "plz don't parse HTML with regex"?

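If you use an HTML parser such as BeautifulSoup for this, it just comes down to locating the a element, using dictionary-like access for the href attribute, and calling get_text() to get the element's text: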

In [1]: from bs4 import BeautifulSoup 

In [2]: l = [ 
    """text1 <a href="http://example.com">hello, world</a> text2""", 
    """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3""" 
] 

In [3]: for s in l:
   ...:     soup = BeautifulSoup(s, "html.parser")
   ...:     link = soup.a
   ...:     print(link["href"] + "\t" + link.get_text())
   ...:
http://example.com hello, world 
http://example.com hello, world
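
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module, which also collects every link on the page rather than only the first one. This is a minimal sketch, not a full solution: the names LinkExtractor and extract_links are made up for illustration, and it assumes links are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return a list of (href, text) tuples."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3"""
for href, text in extract_links(sample):
    print(href + "\t" + text)
```

The parser treats the quoted onclick value as a single attribute, so the <b> inside it never shows up as a tag or as link text.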

Source

2016-12-15 20:32:29 alecxe

Since you mentioned regex:

import re

# line2 is triple-quoted because it contains both ' and " characters
line1 = 'text1 <a href="http://example.com">hello, world</a> text2'
line2 = """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3"""

# The greedy (.*) captures everything between href= and the last closing tag
link1 = re.search(r"<. href=(.*)</.>", line1)
print(link1.group(1))
link2 = re.search(r"<. href=(.*)</.>", line2)
print(link2.group(1))

Output

"http://example.com">hello, world
"http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world

Source

2016-12-15 20:43:56

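With regex, it is sometimes better to think about what you should not capture, rather than what you should, to get what you want. This Perl regex should reliably capture simple links and their associated text: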

#!perl 

use strict; 
use warnings; 

my $sample = q{text1 <a href="http://example.com">hello, world</a> text2}; 

my ($link, $link_text) = $sample =~ m{<a href="([^"]*)"[^>]*>(.*?)</a>}; 

print "$link \t $link_text\n"; 

1;

This will print:

http://example.com <tab> hello, world

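To break down what it's doing:

The first capture, ([^"]*), matches zero or more characters in the href attribute that are not double quotes. The square brackets list a set of characters, and the leading caret tells the regex to match any character that is not in that set.

Similarly, I use [^>]*> to find the closing bracket of the a tag without having to worry about any other attributes the tag might contain.

Finally, (.*?) is a non-greedy capture (indicated by the question mark) of zero or more characters, which captures all the text inside the link. Without the non-greedy indicator, it would match all the text up to the last closing </a> tag in the document.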
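
To see the greedy versus non-greedy difference in practice, here is a rough Python port of the same pattern run against a line that contains two links. The sample string and its URLs are invented for illustration.

```python
import re

# Hypothetical input with two links on one line, so the two quantifiers differ.
html = '<a href="http://example.com/1">first</a> and <a href="http://example.com/2">second</a>'

# Non-greedy (.*?) stops at the nearest </a>, giving one match per link.
non_greedy = re.findall(r'<a href="([^"]*)"[^>]*>(.*?)</a>', html)

# Greedy (.*) runs to the last </a>, swallowing the second link into the text.
greedy = re.findall(r'<a href="([^"]*)"[^>]*>(.*)</a>', html)

for url, text in non_greedy:
    print(url + "\t" + text)

print(len(non_greedy), "matches non-greedy,", len(greedy), "match greedy")
```

re.findall returns a list of (url, text) tuples because the pattern has two capture groups.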

Hope this helps you solve part 2 of your homework. :)

Source
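
One possible direction for part 2, sketched in Python: [^>]*> stops at the first >, which in the nested example is the one inside the onclick value. A variant that skips whole quoted attribute values avoids that. This is only a sketch under the same assumptions as the pattern above (href comes first and is double-quoted); the name LINK_RE is made up for illustration, and a real HTML parser remains the more robust choice.

```python
import re

# Between the href value and the tag's closing ">", consume either an ordinary
# character or an entire quoted attribute value, so a ">" inside quotes is skipped.
LINK_RE = re.compile(
    r'''<a\s+href="([^"]*)"(?:[^>"']|"[^"]*"|'[^']*')*>(.*?)</a>''',
    re.DOTALL,  # let link text span multiple lines
)

simple = 'text1 <a href="http://example.com">hello, world</a> text2'
nested = """text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3"""

for sample in (simple, nested):
    for url, text in LINK_RE.findall(sample):
        print(url + "\t" + text)
```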

2016-12-16 21:02:15 interduo
