Extracting innerHTML from multiple tags

I have a task to extract the inner HTML text from HTML links using Perl.

Here is an example:

<a href="www.stackoverflow.com">Regex Question</a>

I want to extract the string: Regex Question

Note that the inner text might be empty, like this. This example yields an empty string.

<a href="www.stackoverflow.com"></a>

The inner text might also be enclosed in multiple tags, as shown below.

<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>

I tried to write a Perl regular expression for this, but without success. In particular, I don't know how to deal with the multiple tags.

2014-10-27 zdd

Why use a regex instead of a parser? – hwnd 2014-10-27 03:42:18

Actually, what do you mean by "deal with them"? They will match if they are between the A-tags, right? Perl has some pretty good HTML parser modules available. – sln 2014-10-27 03:42:44

<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>

Try this. See the demo. Grab the capture group or the match.

http://regex101.com/r/sU3fA2/1
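
To show how the capture group would be read in Perl, here is a minimal sketch that applies the pattern above to the three inputs from the question. The variable names are illustrative only, and `{}` delimiters are used so that `/` needs no escaping.

```perl
use strict;
use warnings;

# The pattern from this answer, written with {} delimiters.
my $re = qr{<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*</a>};

# The three inputs from the question.
my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
);

# Capture group 1 holds the inner text; undef marks "no match".
my @texts;
for my $html (@samples) {
    push @texts, $html =~ $re ? $1 : undef;
}

# Brackets make the empty string visible in the output.
for my $text (@texts) {
    print defined $text ? "[$text]\n" : "no match\n";
}
```

This prints `[Regex Question]`, `[]` and `[Regex Question]`. Two limits worth keeping in mind: `<a[^>]*>` also accepts other tags whose names start with "a", such as `<abbr>`, and text interleaved with inner tags, such as `<a>foo <b>bar</b></a>`, does not match at all.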

2014-10-27 04:34:06 vks

Thanks, this works. – zdd 2014-10-27 06:16:06

@zdd You're welcome :) – vks 2014-10-27 06:16:52

It works, except that it matches the outer tags as well ' kbjhkb' – sln 2014-10-27 16:38:30

How about something like

(?<=>)[^<>\/]*(?=<\/)

It will match the string: Regex Question

Example: http://regex101.com/r/sG4bZ1/1

2014-10-27 04:01:28 nu11p01n73R

This one looks simple, but it doesn't match the empty string. – zdd 2014-10-27 06:26:31

@zdd Empty string? – nu11p01n73R 2014-10-27 07:03:01

@zdd Got it! – nu11p01n73R 2014-10-27 07:04:43

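You should use an HTML parser, but it can probably be done with a regex like this.
This finds open/close A-tag pairs that contain no nested A-tags, and
allows other tags within the content.
If you want the tag content without the other tags, it would be slightly different (not shown).

Since you're using Perl, this might work.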

# =~ /(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>/ 

(?s) 
<a       # Begin A-tag, must (should) contain attrib/val's 
(?> 
     \s+      # (?!\s) add this if you think malformed '<a >' could slip by 
     (?: " .*? " | ' .*? ' | [^>]*?)+ 
     > 
) 
(?<! />)      # Lookbehind, Insure this is not a closed A-tag '<a/>' 
(       # (1 start), Capture Content between open/close A-tags 
     (?:       # Cluster, match content 
      (?!       # Negative assertion 
       (?> 
        <a       # Not Start A-tag 
        (?> 
          \s+ 
          (?: " .*? " | ' .*? ' | [^>]*?)+ 
          > 
        ) 
        | </a \s* >      # and Not End A-tag 
       ) 
      ) 
      .        # Assert passed, consume a content character 
    )*       # End Cluster, do 0 to many times 
)        # (1 end) 
</a \s* >      # End A-tag
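
As a rough illustration of how this pattern could be used, the sketch below wraps the compact form in a helper that captures the content and then strips the remaining inner tags. The helper `link_text` and the tag-stripping substitution are assumptions added for illustration, not part of the pattern above, and the substitution is only a crude approximation of what a parser would do.

```perl
use strict;
use warnings;

# Compact form of the pattern above, with {} delimiters instead of slashes.
my $re = qr{(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|</a\s*>)).)*)</a\s*>};

# Hypothetical helper: returns the text of the first A-tag pair,
# or undef (in scalar context) when no pair is found.
sub link_text {
    my ($html) = @_;
    return unless $html =~ $re;
    my $text = $1;             # content between the A-tags, inner tags included
    $text =~ s/<[^>]*>//g;     # assumption: drop whatever inner tags remain
    return $text;
}

my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
);

for my $html (@samples) {
    my $text = link_text($html);
    print defined $text ? "[$text]\n" : "no match\n";
}
```

Because the pattern requires whitespace after `<a`, a bare `<a>` with no attributes is not matched, which is consistent with the "must (should) contain attrib/val's" comment above.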

2014-10-27 04:07:49 sln

Parsing HTML with regular expressions is a bad idea; you are not Chuck Norris. You can use the Mojo::DOM module, which will make your task very simple.

Sample:

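use strict;
use warnings;
use feature 'say';   # needed for say()

use Mojo::DOM;

# Parse
my $dom = Mojo::DOM->new('<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>');

# Find: all_text includes text from nested tags, while text covers only the element itself
say $dom->at('a')->all_text;
# find returns a collection, so map the method over every match
say $dom->find('a')->map('all_text')->join("\n");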

To install Mojo::DOM, just type the following command

$ cpan Mojo::DOM

2014-10-27 05:00:28

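Use an HTML parser to parse HTML.

If you need to download content from the web, I suggest taking a look at Mojo::DOM and Mojo::UserAgent.

The following pulls all links whose href contains stackoverflow.com and displays the text inside: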

use strict; 
use warnings; 

use Mojo::DOM; 
use Data::Dump; 

my $dom = Mojo::DOM->new(do {local $/; <DATA>}); 

for my $link ($dom->find('a[href*="stackoverflow.com"]')->each) { 
    dd $link->all_text; 
} 

__DATA__ 
<html> 
<body> 
<a href="www.stackoverflow.com">Regex Question</a> 
I want to extract the string: Regex Question 

<a href="www.notme.com">Don't want this link</a> 
Note that, the inner text might be empty like this. This example get an empty string. 

<a href="www.stackoverflow.com"></a> 
and the inner text might be enclosed with multiple tags like this. 

<a href="www.stackoverflow.com"><b><h2>Regex Question with tags</h2></b></a> 
</body> 
</html>

Output:

"Regex Question" 
"" 
"Regex Question with tags"

For a helpful 8-minute introductory video, see Mojocast Episode 5.

2014-10-27 05:01:50 Miller
