How to extract URLs and text from HTML markup using regular expressions

<!-- This Div repeated in HTML with different properties value --> 

<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36"> 

<!-- Only Unique Thing is This in few pages --> 
<a href="http://link.domain.com/?id=123" target="_parent"> 

<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension --> 
<a href="http://domain.com/song-title.mp3" target="_parent"> 

    <!-- This Div also repeated multiple in HTML --> 

    <FONT style="font-size:10pt" color=#000000 face="Tahoma"> 
     <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV> 
    </FONT> 
</a> 

</DIV>

We have very dirty HTML markup that was generated by some program or application. We want to extract the URLs and the link text from this code.

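The href attributes use two types of URL. URL 1 pattern: "http://link.domain.com/?id=123". URL 2 pattern: "http://domain.com/song-title.mp3".

The first URL follows a recognizable pattern, but the second has no specific pattern other than ending with the '.mp3' extension.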

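Is there a function that can extract the URLs and the text from this markup based on these patterns?

Note: without using the DOM, is there any way to match an href and the text between the tags with a regular expression, for example with preg_match?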
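
For the regex-only route the question asks about, a minimal sketch might look like the following. It assumes every href value is double-quoted and that links are never nested; the sample markup and the names `$pattern` and `$links` are illustrative, not taken from the real pages. A pattern like this is brittle on messy markup, so it is best treated as a fallback to a DOM parser.

```php
<?php
// Hypothetical sample modeled on the markup in the question.
$html = <<<'HTML'
<div style="position:absolute; left:220px; top:785px">
<a href="http://domain.com/song-title.mp3" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
     <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>
</DIV>
HTML;

// Group 1 captures the href value, group 2 everything up to the closing </a>.
// Flags: i = case-insensitive tag names, s = let "." match newlines.
$pattern = '~<a\s[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>~is';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Drop the inner tags and collapse runs of whitespace in the link text.
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($m[2])));
        $links[] = ['href' => $m[1], 'text' => $text];
    }
}

print_r($links);
```

Filtering for the two URL types can then be done on `$links[...]['href']` with ordinary string functions, which keeps the pattern itself simple.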


2014-02-08 Ahmed iqbal

There is no magic function that does all the work for you. You will have to write the code you want. Use a DOM parser (such as DOMDocument) for this task.

Use the DOMDocument class, like this:

$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('href'); // the link URL
    echo $tag->nodeValue; // the text content between the tags
}


2014-02-08 11:04:35

Just tried this and it works well. Though you may want to change this line to: echo $tag->getAttribute('href'); – Grant

@Grant, that's right! Edited.

Expanding on @Shankar Damodaran's answer:

$html = file_get_contents('source.htm'); // load the page to scan

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    // keep only links whose href contains "?id="
    if (strstr($tag->getAttribute('href'), '?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }
}

Then do the same for the MP3 links:

$html = file_get_contents('source.htm'); // load the page to scan

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    // keep only links whose href contains ".mp3"
    if (strstr($tag->getAttribute('href'), '.mp3') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }
}
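
The two loops above could also be folded into a single pass. The sketch below is one possible variant, not the answerer's code: it assumes the MP3 links should match only when the URL path really ends in `.mp3` (a plain substring test would also accept something like `/x.mp3.html`), and it keeps the link text alongside each URL. The sample markup and the names `$idLinks` and `$mp3Links` are hypothetical.

```php
<?php
// Hypothetical sample: one link of each kind plus one that should be ignored.
$html = '<div>'
      . '<a href="http://link.domain.com/?id=123" target="_parent"><b>First</b> - Artist</a>'
      . '<a href="http://domain.com/song-title.mp3" target="_parent"><b>Harjaiyaan</b> - Nandini Srikar</a>'
      . '<a href="http://domain.com/about.html">About</a>'
      . '</div>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$idLinks  = [];
$mp3Links = [];

foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');
    $text = trim($tag->nodeValue);

    // Split the URL so the path and the query string can be checked separately.
    $path  = (string) parse_url($href, PHP_URL_PATH);
    $query = (string) parse_url($href, PHP_URL_QUERY);
    parse_str($query, $params);

    if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'mp3') {
        // The path ends with the .mp3 extension.
        $mp3Links[] = ['href' => $href, 'text' => $text];
    } elseif (isset($params['id'])) {
        // The query string carries an "id" parameter.
        $idLinks[] = ['href' => $href, 'text' => $text];
    }
}

print_r($idLinks);
print_r($mp3Links);
```

Parsing the document once and sorting links into two arrays avoids reading and parsing `source.htm` twice.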


2014-02-08 11:38:28 Grant

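Thanks, but it shows warnings such as "Warning: DOMDocument::loadHTML(): Unexpected end tag : td", "Notice: DOMDocument::loadHTML(): Namespace prefix fb" and "Warning: DOMDocument::loadHTML(): Tag fb:comments".

You need to load the $html file contents correctly. – Grant

I think it is because of errors in the HTML markup?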
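
Warnings like these are what DOMDocument::loadHTML() emits when libxml recovers from malformed markup; the document tree is usually still built. One common way to keep them out of the output is libxml_use_internal_errors(), sketched below. The sample markup is hypothetical and only imitates the kind of errors quoted in the comment above.

```php
<?php
// Hypothetical malformed input: a stray </td> and an unknown namespaced tag.
$html = '<div><a href="http://domain.com/song-title.mp3">Song</a></td>'
      . '<fb:comments></fb:comments></div>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Keep the collected errors for inspection, then empty libxml's error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error-handling setting was active before.
libxml_use_internal_errors($previous);

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $tag) {
    $hrefs[] = $tag->getAttribute('href');
}

foreach ($errors as $error) {
    // Each entry is a LibXMLError object; "message" holds the parser's description.
    echo trim($error->message), "\n";
}
print_r($hrefs);
```

Which messages end up in `$errors` depends on the libxml version installed, so the sketch prints them rather than relying on a specific list.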
