2010-10-23 143 views
12

I need your help. How can I replace plain-text URLs with links while excluding URLs that appear inside HTML tags?

I want to turn this:

sometext sometext http://www.somedomain.com/index.html sometext sometext 

into:

sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext 

I have managed it with this regular expression:

preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text); 

The problem is that it also replaces img URLs, for example:

sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext 

becomes:

sometext sometext <img src="<a href="http//domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext 

Please help.

+0

Possible duplicate of [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege) – 2011-07-09 20:54:43

+0

[RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – 2011-09-15 14:15:12

Answers

3

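You should not do this with regular expressions, at least not with regular expressions alone. Use a proper HTML DOM parser instead, such as PHP's DOM library. You can then iterate over the nodes, check whether each one is a text node, run the regular expression search on it, and replace the text node accordingly.

Something like this should do it: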

$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i"; 
$doc = new DOMDocument(); 
$doc->loadHTML($str); 
// for every element in the document 
foreach ($doc->getElementsByTagName('*') as $elem) { 
    // for every child node in each element 
    foreach ($elem->childNodes as $node) { 
     if ($node->nodeType === XML_TEXT_NODE) { 
      // split the text content to get an array of 1+2*n elements for n URLs in it 
      $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE); 
      $n = count($parts); 
      if ($n > 1) { 
       $parentNode = $node->parentNode; 
       // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node 
       for ($i=1; $i<$n; $i+=2) { 
        $a = $doc->createElement('a'); 
        $a->setAttribute('href', $parts[$i]); 
        $a->setAttribute('target', '_blank'); 
        $a->appendChild($doc->createTextNode($parts[$i])); 
        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node); 
        $parentNode->insertBefore($a, $node); 
       } 
       // insert the last part before the original DOMText node 
       $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node); 
       // remove the original DOMText node 
       $node->parentNode->removeChild($node); 
      } 
     } 
    } 
} 

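Okay, since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists. You therefore cannot use foreach, because it would also iterate over the newly added nodes. Instead, you need to use for loops and keep track of the added elements, so that the index pointers and, ideally, the pre-calculated array boundaries are increased accordingly.

But since that is quite difficult in such a somewhat complex algorithm (you would need an index pointer and an array boundary for each of the three for loops), using a recursive algorithm is more convenient: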

function mapOntoTextNodes(DOMNode $node, $callback) { 
    if ($node->nodeType === XML_TEXT_NODE) { 
     return $callback($node); 
    } 
    for ($i=0, $n=$node->childNodes->length; $i<$n; ++$i) { 
     $nodesChanged = 0; 
     switch ($node->childNodes->item($i)->nodeType) { 
      case XML_ELEMENT_NODE: 
       $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback); 
       break; 
      case XML_TEXT_NODE: 
       $nodesChanged = $callback($node->childNodes->item($i)); 
       break; 
     } 
     if ($nodesChanged !== 0) { 
      $n += $nodesChanged; 
      $i += $nodesChanged; 
     } 
    } 
} 
function foo(DOMText $node) { 
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i"; 
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE); 
    $n = count($parts); 
    if ($n > 1) { 
     $parentNode = $node->parentNode; 
     $doc = $node->ownerDocument; 
     for ($i=1; $i<$n; $i+=2) { 
      $a = $doc->createElement('a'); 
      $a->setAttribute('href', $parts[$i]); 
      $a->setAttribute('target', '_blank'); 
      $a->appendChild($doc->createTextNode($parts[$i])); 
      $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node); 
      $parentNode->insertBefore($a, $node); 
     } 
     $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node); 
     $parentNode->removeChild($node); 
    } 
    return $n-1; 
} 

$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http//domain.com/image.jpg"> sometext sometext</div>'; 
$doc = new DOMDocument(); 
$doc->loadHTML($str); 
$elems = $doc->getElementsByTagName('body'); 
mapOntoTextNodes($elems->item(0), 'foo'); 

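Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case only the BODY node).

The function foo is then used to find and replace the plain URLs in a DOMText node's content. It splits the content string into non-URL/URL parts using preg_split while capturing the delimiters, which results in an array of 1 + 2·n items. The non-URL parts are then replaced by new DOMText nodes and the URL parts by new A elements, all of which are inserted before the original DOMText node; the original node is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call that function on one specific DOMNode.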
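The liveness of DOMNodeList that rules out foreach in the first attempt can be observed in isolation. The following is a minimal sketch and not part of the original answer: the markup and variable names are made up for illustration, and iterator_to_array is shown only as one possible way to take a static snapshot of a live list.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div>one</div>');
$div = $doc->getElementsByTagName('div')->item(0);

// childNodes is a live DOMNodeList: it is not a copy of the children
$children = $div->childNodes;
$before = $children->length;

// insert a new text node in front of the existing one
$div->insertBefore($doc->createTextNode('zero '), $div->firstChild);

// the same list object now reports the new node as well
$after = $children->length;

// a plain array copy does not change when the DOM changes later
$snapshot = iterator_to_array($children);
$div->appendChild($doc->createTextNode(' two'));

printf("%d %d %d %d\n", $before, $after, count($snapshot), $children->length);
```

Iterating over such a snapshot with foreach would avoid visiting freshly inserted nodes, at the cost of an extra copy per node list.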

+0

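Thanks for the answer, but I need to use a regular expression because it is lighter and faster to evaluate than using several functions – Andri 2010-10-23 09:53:10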

+6

@Andri: But using regular expressions may give unexpected results, because HTML is not a regular language. – Gumbo 2010-10-23 10:10:28

1

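Thanks for the reply, but it still does not work. I have fixed it using this function:

function livelinked ($text){ 
     // the "e" modifier has no meaning for preg_match_all, so it is dropped here 
     preg_match_all("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)|^(jpg)#i", $text, $ccs); 
     // initialise both arrays so str_replace also works when nothing matched 
     $old = array(); 
     $new = array(); 
     foreach ($ccs[3] as $cc) { 
      // strpos() returns 0 for a hit at the start, so compare strictly against false 
      if (strpos($cc,"jpg")===false && strpos($cc,"gif")===false && strpos($cc,"png")===false) { 
       $old[] = "http://".$cc; 
       $new[] = '<a href="http://'.$cc.'" target="_blank">'.$cc.'</a>'; 
      } 
     } 
     return str_replace($old,$new,$text); 
} 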
6

A streamlined version of Gumbo's answer above:

$html = <<< HTML 
<html> 
<body> 
<p> 
    This is a text with a <a href="http://example.com/1">link</a> 
    and another <a href="http://example.com/2">http://example.com/2</a> 
    and also another http://example.com with the latter being the 
    only one that should be replaced. There is also images in this 
    text, like <img src="http://example.com/foo"/> but these should 
    not be replaced either. In fact, only URLs in text that is no 
    a descendant of an anchor element should be converted to a link. 
</p> 
</body> 
</html> 
HTML; 

Let's use an XPath that fetches only those text nodes that actually contain http://, https:// or ftp:// and that are not themselves text nodes of an anchor element.

$dom = new DOMDocument; 
$dom->loadHTML($html); 
$xPath = new DOMXPath($dom); 
$texts = $xPath->query(
    '/html/body//text()[ 
     not(ancestor::a) and (
     contains(.,"http://") or 
     contains(.,"https://") or 
     contains(.,"ftp://"))]' 
); 

The XPath above will give us a text node with the following data:

and also another http://example.com with the latter being the 
    only one that should be replaced. There is also images in this 
    text, like 

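As of PHP 5.3 we could also use PHP inside the XPath and select our nodes with a regular expression instead of the three calls to contains.

Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and simply replace the whole text node with the fragment. Non-standard in this case only means that the method we will be using for this is not part of the W3C specification of the DOM API.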
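A minimal sketch of that variant, assuming a shortened sample document and a simplified scheme-only pattern that are not taken from the answer. It registers preg_match for use in XPath and compares the result with 0, so that the numeric return value is not read as a position test:

```php
<?php
$html = '<html><body><p>see http://example.com and '
      . '<a href="http://example.com/2">http://example.com/2</a>'
      . '<img src="http://example.com/foo"/> plain</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// make PHP functions callable from XPath; only preg_match is allowed here
$xPath->registerNamespace('php', 'http://php.net/xpath');
$xPath->registerPhpFunctions('preg_match');

// text nodes outside of anchors whose string value contains a URL scheme
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", string(.)) > 0]'
);

foreach ($texts as $text) {
    echo trim($text->data), "\n";
}
```

Whether this is preferable to the three contains calls is a matter of taste; the callback adds a round trip into PHP for every candidate text node.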

foreach ($texts as $text) { 
    $fragment = $dom->createDocumentFragment(); 
    $fragment->appendXML(
     preg_replace(
      "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i", 
      '<a href="$1">$1</a>', 
      $text->data 
     ) 
    ); 
    $text->parentNode->replaceChild($fragment, $text); 
} 
echo $dom->saveXML($dom->documentElement); 

which will then output:

<html><body> 
<p> 
    This is a text with a <a href="http://example.com/1">link</a> 
    and another <a href="http://example.com/2">http://example.com/2</a> 
    and also another <a href="http://example.com">http://example.com</a> with the latter being the 
    only one that should be replaced. There is also images in this 
    text, like <img src="http://example.com/foo"/> but these should 
    not be replaced either. In fact, only URLs in text that is no 
    a descendant of an anchor element should be converted to a link. 
</p> 
</body></html> 
0

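If you want to stick with a regular expression (and in this case a regular expression is quite appropriate), you can have the regex match only URLs that "stand alone". Using the word boundary escape sequence (\b), you can have the regex match only where http is immediately preceded by whitespace or the start of the text: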

preg_replace("#\b((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text); 
      // ^^ thar she blows 

So "http://..." will not match, but http:// as a word of its own will.
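The call above relies on the e pattern modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0. A minimal sketch of a comparable call for current PHP follows. The function name linkifyPlainUrls is hypothetical, and the pattern is a simplified rewrite (a lookahead with a character class) rather than the exact expression from the answer. It only modernises the call and does not by itself keep URLs inside attributes from being matched.

```php
<?php
// Hypothetical helper: wraps plain URLs in anchor elements without the "e" modifier.
function linkifyPlainUrls(string $text): string
{
    // URL up to (but not including) a delimiter, end of text, or ". "
    $pattern = '~\b((?:https?|ftp)://\S*?\.\S*?)(?=[\s;)\]\[{},"\':<]|$|\.\s)~i';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // escape the URL before placing it into markup
            $url = htmlspecialchars($m[1], ENT_QUOTES);
            return '<a href="' . $url . '" target="_blank">' . $url . '</a>';
        },
        $text
    );
}

echo linkifyPlainUrls('see http://www.somedomain.com/index.html now'), "\n";
```

Because the delimiter is only looked at and not consumed, there is no need to re-append it as the original replacement string does with $4.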

+1

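It also will not match any URLs at the end of a sentence, for example those followed by a full stop, or those in enumerations separated by commas, and so on. Needless to say, HTML attributes do not even require quotes. – Gordon 2010-10-28 11:54:28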

+1

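The description of the word boundary is also incorrect. As used here, '\b' only asserts that 'http', 'https' or 'ftp' is not immediately preceded by a letter, digit or underscore. It **will** match before the 'h' in '"http' or '=http', so it does not prevent matches inside attribute values, as you seem to claim. – 2011-01-30 18:41:54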
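The behaviour described in this comment can be checked with a few calls. This is a small sketch with made-up sample strings and a reduced pattern that keeps only the word boundary and the scheme:

```php
<?php
$pattern = '~\b(?:https?|ftp)://~';

// a quote or an equals sign is a non-word character, so \b matches right after it
$quoted   = preg_match($pattern, '<img src="http://domain.com/image.jpg">');
$unquoted = preg_match($pattern, '<img src=http://domain.com/image.jpg>');

// only a preceding letter, digit or underscore prevents the match
$glued    = preg_match($pattern, 'xhttp://domain.com/image.jpg');

var_dump($quoted, $unquoted, $glued);
```

The first two calls report a match and the third does not, which is why \b alone cannot tell an attribute value from free text.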

0

DOMDocument is more mature and runs faster, so this is only an alternative in case someone wants to use PHP Simple HTML DOM Parser:

<?php 
require_once('simple_html_dom.php'); 

$html = str_get_html('sometext sometext http://www.somedomain.com/index.html sometext sometext 
<a href="http://www.somedomain.com/index.html">http://www.somedomain.com/index.html</a> 
sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext'); 

foreach ($html->find('text') as $element) 
{ 
    // you can add any tag into the array to exclude from replace 
    if (!in_array($element->parent()->tag, array('a'))) 
     $element->innertext = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $element->innertext); 
} 

echo $html; 
+1

Suggested third-party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of string parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org) – Gordon 2010-11-16 10:57:42

+2

@Gordon: edited to indicate that DomDocument is the better approach... – 2010-11-19 23:41:44

0

You can try my code from this question:

echo preg_replace('/<a href="([^"]*)([^<\/]*)<\/a>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext'); 

If you want to convert some other tag, that is easy:

echo preg_replace('/<img src="([^"]*)([^\/><]*)>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext'); 
0

Match whitespace (\s) at the start and at the end of the URL string. This will ensure that

"http://url.com" 

is not matched, while

http://url.com 

is matched.