How to get the body content without the JavaScript code

To get the content of the body tag, I use the code below. How can I get the body content without the JavaScript code?

$html = @file_get_contents($url); 
$doc = new DOMDocument(); 
@$doc->loadHTML($html); 
$nodes = $doc->getElementsByTagName('body'); 
$body = $nodes->item(0)->nodeValue;

How do I remove the JS code from $body? By JS code I mean anything that looks like

<script> /*Some js code*/ </script>


2015-12-30 Lomse

Already asked: http://stackoverflow.com/questions/7130867/remove-script-tag-from-html-content – nullexception

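The solution linked in the comment above solved my problem. The code below removes the script tags together with their content and then reads the text of the body tag: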

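$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
@$doc->loadHTML($html);
$script = $doc->getElementsByTagName('script');

// The node list is live, so collect the nodes first and remove them afterwards.
$remove = [];
foreach ($script as $item) {
    $remove[] = $item;
}

// Removing a script node also removes the JavaScript text inside it.
foreach ($remove as $item) {
    $item->parentNode->removeChild($item);
}

// nodeValue on the body element returns its text content, without markup.
$node = $doc->getElementsByTagName('body');
$body = $node->item(0)->nodeValue;

echo $body;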
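For reuse, the same steps could be wrapped in a function. This is only a sketch: the name bodyTextWithoutScripts is hypothetical, and it assumes you want an empty string back when the input is empty or has no body element. It uses libxml_use_internal_errors() rather than @ to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of
// <body> after every <script> element has been removed.
function bodyTextWithoutScripts(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so take a snapshot first.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : $body->textContent;
}

// Usage with a small inline document:
echo bodyTextWithoutScripts(
    '<html><body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>'
); // prints: HelloWorld
```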


2015-12-30 13:36:54 Lomse

Try this:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

Regular expressions can go wrong, so it is safer to do it this way:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ?: $html;

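That way, when an "accident" happens (preg_replace returns null on error), we get the original $html back instead of an empty string.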
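One caveat with the short ternary is that it also falls back when the result is a legitimately empty string, for example when the input consists of nothing but a script block. A possible variant, offered as a sketch, uses the null coalescing operator so that only a real failure restores the original. The function name stripScriptsRegex is hypothetical, and the added \b and i flag are my own adjustments to the pattern, not part of the answer.

```php
<?php
// Hypothetical helper: strips <script>...</script> blocks with a regex.
function stripScriptsRegex(string $html): string
{
    // i: match <SCRIPT> as well; s: let "." span newlines inside the script.
    $result = preg_replace('/<script\b.*?<\/script>/is', '', $html);

    // preg_replace() returns null on error; "??" falls back only in that case,
    // so a valid empty result is kept.
    return $result ?? $html;
}

echo stripScriptsRegex('<p>a</p><script>x()</script><p>b</p>'); // prints: <p>a</p><p>b</p>
```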


2015-12-30 10:53:40 Manikiran

This only removes the script tags but keeps the JavaScript content. The idea is to remove both the script tags and the JavaScript content. – Lomse

If you are already using DOMDocument, why not remove the nodes?

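$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile("from_link_to.html");
$scripts = $dom->getElementsByTagName('script');
// Copy the live node list first; removing nodes while iterating it skips some.
foreach (iterator_to_array($scripts) as $script) {
    // removeChild() has to be called on the parent node, not on the node list.
    $script->parentNode->removeChild($script);
}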
...

Take a closer look at the DOMDocument class, and at why regular expressions are a nightmare for this kind of task.
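As one small illustration of that point, the sketch below runs the case-sensitive pattern from the earlier answer and the DOM approach on the same input. The sample markup is made up for the comparison. The HTML parser normalizes tag names, so an uppercase SCRIPT tag is still found through the DOM, while the regex leaves it in place.

```php
<?php
// Made-up sample input with an uppercase script tag.
$html = '<html><body><p>Hi</p><SCRIPT>alert(1)</SCRIPT></body></html>';

// Regex from the earlier answer: no "i" flag, so <SCRIPT> is not matched
// and the string comes back unchanged.
$viaRegex = preg_replace("/<script.*?\/script>/s", "", $html);

// DOM approach: the parser lowercases element names, so the node is found.
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
    $node->parentNode->removeChild($node);
}
$viaDom = $dom->getElementsByTagName('body')->item(0)->textContent;

var_dump($viaRegex === $html); // the regex changed nothing
var_dump($viaDom);             // body text with the script removed
```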


2015-12-30 11:13:43
