PHP: get the body from an HTML page

I want to strip the body markup out of a complete HTML document.

I use the script below.

<?php  
    function getbody($filename) { 
     $file = file_get_contents($filename); 

     $bodystartpattern = ".*<body>"; 
     $bodyendpattern = "</body>.*"; 

     $noheader = eregi_replace($bodystartpattern, "", $file); 

     $noheader = eregi_replace($bodyendpattern, "", $noheader); 

     return $noheader; 
    } 
    $bodycontent = getbody($_GET['url']); 
?>

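In some cases, however, the tag <body> is not literally present; the tag may be something like <body style="margin:0;">. Can anyone tell me how to find the body tag in that case, using a regular expression in $bodystartpattern that looks for the closing ">" of the opening body tag?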

Source

2014-06-25 Guido Lemmens 2

Side note: [`eregi_replace()`](http://www.php.net//manual/en/function.eregi-replace.php) has been deprecated as of PHP 5.3.0. Relying on this function is highly discouraged.
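As a rough sketch of how the deprecated calls could be swapped for PCRE, the snippet below uses preg_replace() with a pattern that accepts attributes on the opening body tag. The function name getBodyWithPreg and the sample markup are made up for illustration and are not from the original post. It assumes no attribute value contains a literal ">", which a regular expression cannot handle reliably.

```php
<?php
// Hypothetical helper (not from the original post): drops everything up to and
// including the opening <body ...> tag, and everything from </body> onwards.
function getBodyWithPreg(string $html): string
{
    // <body\b[^>]*> matches "<body>" as well as "<body style=...>".
    // Flags: i = case-insensitive, s = let "." also match newlines.
    $noHeader = preg_replace('/^.*?<body\b[^>]*>/is', '', $html, 1);
    if ($noHeader === null) {
        $noHeader = $html; // preg_replace() returns null on a PCRE error
    }

    $noFooter = preg_replace('/<\/body\s*>.*$/is', '', $noHeader, 1);
    return $noFooter === null ? $noHeader : $noFooter;
}

// Made-up sample input for illustration.
$sample = "<html><head><title>t</title></head>\n"
        . "<body style=\"margin:0;\">\n<p>Hello</p>\n</body></html>";

echo trim(getBodyWithPreg($sample)), "\n";
```

If the input has no body tag at all, both patterns simply fail to match and the string is returned unchanged.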

See [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) on using regular expressions to parse HTML ...

@1nflktd I tried the code below.

<?php 
    header('Content-Type:text/html; charset=UTF-8'); 

    function getbody($filename) { 
     $file = file_get_contents($filename);  
     $dom = new DOMDocument; 
     $dom->loadHTML($file); 
     $bodies = $dom->getElementsByTagName('body'); 
     assert($bodies->length === 1); 
     $body = $bodies->item(0); 
     for ($i = 0; $i < $body->children->length; $i++) { 
      $body->remove($body->children->item($i)); 
     } 
     $stringbody = $dom->saveHTML($body); 
     return $stringbody; 
    } 

    $url = "http://www.barcelona.com/"; 
    $bodycontent = getbody($url); 
?> 

<html> 
<head></head> 
<body> 
<?php 
    echo "BODY ripped from: ".$url."<br/>"; 
    echo "<textarea rows='40' cols='200' >".$bodycontent."</textarea>"; 
?> 
</body> 
</html>

Source

2014-06-26 00:06:24

I just tried your code on my machine and it works fine. Are you getting any errors? If you don't have error reporting enabled, please turn it on.
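For reference, one common way to turn error reporting on while debugging is shown below. This is a minimal sketch for a development machine; it is not something the commenter posted, and it should not be left enabled on a public site.

```php
<?php
// Report every notice, warning, and error while debugging.
error_reporting(E_ALL);

// Show them in the response instead of only writing them to the log.
// Intended for development only.
ini_set('display_errors', '1');
```

These two lines go at the very top of the script, before any other code runs.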

It doesn't work on my machine. You can see the script running at http://www.kunstplantenonline.nl/test/test.php and look at the PHP warnings.

Check http://stackoverflow.com/questions/9149180/domdocumentloadhtml-error, and check my updated answer.

Why not use an HTML parser?

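function getbody($filename) { 
    $file = file_get_contents($filename); 

    $dom = new DOMDocument(); 
    // Buffer libxml parse errors internally so that sloppy real-world HTML
    // does not flood the output with PHP warnings.
    libxml_use_internal_errors(true); 
    $dom->loadHTML($file); 
    libxml_clear_errors(); 
    libxml_use_internal_errors(false); 

    $bodies = $dom->getElementsByTagName('body'); 
    assert($bodies->length === 1); 
    $body = $bodies->item(0); 

    // DOMElement has no "children" property or remove($node) method, so
    // walk childNodes instead and serialize each child. This returns the
    // contents of <body> without the <body> tag itself.
    $stringbody = ''; 
    foreach ($body->childNodes as $child) { 
        $stringbody .= $dom->saveHTML($child); 
    } 
    return $stringbody; 
}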

DOM loadHTML reference
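When loadHTML() warnings are suppressed with libxml_use_internal_errors(true), the buffered messages can still be inspected afterwards. The sketch below shows one way to do that; the sample markup is made up for illustration, and which messages libxml reports (if any) depends on the installed libxml version.

```php
<?php
// Made-up, deliberately sloppy markup: the stray </div> is the kind of
// thing the HTML parser typically complains about.
$html = '<html><body style="margin:0;"><p>Hello</p></div></body></html>';

// Buffer parse errors and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line, and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                 // free the buffered errors
libxml_use_internal_errors($previous); // restore the earlier setting
```

Restoring the previous setting rather than hard-coding false keeps the snippet from changing behaviour for other code that had already enabled internal error handling.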

Source

2014-06-25 18:15:44

I have copied your code, but now it returns nothing... any ideas?

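@GuidoLemmens2 Is there any PHP code inside the page you are fetching, more specifically any '$'? That might break things. Do you have error reporting on? Do you get any response from it?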

Did you see the code I pasted in my next post?
