PHP: extract all text from an HTML page

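I've been scratching my head over this for the past hour. Is there any reliable way to extract only the text from an HTML page and nothing else (code, images, links, styles, scripts)? I am trying to extract all the text inside the body of an HTML document.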

This includes paragraphs, plain text and tabular data.

So far I have tried the simplehtmldom parser and also file_get_contents, but neither of them worked. Here is the code:

<?php 

require_once "simple_html_dom.php"; 

function getplaintextintrofromhtml($html) { 

    // Remove the HTML tags 
    $html = strip_tags($html); 

    // Convert HTML entities to single characters 
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8'); 

    return $html; 

} 

$html = file_get_contents('http://www.thefreedictionary.com/contempt'); 

echo getplaintextintrofromhtml($html); 
?>

Here is a screenshot of the output:

https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk

As you can see, it shows strange output and does not even display the text of the whole page.

Source

2014-11-25 Ahmar Ali

http://php.net/manual/en/book.curl.php and use strip_tags() – EL3PHANTEN 2014-11-25 10:42:54

What do you want to extract? It's unclear. What should the final output be? The content inside ''? – Ghost 2014-11-25 10:48:08

Edited my question @Ghost – 2014-11-25 10:49:55
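
On the strip_tags() suggestion in the comments: strip_tags() removes only the tags themselves, so the contents of script and style elements stay in the output, which may explain the strange text in the question. Below is a minimal sketch that drops those blocks first. The helper name stripTagsWithoutScripts and the sample markup are made up for illustration and do not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): drop <script>/<style> blocks
// before calling strip_tags(), which would otherwise keep their contents.
function stripTagsWithoutScripts(string $html): string
{
    // Remove script and style elements together with everything inside them.
    // Fall back to the untouched input if the regex engine reports an error.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Remove the remaining tags and decode entities such as &quot;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Made-up sample page, used only for illustration.
$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><p>Hello &quot;world&quot;</p><script>var x = 1;</script></body></html>';

echo stripTagsWithoutScripts($sample), "\n"; // Hello "world"
```

A regular expression is a blunt tool for HTML and can miss unusual markup, so a real parser is usually more robust for pages you do not control.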

I don't know why you think SimpleHTMLDOM doesn't work, but you have to use it properly: just target the body, then use the ->innertext property:

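function getplaintextintrofromhtml($url) { 
    // require_once avoids redeclaration errors if the function is called twice 
    require_once 'simple_html_dom.php'; 

    $html = file_get_html($url); 
    // point to the body, then get the innertext 
    // (innertext keeps the inner markup; ->plaintext returns the text only) 
    $data = $html->find('body', 0)->innertext; 
    return $data; 
} 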

echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');
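
If adding a third-party parser is not an option, the same idea (target the body, then read its text) can be sketched with PHP's bundled DOM extension. This is a rough sketch, not the method from the answer above. The helper name bodyTextFromHtml and the sample markup are made up for illustration, and the code assumes the page is UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): collect the readable text of
// <body> using PHP's built-in DOMDocument and DOMXPath classes.
function bodyTextFromHtml(string $html): string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid, so keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    // The prefix hints to libxml that the input is UTF-8 (an assumption).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove elements whose contents are code rather than readable text.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style | //noscript')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Gather every remaining text node under <body>, skipping empty ones.
    $parts = [];
    foreach ($xpath->query('//body//text()') as $textNode) {
        $text = trim(preg_replace('/\s+/', ' ', $textNode->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    return implode(' ', $parts);
}

// Made-up sample page, used only for illustration.
$sample = '<html><head><title>T</title><style>p{color:red}</style></head>'
        . '<body><h1>Contempt</h1><p>Hello &quot;world&quot;</p>'
        . '<table><tr><td>cell</td></tr></table>'
        . '<script>var x = 1;</script></body></html>';

echo bodyTextFromHtml($sample), "\n";
```

Joining text nodes with a space keeps paragraphs and table cells apart, at the cost of an occasional extra space around inline tags such as <b>. To run it on a live page, the string returned by file_get_contents($url) could be passed in instead of $sample.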

Source

2014-11-25 10:55:49 Ghost

I think PHP Simple HTML DOM Parser is the fastest and easiest way. Try http://simplehtmldom.sourceforge.net/

Features 
An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way! 
Require PHP 5+. 
Supports invalid HTML. 
Find tags on an HTML page with selectors just like jQuery. 
Extract contents from HTML in a single line

Source

2014-11-25 10:52:29

Html2Text is a good library for exactly that purpose.

https://github.com/mtibben/html2text

Install it with Composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;'); 

echo $html->getText(); // Hello, "WORLD"

Source

2017-03-27 10:18:52
