Extracting all the text from a website to build a concordance

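How can I grab all the text from a website, and I don't just mean Ctrl+A/C? I'd like to extract all the text from a website (and all of its associated pages) and use it to build a concordance of the words on that site. Any ideas?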

2013-08-04 Luke Thompson

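I was intrigued by this, so I've written the first part of a solution.

The code is written in PHP because the strip_tags function is convenient. It's also rough and procedural, but I feel it demonstrates my idea.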

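<?php
$url = "http://www.stackoverflow.com";

// To use this you'll need to get a key for the Readability Parser API http://readability.com/developers/api/parser
$token = "";

// I make an HTTP GET request to the Readability API and then decode the returned JSON.
// The query parameters are URL-encoded so the target URL survives intact.
$response = file_get_contents(
    "http://www.readability.com/api/content/v1/parser?url=" . urlencode($url) . "&token=" . urlencode($token)
);
$parserResponse = ($response === false) ? null : json_decode($response);

// Stop early if the request failed or the JSON has no content field
if ($parserResponse === null || !isset($parserResponse->content)) {
    die("Could not get parsed content from the Readability API\n");
}

// I'm only interested in the content string in the JSON object
$content = $parserResponse->content;

// I strip the HTML tags from the article content
$wordsOnPage = strip_tags($content);

$wordCounter = array();

// Split on any run of whitespace (spaces, tabs, newlines) and drop empty strings
$wordSplit = preg_split('/\s+/', $wordsOnPage, -1, PREG_SPLIT_NO_EMPTY);

// I then loop through each word in the article keeping count of how many times I've seen the word
foreach ($wordSplit as $word) {
    incrementWordCounter($word);
}

// Then I sort the array so the most frequent words are at the end
asort($wordCounter);

// And dump the array
var_dump($wordCounter);

// Adds one to the running count for $word, starting at 1 the first time it is seen
function incrementWordCounter($word)
{
    global $wordCounter;

    if (isset($wordCounter[$word])) {
        $wordCounter[$word] = $wordCounter[$word] + 1;
    } else {
        $wordCounter[$word] = 1;
    }
}

?>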

I needed to do this to configure PHP for SSL, which the Readability API uses.

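The next step in the solution would be to search the page for links and call this recursively, in an intelligent way, to cover the associated pages requirement.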
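That link-following step could be sketched roughly as below. This is only one possible approach, not part of the original answer: the function names extractSameHostLinks and crawlSite, the $fetch callable, and the $maxPages limit are all hypothetical. It assumes PHP's DOM extension is available, stays on the starting host, and for simplicity skips relative links such as "about.html".

```php
<?php
// Hypothetical helper: collect unique same-host links from one page's HTML.
function extractSameHostLinks($html, $baseUrl)
{
    if (!is_string($html) || trim($html) === '') {
        return array();
    }
    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host'];

    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty or same-page anchor
        }
        if (strpos($href, '//') === 0) {
            $href = $base['scheme'] . ':' . $href;   // protocol-relative
        } elseif ($href[0] === '/') {
            $href = $origin . $href;                 // root-relative
        } elseif (!preg_match('#^https?://#i', $href)) {
            continue; // mailto:, javascript:, and relative paths (simplification)
        }
        $href = preg_replace('/#.*$/', '', $href);   // drop the fragment
        if (parse_url($href, PHP_URL_HOST) === $base['host']) {
            $links[$href] = true;                    // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Hypothetical breadth-first crawl. $fetch is any callable that takes a URL
// and returns that page's HTML, or false on failure.
function crawlSite($startUrl, $fetch, $maxPages = 50)
{
    $queue = array($startUrl);
    $visited = array();
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;
        $html = call_user_func($fetch, $url);
        foreach (extractSameHostLinks($html, $url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

Passing 'file_get_contents' as $fetch would crawl a live site; each visited URL could then be fed to the word-counting code above. The visited set is what keeps the crawl from looping when pages link back to each other, and the page limit keeps it from running away on a large site.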

Also, the code above only gives you the raw data of a word count; you would probably want to process it some more to make it meaningful.
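As an illustration of the kind of processing meant here, the sketch below lowercases the text, strips punctuation, and drops common stop words before counting. It is an assumption about what "meaningful" might involve, not the original author's method: the function name countMeaningfulWords and the stop-word list are hypothetical, and the pattern only handles unaccented ASCII letters.

```php
<?php
// Hypothetical helper: turn raw page text into cleaned word counts.
function countMeaningfulWords($text, array $stopWords)
{
    // Lowercase so "The" and "the" are counted as the same word.
    $text = strtolower($text);

    // Pull out runs of letters, allowing internal apostrophes ("don't").
    // Punctuation and digits are discarded by this step.
    preg_match_all("/[a-z]+(?:'[a-z]+)*/", $text, $matches);

    // Remove very common words that carry little meaning on their own.
    $words = array_diff($matches[0], $stopWords);

    // Count occurrences, then sort so the most frequent words come first.
    $counts = array_count_values($words);
    arsort($counts);
    return $counts;
}

// Example usage with a tiny, illustrative stop-word list.
$counts = countMeaningfulWords(
    "The cat and the hat. The cat's hat!",
    array('the', 'and')
);
print_r($counts);
```

A real stop-word list would be much longer, and stemming (treating "run" and "running" as one word) is another step that might be worth adding.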


2013-08-04 01:59:23 Joel
