How to get the most popular phrases from a large volume of text?

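I'm setting up a Twitter-style "trending topics" box for my forum. I already have the most popular /words/, but I can't even begin to think how I would get popular phrases, the way Twitter does.

As it stands, I just take the content of the last 200 posts, put it all into one string, split it into words, and then sort by which words are used the most. How can I turn this from most popular words into most popular phrases?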

2010-10-13 katoth

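It really depends on what you want to define as a "phrase". – 2010-10-13 20:31:43

How about gluing two-, three-, and four-word runs together into a single item and counting those? It's still O(n). – 2010-10-13 20:34:36

I don't think you'll find your answer on Stack Overflow in a few lines of code.. this question is a thesis subject, probably related to web semantics. – pleasedontbelong 2010-10-13 20:43:23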

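One technique is to use ZSETs (sorted sets) in Redis for this kind of thing. If you have very large data sets, you'll find that you can do something like this: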

// Split the input into words on any whitespace, dropping empty entries.
$words = preg_split('/\s+/', $input, -1, PREG_SPLIT_NO_EMPTY);
$word_count = count($words);

$r = new Redis(); // Owlient's PHPRedis PECL extension
$r->connect("127.0.0.1", 6379);

// Join a slice of words into one phrase and bump its score in the sorted set.
function process_phrase($phrase) {
    global $r;
    $phrase = implode(" ", $phrase);
    $r->zIncrBy("trending_phrases", 1, $phrase);
}

// Count every contiguous run of words starting at each position.
for ($i = 0; $i < $word_count; $i++)
    for ($j = 1; $j <= $word_count - $i; $j++)
        process_phrase(array_slice($words, $i, $j));
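
The nested loops above count every contiguous run of words, so the number of phrases grows quadratically with the length of the input. A minimal sketch of one way to bound that, assuming phrases are capped at a few words (the helper name word_ngrams and the 2-to-3-word bounds used here are hypothetical, not part of the original answer):

```php
<?php
// Hypothetical helper: returns every run of $min..$max consecutive words.
function word_ngrams(array $words, int $min = 2, int $max = 4): array
{
    $phrases = [];
    $count = count($words);
    for ($i = 0; $i < $count; $i++) {
        // Stop growing the run once it would pass the end of the input.
        for ($len = $min; $len <= $max && $i + $len <= $count; $len++) {
            $phrases[] = implode(' ', array_slice($words, $i, $len));
        }
    }
    return $phrases;
}

// Normalize case and split on whitespace before building phrases.
$words = preg_split('/\s+/', strtolower('The quick brown fox'), -1, PREG_SPLIT_NO_EMPTY);
$phrases = word_ngrams($words, 2, 3);
```

Each entry of $phrases could then be passed to $r->zIncrBy("trending_phrases", 1, $phrase) in place of the unbounded loops.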

To retrieve the top phrases, you would use this:

// Assume $r is instantiated like it is above
// Indices are inclusive, so 0..9 returns the ten highest-scoring phrases.
$trending_phrases = $r->zRevRange("trending_phrases", 0, 9);

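$trending_phrases will be an array of the top ten trending phrases. To get recent trending phrases (as opposed to a persistent, global set of phrases), duplicate all of the Redis interactions above. For each interaction, use a key that indicates, say, today's timestamp and tomorrow's timestamp (i.e. days since January 1, 1970). When retrieving the results with $trending_phrases, just retrieve both today's and tomorrow's (or yesterday's) keys and use array_merge and array_unique to find the union.

Hope this helps!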
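
The per-day keys described above could be sketched roughly as follows. The helper names trending_key and merge_trending, and the "prefix:day" key format, are assumptions made for illustration; the Redis calls are shown only as comments because they need a connected client.

```php
<?php
// Hypothetical helper: builds a per-day key such as "trending_phrases:14896".
function trending_key(int $timestamp, string $prefix = 'trending_phrases'): string
{
    // Whole days since 1970-01-01 (UTC).
    return $prefix . ':' . intdiv($timestamp, 86400);
}

// Hypothetical helper: union of two phrase lists, keeping first-seen order.
function merge_trending(array $today, array $yesterday): array
{
    return array_values(array_unique(array_merge($today, $yesterday)));
}

$now = time();
$today_key = trending_key($now);
$yesterday_key = trending_key($now - 86400);

// With a connected phpredis client $r, writes and reads would look like:
//   $r->zIncrBy($today_key, 1, $phrase);
//   $top = merge_trending(
//       $r->zRevRange($today_key, 0, 9),
//       $r->zRevRange($yesterday_key, 0, 9)
//   );
```

Note that a plain union discards the scores, so the merged list is no longer strictly ranked; whether that matters depends on how the box is displayed.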

2010-10-14 04:13:23 mattbasta

Instead of splitting into individual words, split into individual phrases; it's as simple as that.

$popular = array();

foreach ($tweets as $tweet)
{
    // split by common punctuation chars, dropping empty pieces
    $sentences = preg_split('~[.!?]+~', $tweet, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($sentences as $sentence)
    {
        $sentence = strtolower(trim($sentence)); // normalize sentences

        if ($sentence === '')
        {
            continue; // skip pieces that were only whitespace
        }

        if (isset($popular[$sentence]) === false)
        //if (array_key_exists($sentence, $popular) === false)
        {
            $popular[$sentence] = 0;
        }

        $popular[$sentence]++;
    }
}

// sort by count, most frequent first, keeping the sentence keys
arsort($popular);

echo '<pre>';
print_r($popular);
echo '</pre>';

If you consider a phrase to be an aggregation of n consecutive words, this will be a lot slower. You might consider

2010-10-13 20:36:57

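Performance-wise, 'array_key_exists($sentence, $popular) !== true' is a whole order of magnitude slower than '!isset($popular[$sentence])'. The functional difference doesn't matter in this case. – mattbasta 2010-10-14 03:59:09

@mattbasta: Indeed. But an order of magnitude slower? As in 10 times slower? Do you have any benchmarks that show these results? – 2010-10-14 09:25:53

I don't have anything handy, but in my experience with larger arrays (1000+ elements), 'isset' takes under 50ms while 'array_key_exists' can take 300-400ms. – mattbasta 2010-10-14 14:00:09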
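
No benchmark was posted in the thread, so here is a rough sketch of how the two checks could be compared. The array contents and iteration count are made up for illustration, and the timings depend heavily on the PHP version and machine, so no result is claimed here. The first lines show the functional difference the comments mention: isset() reports a key whose value is null as not set.

```php
<?php
$popular = ['hello world' => 3, 'empty' => null];

// Semantic difference: isset() treats a null value as "not set".
$isset_null = isset($popular['empty']);              // false
$exists_null = array_key_exists('empty', $popular);  // true

// Rough timing loops; the iteration count is an arbitrary choice.
$iterations = 100000;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $found = isset($popular['hello world']);
}
$isset_seconds = microtime(true) - $start;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $found = array_key_exists('hello world', $popular);
}
$exists_seconds = microtime(true) - $start;

printf("isset: %.4fs, array_key_exists: %.4fs\n", $isset_seconds, $exists_seconds);
```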

I'm not sure what type of answer you're looking for, but Laconica:

http://status.net/?source=laconica

is an open source Twitter clone (a much simpler version).

Maybe you could use part of its code to build your own popular phrases?

Good luck!

2010-10-14 04:20:06 Trufa
