2012-08-09 64 views
0
<?php

$filename = "largefile.txt";

/* get content of $filename in $content */
$content = strtolower(file_get_contents($filename));

/* split $content into array of substrings of $content i.e wordwise */
$wordArray = preg_split('/[^a-z]/', $content, -1, PREG_SPLIT_NO_EMPTY);

/* "stop words", filter them */
$filteredArray = array_filter($wordArray, function($x){
    return !preg_match("/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/",$x);
});

/* get associative array of values from $filteredArray as keys and their frequency count as value */
$wordFrequencyArray = array_count_values($filteredArray);

/* Sort array from higher to lower, keeping keys */
arsort($wordFrequencyArray);

This is my code. It counts the frequency of each distinct word in a single file, and it works.

Counting word frequency across multiple files

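What I want to do now is this: suppose there are 10 text files. I want to count the frequency of a word across all 10 files. For example, if I look up the word "stack", I want the number of times "stack" appears in all 10 files combined. I then want to do the same for every distinct word.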

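I have it working for a single file, but I can't figure out how to extend it to multiple files. Thanks for your help, and sorry for my bad English.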

+0

Have you tried wrapping the whole thing in a loop over each file? – Scuzzy 2012-08-09 06:16:31

Answer

2

Put what you already have into a function, then use a foreach loop to call it for each file name in an array:

<?php 

$wordFrequencyArray = array(); 

/* A named function cannot use "use (...)" (that syntax is only for closures),
   so the running totals are passed in by reference instead */
function countWords($file, array &$wordFrequencyArray) { 
    /* get content of $file in $content */ 
    $content = strtolower(file_get_contents($file)); 

    /* split $content into array of substrings of $content i.e wordwise */ 
    $wordArray = preg_split('/[^a-z]/', $content, -1, PREG_SPLIT_NO_EMPTY); 

    /* "stop words", filter them */ 
    $filteredArray = array_filter($wordArray, function($x){ 
        return !preg_match("/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/",$x); 
    }); 

    /* add this file's count for each word to the running total */ 
    foreach (array_count_values($filteredArray) as $word => $count) { 
        if (!isset($wordFrequencyArray[$word])) $wordFrequencyArray[$word] = 0; 
        $wordFrequencyArray[$word] += $count; 
    } 
} 
$filenames = array('file1.txt', 'file2.txt', 'file3.txt', 'file4.txt' /* ... */); 
foreach ($filenames as $file) { 
    countWords($file, $wordFrequencyArray); 
} 

print_r($wordFrequencyArray);
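
As a variation on the same idea, here is a minimal sketch that returns the counts instead of updating a shared array, sorts the combined totals the way the question's `arsort` call does, and then looks up a single word. The helper names `countWordsInText` and `countWordsInFiles` are made up for this illustration, and skipping unreadable files silently is an assumption about the desired behaviour, not something the question specifies.

```php
<?php

/* Hypothetical helper: count the words in one string, skipping
   single-letter words and the stop words from the question. */
function countWordsInText($text) {
    $stopWords = array('a', 'an', 'and', 'the', 'this', 'at', 'in', 'or', 'of', 'is', 'for', 'to');
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) < 2 || in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = (isset($counts[$word]) ? $counts[$word] : 0) + 1;
    }
    return $counts;
}

/* Hypothetical helper: sum the per-file counts over all given files.
   Assumption: files that are missing or unreadable are skipped. */
function countWordsInFiles(array $filenames) {
    $totals = array();
    foreach ($filenames as $file) {
        if (!is_file($file) || !is_readable($file)) {
            continue;
        }
        foreach (countWordsInText(file_get_contents($file)) as $word => $count) {
            $totals[$word] = (isset($totals[$word]) ? $totals[$word] : 0) + $count;
        }
    }
    /* Sort from highest to lowest count, keeping the word keys */
    arsort($totals);
    return $totals;
}

/* Placeholder file names; replace with the real list of files */
$totals = countWordsInFiles(array('file1.txt', 'file2.txt'));

/* Total occurrences of one word across all files (0 if it never appears) */
echo isset($totals['stack']) ? $totals['stack'] : 0, "\n";
```

Returning the totals keeps the function free of shared state, so the same helper can be called on different sets of files without resetting anything in between.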