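How to handle relatively large arrays in PHP?

I have a large collection of more than 5,000 text files containing more than 200,000 words. The problem is that when I try to combine the whole collection into a single array in order to find the unique words in the collection, no output is shown (this is due to the very large size of the array). The following piece of code works fine for a small collection, e.g. 30 files, but cannot operate on the very large collection. Help me fix this problem. Thanks.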

<?php 
ini_set('memory_limit', '1024M'); 
$directory = "archive/"; 
$dir = opendir($directory); 
$file_array = array(); 
while (($file = readdir($dir)) !== false) { 
    $filename = $directory . $file; 
    $type = filetype($filename); 
    if ($type == 'file') { 
        $contents = file_get_contents($filename);
        $text = preg_replace('/\s+/', ' ', $contents);
        $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
        $text = explode(" ", $text);
        $text = array_map('strtolower', $text);
        $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
        $text = (array_diff($text,$stopwords));
        $file_array = array_merge($file_array, $text);
    } 
} 
closedir($dir); 
$total_word_count = count($file_array); 
$unique_array = array_unique($file_array); 
$unique_word_count = count($unique_array); 
echo "Total Words: " . $total_word_count."<br>"; 
echo "Unique Words: " . $unique_word_count; 
?>

The dataset of text files can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip

Source

2014-07-08 user3814982

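Have you tried raising the memory limit? – putvande

I have 2 GB of memory. – user3814982

Have you tried using an XML file or CSV? – M98

Instead of juggling multiple arrays, just build one, fill it with the words as you encounter them, and count them as you insert. This will be faster, and you will even have a count for each word.

By the way, you also need to add the empty string to the stopword list, or adjust your logic to avoid taking it in.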

<?php 
$directory = "archive/"; 
$dir = opendir($directory); 
$wordcounter = array(); 
while (($file = readdir($dir)) !== false) { 
    if (filetype($directory . $file) == 'file') { 
        $contents = file_get_contents($directory . $file);
        $text = preg_replace('/\s+/', ' ', $contents);
        $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
        $text = explode(" ", $text);
        $text = array_map('strtolower', $text);
        // Use each word as an array key so duplicates are never stored twice
        foreach ($text as $word) {
            if (!isset($wordcounter[$word])) {
                $wordcounter[$word] = 1;
            } else {
                $wordcounter[$word]++;
            }
        }
    } 
} 
closedir($dir); 

$stopwords = array("", "a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to"); 
foreach($stopwords as $stopword) 
    unset($wordcounter[$stopword]); 

$total_word_count = array_sum($wordcounter); 
$unique_word_count = count($wordcounter); 
echo "Total Words: " . $total_word_count."<br>"; 
echo "Unique Words: " . $unique_word_count."<br>"; 

// bonus: 
$max = max($wordcounter); 
echo "Most used word is used $max times: " . implode(", ", array_keys($wordcounter, $max))."<br>"; 
?>

Source

2014-07-08 11:03:52

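**DEMO**: http://codepad.org/T5pnOSKH –

This code works much better. Thanks for that. – user3814982

I suggest you add the following stopwords: "", "", "in". If you do so, you will have 53'993 unique words used 1'957'286 times. The word "said" is the most used, at 39'973 times. This script runs in under 8 seconds on my computer to process your 5,000 files (14.8 MB). –

Why merge all the arrays into one big, useless array?

You can use the array_unique function to get the unique values from an array, then concatenate the result with the next array from the next file and apply the same function again.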
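
A minimal sketch of that idea, assuming the words of each file are already available as an array. The helper name `mergeUnique` and the sample data are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: merges the words of one file into the running set of
// unique words, so the working array never keeps duplicates from earlier files.
function mergeUnique(array $unique, array $words): array
{
    // array_merge appends the new words, array_unique drops the repeats,
    // and array_values re-indexes the result from 0.
    return array_values(array_unique(array_merge($unique, $words)));
}

// Stand-in data: each inner array plays the role of the words from one file.
$files = [
    ['the', 'quick', 'fox'],
    ['the', 'lazy', 'dog'],
    ['quick', 'dog'],
];

$unique = [];
foreach ($files as $words) {
    $unique = mergeUnique($unique, $words);
}

echo count($unique), "\n"; // 5 unique words: the, quick, fox, lazy, dog
```

With this approach the array only grows with the number of distinct words, but the total word count has to be tracked separately, and calling array_unique once per file does repeated work. Counting words as array keys, as in the answer above, is likely to be faster.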

Source

2014-07-08 10:14:41 Justinas

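Do not raise the memory limit that high. That is usually not the best solution.

What you should do is load the file line by line (which is easy in PHP when handling a format such as CSV), process a single line (or a small batch of lines), and write the result to the output file. That way you can process a large amount of input data with a small amount of memory.

In any case, try to find a way to split the complete input into smaller chunks that can be processed without increasing the memory limit.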
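
As a rough sketch of line-by-line processing, the following reads a file with fgets so that only one line is held in memory at a time. The helper name `countWordsByLine` and the temporary demo file are assumptions for illustration. The sketch also aggregates counts in an array rather than writing to an output file, and it splits on non-word characters, which tokenizes slightly differently from the regexes in the question.

```php
<?php
// Hypothetical helper: counts words in one file while holding only a single
// line in memory at a time. $counts is updated in place.
function countWordsByLine(string $path, array &$counts): void
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return; // unreadable file: skip it
    }
    while (($line = fgets($handle)) !== false) {
        // Split on anything that is not a lowercase letter, digit, or hyphen.
        $words = preg_split('/[^a-z0-9\-]+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    fclose($handle);
}

// Demo on a temporary file so the sketch runs on its own.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "The quick fox\nthe lazy dog\n");

$counts = [];
countWordsByLine($path, $counts);
unlink($path);

echo array_sum($counts), " words, ", count($counts), " unique\n"; // 6 words, 5 unique
```

For the files in the question, the same function could be called once per file inside the readdir loop, sharing a single `$counts` array.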

Source

2014-07-08 10:20:59 feeela

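Another approach is to load everything into a database table and let the database server do most of the work.

Or process the rows in chunks and mark the finished ones, or aggregate them into another table.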
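
One way this could look, as a sketch only: it assumes the pdo_sqlite extension is installed and uses an in-memory SQLite database in place of a real database server. The table name `word_counts`, its columns, and the sample data are hypothetical.

```php
<?php
// Assumption: the pdo_sqlite extension is available; an in-memory database
// stands in for a real database server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema: one row per distinct word with its running count.
$pdo->exec('CREATE TABLE word_counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)');

// SQLite-specific: INSERT OR IGNORE skips words that already have a row.
$insert = $pdo->prepare('INSERT OR IGNORE INTO word_counts (word, n) VALUES (?, 0)');
$update = $pdo->prepare('UPDATE word_counts SET n = n + 1 WHERE word = ?');

// Stand-in data: each inner array plays the role of the words from one file.
$files = [
    ['the', 'quick', 'fox'],
    ['the', 'lazy', 'dog'],
];

foreach ($files as $words) {
    // One transaction per file: each file is a chunk that is fully
    // committed before the next one starts.
    $pdo->beginTransaction();
    foreach ($words as $word) {
        $insert->execute([$word]);
        $update->execute([$word]);
    }
    $pdo->commit();
}

// The database does the aggregation.
$total  = (int) $pdo->query('SELECT SUM(n) FROM word_counts')->fetchColumn();
$unique = (int) $pdo->query('SELECT COUNT(*) FROM word_counts')->fetchColumn();

echo "Total Words: $total\n";   // Total Words: 6
echo "Unique Words: $unique\n"; // Unique Words: 5
```

With a server such as MySQL or PostgreSQL the statements for inserting and incrementing would differ, so treat the SQL here as SQLite-specific.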

Source

2014-07-08 10:36:57 DanFromGermany
