How do I remove duplicate rows from a CSV file?

Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy...... 
row2 tesg ghh 
row2 tesg ghh 
row2 tesg ghh 
.... 
row3 tesg ghh 
row3 tesg ghh 
... 
row4 tesg ghh

Expected result:

row1 test tyy...... 
row2 tesg ghh 
.... 
row3 tesg ghh 
... 
row4 tesg ghh

Where can I start to do this with PHP?

Source

2012-12-28 user1932607

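What have you tried so far? – Champ

Do all the duplicate lines really appear consecutively? – Cups

The straightforward approach is to read the file line by line and keep track of every line you have already seen. If the current line has already been seen, skip it.

Something like the following (untested) code may work: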

<?php
// array to hold all "seen" lines (the lines are the keys, so lookups are fast)
$lines = array();

// open the csv file
if (($handle = fopen("test.csv", "r")) !== false) {
    // read each line into an array of fields
    while (($data = fgetcsv($handle, 8192, ",")) !== false) {
        // build a "line" from the parsed data
        $line = implode(",", $data);

        // if the line has been seen, skip it
        if (isset($lines[$line])) continue;

        // save the line
        $lines[$line] = true;
    }
    fclose($handle);
}

// build the new content-data
$contents = '';
foreach ($lines as $line => $bool) $contents .= $line . "\r\n";

// save it to a new file
file_put_contents("test_unique.csv", $contents);
?>

This code uses fgetcsv() with a comma as the column delimiter (based on the sample data in your comment on the question).
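One caveat with rebuilding each line by joining the parsed fields with a comma: a field that itself contains a comma loses its quotes on output. Below is a minimal sketch of a variant that writes the rows back with fputcsv() instead. The function name `dedupeCsvRows` and its parameters are hypothetical, chosen only for illustration, and using serialize() to build the lookup key is my own choice rather than anything from the code above.

```php
<?php
// Hypothetical helper: copies $inPath to $outPath, dropping every row whose
// parsed fields have already been seen. Returns the number of rows written.
function dedupeCsvRows(string $inPath, string $outPath): int
{
    $in = fopen($inPath, "r");
    if ($in === false) {
        throw new RuntimeException("Cannot open $inPath for reading");
    }
    $out = fopen($outPath, "w");
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $outPath for writing");
    }

    $seen = array();   // keys identify rows that were already written
    $written = 0;

    // delimiter, enclosure and escape character are passed explicitly
    while (($data = fgetcsv($in, 0, ",", '"', "\\")) !== false) {
        // fgetcsv() returns array(null) for a blank line; skip those
        if ($data === array(null)) continue;

        // serialize() keeps field boundaries, so the rows ["a,b", "c"] and
        // ["a", "b,c"] get different keys
        $key = serialize($data);
        if (isset($seen[$key])) continue;
        $seen[$key] = true;

        // fputcsv() adds quotes again wherever a field needs them
        fputcsv($out, $data, ",", '"', "\\");
        $written++;
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

Calling `dedupeCsvRows("test.csv", "test_unique.csv")` would then produce the same kind of output file as the snippet above, still keeping one lookup entry in memory per unique row.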

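As noted above, storing every line that has already been seen ensures that all duplicate lines in the file are removed, regardless of whether or not they directly follow one another. If the duplicates will always be back-to-back, a simpler (and more memory-conscious) approach is to store only the last-seen line and compare it with the current line.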
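A rough sketch of that "last-seen line" variant, under the assumption that every record sits on a single physical line (no quoted fields with embedded line breaks), so the raw text can be compared without parsing it. The function name `dropConsecutiveDuplicates` is hypothetical.

```php
<?php
// Hypothetical helper: drops a line only when it is identical to the line
// directly before it. Returns the number of lines written.
function dropConsecutiveDuplicates(string $inPath, string $outPath): int
{
    $in = fopen($inPath, "r");
    if ($in === false) {
        throw new RuntimeException("Cannot open $inPath for reading");
    }
    $out = fopen($outPath, "w");
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $outPath for writing");
    }

    $previous = null;  // the only state kept between iterations
    $written = 0;

    while (($line = fgets($in)) !== false) {
        // compare without the line break, so a last line that lacks a
        // trailing newline still matches the line before it
        $current = rtrim($line, "\r\n");
        if ($current === $previous) continue;

        $previous = $current;
        fwrite($out, $current . "\n");
        $written++;
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

Unlike the "seen lines" array, this keeps a line that shows up again later in the file after a different line, so it only fits data where that behaviour is acceptable or wanted.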

UPDATE
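Based on the sample data provided in the comments, the "duplicate lines" are not actually equal: they are similar, but they differ in quite a few columns, so the duplication is on the sku column rather than on the whole row. The similarity between them can be tied to a single column, sku.

Below is an expanded version of the code above. This block parses the first line of the CSV file (the column list) to determine which column contains the sku code. From there, it keeps a list of the unique SKU codes it has seen; if the current line has a "new" code, it writes that line to the new "unique" file using fputcsv():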

<?php
// array to hold all unique SKU codes
$skus = array();

// index of the `sku` column
$skuIndex = -1;

// open the "save-file"
if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
    // open the csv file
    if (($readHandle = fopen("test.csv", "r")) !== false) {
        // read each line into an array
        while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
            if ($skuIndex == -1) {
                // we need to determine what column the "sku" is; this will identify
                // the "unique" rows
                foreach ($data as $index => $column) {
                    if ($column === 'sku') {
                        $skuIndex = $index;
                        break;
                    }
                }
                if ($skuIndex == -1) {
                    echo "Couldn't determine the SKU-column.";
                    die();
                }
                // write the header line to the file, then move on to the data
                // rows (without this, the header would be written twice)
                fputcsv($saveHandle, $data);
                continue;
            }

            // if the sku has been seen, skip it
            if (isset($skus[$data[$skuIndex]])) continue;
            $skus[$data[$skuIndex]] = true;

            // write this line to the file
            fputcsv($saveHandle, $data);
        }
        fclose($readHandle);
    }
    fclose($saveHandle);
}
?>

Overall, this approach is more memory-friendly, because it does not need to keep a copy of every line in memory (only the SKU codes).

Source
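If the key column might not always be called sku, the same idea can be wrapped in a function that takes the column name as a parameter. This is only a sketch: `dedupeCsvByColumn` and its parameters are hypothetical names, and it assumes the first row of the file is the header.

```php
<?php
// Hypothetical helper: keeps the header plus the first row seen for each
// distinct value in $columnName. Returns the number of data rows written.
function dedupeCsvByColumn(string $inPath, string $outPath, string $columnName): int
{
    $in = fopen($inPath, "r");
    if ($in === false) {
        throw new RuntimeException("Cannot open $inPath for reading");
    }

    // the first row is assumed to be the column list
    $header = fgetcsv($in, 0, ",", '"', "\\");
    $index = is_array($header) ? array_search($columnName, $header, true) : false;
    if ($index === false) {
        fclose($in);
        throw new RuntimeException("Column '$columnName' not found in the header");
    }

    // open the output only after the header was validated, so a bad input
    // does not truncate an existing output file
    $out = fopen($outPath, "w");
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $outPath for writing");
    }
    fputcsv($out, $header, ",", '"', "\\");

    $seen = array();   // only the key values are kept in memory
    $written = 0;

    while (($data = fgetcsv($in, 0, ",", '"', "\\")) !== false) {
        // skip blank lines and rows too short to contain the key column
        if (!isset($data[$index])) continue;

        $key = $data[$index];
        if (isset($seen[$key])) continue;
        $seen[$key] = true;

        fputcsv($out, $data, ",", '"', "\\");
        $written++;
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

For the file discussed here, the call would be `dedupeCsvByColumn("test.csv", "test_unique.csv", "sku")`.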

2012-12-28 15:37:25 newfurniturey

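Only consecutive duplicate rows should be removed. If the same row comes back later in the CSV, it should be included. –

@ArnoldDaniels I don't see any such statement in the post. Please let me know where you got this information, and I can update my answer accordingly. – newfurniturey

Look at the 'Expected result'. You can see non-unique rows. –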
