
I have a CSV file containing a varying number of items, on which I'm performing mapping and cell validation using the SuperCSV CsvBeanReader. I have created a bean for each csv file and overridden equals, hashCode and toString for each bean. I'm looking for suggestions on finding and reporting duplicate CSV rows.

What I'm after is advice on what might be the best "all around" implementation approach for identifying duplicate CSV rows: reporting (not deleting) the original csv row number and line content, along with the row number and line content of every duplicate row found. Some of the files can run to hundreds of thousands of rows and over a GB in size, so I'd like to minimize the number of times each file is read, and I was thinking it could be done while the file is open in the CsvBeanReader.

Thank you in advance.

Answer


Given the size of the file and the fact that you want the line content of the original as well as the duplicates, I think the best you can do is 2 passes over the file.

If you only wanted the latest line content for a duplicate, you could get away with 1 pass. Keeping track of the original row's content as well as the content of all duplicates in a single pass would mean storing the content of every row - you'd most likely run out of memory.
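For illustration, a single-pass variant along those lines might look like the sketch below (it assumes the same CsvBeanReader setup as findDuplicates() further down; note it can only print each duplicate's own content, since the original row has already been consumed by the time the duplicate shows up):

// single-pass sketch: remember the first row number seen for each hash
final Map<Integer, Integer> firstRowByHash = new HashMap<Integer, Integer>();

Object o;
while ((o = beanReader.read(beanClass, header, processors)) != null) {
    final Integer hashCode = o.hashCode();
    final Integer firstRow = firstRowByHash.get(hashCode);
    if (firstRow == null) {
        // first time this hash is seen - just record where
        firstRowByHash.put(hashCode, beanReader.getRowNumber());
    } else {
        // the duplicate's own content is still available; the original's isn't
        System.out.println(String.format(
                "row %d duplicates row %d, line content: %s",
                beanReader.getRowNumber(), firstRow,
                beanReader.getUntokenizedRow()));
    }
}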

My solution assumes that two beans with the same hashCode() are duplicates. If you have to use equals() as well, then it gets more complicated.

  • Pass 1: identify the duplicates (recording the row numbers for each duplicate hash)

  • Pass 2: report the duplicates

Pass 1: Identify the duplicates

/**
 * Finds the row numbers with duplicate records (using the bean's hashCode()
 * method). The key of the returned map is the hashCode and the value is the
 * Set of duplicate row numbers for that hashcode.
 *
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @return the map of duplicate rows (by hashcode)
 * @throws IOException
 */
private static Map<Integer, Set<Integer>> findDuplicates(
        final Reader reader, final CsvPreference preference,
        final Class<?> beanClass, final CellProcessor[] processors)
        throws IOException {

    ICsvBeanReader beanReader = null;
    try {
        beanReader = new CsvBeanReader(reader, preference);

        final String[] header = beanReader.getHeader(true);

        // the hashes of any duplicates
        final Set<Integer> duplicateHashes = new HashSet<Integer>();

        // the hashes for each row
        final Map<Integer, Set<Integer>> rowNumbersByHash =
                new HashMap<Integer, Set<Integer>>();

        Object o;
        while ((o = beanReader.read(beanClass, header, processors)) != null) {
            final Integer hashCode = o.hashCode();

            // get the row no's for the hash (create if required)
            Set<Integer> rowNumbers = rowNumbersByHash.get(hashCode);
            if (rowNumbers == null) {
                rowNumbers = new HashSet<Integer>();
                rowNumbersByHash.put(hashCode, rowNumbers);
            }

            // add the current row number to its hash
            final Integer rowNumber = beanReader.getRowNumber();
            rowNumbers.add(rowNumber);

            if (rowNumbers.size() == 2) {
                duplicateHashes.add(hashCode);
            }
        }

        // create a new map with just the duplicates
        final Map<Integer, Set<Integer>> duplicateRowNumbersByHash =
                new HashMap<Integer, Set<Integer>>();
        for (Integer duplicateHash : duplicateHashes) {
            duplicateRowNumbersByHash.put(duplicateHash,
                    rowNumbersByHash.get(duplicateHash));
        }

        return duplicateRowNumbersByHash;

    } finally {
        if (beanReader != null) {
            beanReader.close();
        }
    }
}

As an alternative to this approach, you could use a CsvListReader and make use of getUntokenizedRow().hashCode() - this would calculate a hash based on the raw CSV String (it would be a lot faster, but your data may have subtle differences that would mean it doesn't work).
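A rough sketch of that alternative (the row-collecting logic is assumed to be the same as in findDuplicates() above, just keyed on the raw line instead of the mapped bean):

ICsvListReader listReader = null;
try {
    listReader = new CsvListReader(reader, preference);
    listReader.getHeader(true); // skip the header row

    while (listReader.read() != null) {
        // hash the raw (untokenized) CSV line instead of the mapped bean
        final Integer hashCode = listReader.getUntokenizedRow().hashCode();
        final Integer rowNumber = listReader.getRowNumber();
        // ... collect rowNumber under hashCode exactly as in findDuplicates()
    }
} finally {
    if (listReader != null) {
        listReader.close();
    }
}

Keep in mind that two logically identical rows that differ only in quoting or whitespace would hash differently this way, which is the kind of subtle difference mentioned above.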

Pass 2: Report the duplicates

This method takes the output of the previous method and uses it to quickly identify each duplicate record and the other rows it duplicates.

/**
 * Reports the details of duplicate records.
 *
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @param duplicateRowNumbersByHash
 *            the row numbers of duplicate records
 * @throws IOException
 */
private static void reportDuplicates(final Reader reader,
        final CsvPreference preference, final Class<?> beanClass,
        final CellProcessor[] processors,
        final Map<Integer, Set<Integer>> duplicateRowNumbersByHash)
        throws IOException {

    ICsvBeanReader beanReader = null;
    try {
        beanReader = new CsvBeanReader(reader, preference);

        final String[] header = beanReader.getHeader(true);

        Object o;
        while ((o = beanReader.read(beanClass, header, processors)) != null) {
            final Set<Integer> duplicateRowNumbers =
                    duplicateRowNumbersByHash.get(o.hashCode());
            if (duplicateRowNumbers != null) {
                System.out.println(String.format(
                        "row %d is a duplicate of rows %s, line content: %s",
                        beanReader.getRowNumber(),
                        duplicateRowNumbers,
                        beanReader.getUntokenizedRow()));
            }
        }

    } finally {
        if (beanReader != null) {
            beanReader.close();
        }
    }
}

Sample

Here's an example of how the 2 methods could be used.
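The Bean class itself isn't shown in the question; a minimal sketch that would work with the header and cell processors below might look like this (the field names and types are assumptions inferred from the sample CSV and processors):

import java.util.Date;

public class Bean {

    private Integer a;
    private String b;
    private Date c;

    public Integer getA() { return a; }
    public void setA(Integer a) { this.a = a; }
    public String getB() { return b; }
    public void setB(String b) { this.b = b; }
    public Date getC() { return c; }
    public void setC(Date c) { this.c = c; }

    // two beans with the same field values produce the same hashCode()
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((a == null) ? 0 : a.hashCode());
        result = prime * result + ((b == null) ? 0 : b.hashCode());
        result = prime * result + ((c == null) ? 0 : c.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof Bean)) {
            return false;
        }
        final Bean other = (Bean) obj;
        return (a == null ? other.a == null : a.equals(other.a))
                && (b == null ? other.b == null : b.equals(other.b))
                && (c == null ? other.c == null : c.equals(other.c));
    }
}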

// rows (2,4,8) and (3,7) are duplicates
private static final String CSV = "a,b,c\n" + "1,two,01/02/2013\n"
        + "2,two,01/02/2013\n" + "1,two,01/02/2013\n"
        + "3,three,01/02/2013\n" + "4,four,01/02/2013\n"
        + "2,two,01/02/2013\n" + "1,two,01/02/2013\n";

private static final CellProcessor[] PROCESSORS = { new ParseInt(),
        new NotNull(), new ParseDate("dd/MM/yyyy") };

public static void main(String[] args) throws IOException {

    final Map<Integer, Set<Integer>> duplicateRowNumbersByHash = findDuplicates(
            new StringReader(CSV), CsvPreference.STANDARD_PREFERENCE,
            Bean.class, PROCESSORS);

    reportDuplicates(new StringReader(CSV),
            CsvPreference.STANDARD_PREFERENCE, Bean.class, PROCESSORS,
            duplicateRowNumbersByHash);
}

Output:

row 2 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013 
row 3 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013 
row 4 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013 
row 7 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013 
row 8 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013