2016-10-24 84 views

Optimize CSV parsing for speed

I'm working on a "program" that reads data from two large CSV files (line by line), compares an array element from each file, and, when a match is found, writes the data I need to a third file. The only problem I have is that it is extremely slow: it reads 1-2 lines per second, which is far too slow considering I have millions of records. Any ideas on how to make it faster? Here is my code:

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class ReadWriteCsv {

    public static void main(String[] args) throws IOException {

        FileInputStream inputStream = null;
        FileInputStream inputStream2 = null;
        Scanner sc = null;
        Scanner sc2 = null;
        String csvSeparator = ",";
        String line;
        String line2;
        String path = "D:/test1.csv";
        String path2 = "D:/test2.csv";
        String path3 = "D:/newResults.csv";
        String[] columns;
        String[] columns2;
        boolean matchFound = false;
        int count = 0;
        StringBuilder builder = new StringBuilder();

        FileWriter writer = new FileWriter(path3);

        try {
            // specifies where to take the files from
            inputStream = new FileInputStream(path);
            inputStream2 = new FileInputStream(path2);

            // creating a scanner for file1
            sc = new Scanner(inputStream, "UTF-8");

            // while there is another line available do:
            while (sc.hasNextLine()) {
                count++;
                // storing the current line in the temporary variable "line"
                line = sc.nextLine();
                System.out.println("Number of lines read so far: " + count);
                // defines columns[] as the line split by ","
                columns = line.split(",");
                inputStream2 = new FileInputStream(path2);
                sc2 = new Scanner(inputStream2, "UTF-8");

                // checks if there is a line available in file2 and goes into the
                // while loop, reading file2
                while (!matchFound && sc2.hasNextLine()) {
                    line2 = sc2.nextLine();
                    columns2 = line2.split(",");

                    if (columns[3].equals(columns2[1])) {
                        matchFound = true;
                        builder.append(columns[3]).append(csvSeparator);
                        builder.append(columns[1]).append(csvSeparator);
                        builder.append(columns2[2]).append(csvSeparator);
                        builder.append(columns2[3]).append("\n");
                        String result = builder.toString();
                        writer.write(result);
                    }
                }
                builder.setLength(0);
                sc2.close();
                matchFound = false;
            }

            if (sc.ioException() != null) {
                throw sc.ioException();
            }

        } finally {
            // then I close my input streams, scanners and writer
            if (sc != null) {
                sc.close();
            }
            if (sc2 != null) {
                sc2.close();
            }
            writer.close();
        }
    }
}

It looks like you're re-reading the entire second file for every line of the first. *Of course* that will be slow for large files. – azurefrog


Can both files fit in memory? If so, just read the data once and load it into an in-memory data structure (array, list, etc.). I/O operations are very expensive compared to in-memory operations. – Yuri


@azurefrog How would I do that? New to programming, sorry. – Noobinator

Answers


Use an existing CSV library rather than rolling your own. It will be far more robust than what you have now.

However, your problem is not CSV parsing speed. The problem is that your algorithm is O(n^2): for every line in the first file, you scan the entire second file. This kind of algorithm blows up very quickly as the amount of data grows, and when you have millions of rows you will run into trouble. You need a better algorithm.

The other problem is that you re-parse the second file on every scan. You should at least read it into memory as an ArrayList or something at the start of the program, so you only load and parse it once.
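A minimal sketch of that idea: parse the second file once into a HashMap keyed by its column 1, then stream the first file and look each key up in O(1), turning the O(n^2) join into O(n + m). The class name `CsvJoin`, the helper `join`, and the use of `List<String>` inputs (instead of the question's file streams) are my own illustration choices, and the naive `split(",")` keeps the question's assumption that fields contain no embedded commas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CsvJoin {

    // Joins rows of file1 (key in column 3) against rows of file2 (key in
    // column 1), mirroring the output columns the question's code writes.
    // file2 is indexed once up front instead of being rescanned per row.
    static List<String> join(List<String> file1Lines, List<String> file2Lines) {
        // Build the lookup table: key from column 1 -> full row of file2.
        Map<String, String[]> index = new HashMap<>();
        for (String line : file2Lines) {
            String[] cols = line.split(",");
            index.put(cols[1], cols);
        }

        // Single pass over file1: one O(1) lookup per row.
        List<String> out = new ArrayList<>();
        for (String line : file1Lines) {
            String[] cols = line.split(",");
            String[] match = index.get(cols[3]);
            if (match != null) {
                out.add(cols[3] + "," + cols[1] + "," + match[2] + "," + match[3]);
            }
        }
        return out;
    }
}
```

If both files are too large for memory, the same effect can be had by indexing only the smaller file, or by sorting both files and doing a merge join.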


Use the univocity-parsers CSV parser, as it shouldn't take more than a couple of seconds to process two files with 1 million rows each:

import java.io.File;
import java.util.Arrays;

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public void diff(File leftInput, File rightInput) {
    CsvParserSettings settings = new CsvParserSettings(); // many config options here, check the tutorial

    CsvParser leftParser = new CsvParser(settings);
    CsvParser rightParser = new CsvParser(settings);

    leftParser.beginParsing(leftInput);
    rightParser.beginParsing(rightInput);

    String[] left;
    String[] right;

    int row = 0;
    while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) {
        row++;
        if (!Arrays.equals(left, right)) {
            System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right));
        }
    }

    leftParser.stopParsing();
    rightParser.stopParsing();
}

Disclosure: I am the author of this library. It is open source and free (Apache 2.0 license).