2017-06-01

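Python: write a CSV with matching data from multiple CSV files

I need to match data from multiple CSV files. I wrote a script that works on simple data, but it is very slow when analysing 4000 rows. I also tried set(a) & set(b), but that did not return the matching data from each file. The output files must contain the matching data from all files.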

Script:

import csv

# `files` (the input paths) and `saved_file` (the output path) are defined elsewhere.
for file_1 in files:
    with open(file_1, 'rt') as f1, open(saved_file, 'w') as f3:
        reader1 = csv.reader(f1, delimiter=';')
        writer = csv.writer(f3, delimiter=';', lineterminator='\n')

        for row1 in reader1:
            # Every input file is reopened and rescanned for every row, which is slow.
            for file_2 in files:
                with open(file_2, 'rt') as f2:
                    reader2 = csv.reader(f2, delimiter=';')
                    if row1 in reader2:
                        writer.writerow(row1)

The data I want to match looks like this:

File_1:

May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18% 
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39% 
May 18, 2017;12,590.06;12,608.19;12,634.26;12,489.95;123.48M;-0.33% 
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35% 
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29% 

File_2:

May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28% 
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94% 
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69% 
May 16, 2017;1.1082;1.0975;1.1098;1.0971;0.97% 
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40% 

Output:

saved_file_1:

May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18% 
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39% 
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35% 
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29% 

saved_file_2:

May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28% 
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94% 
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69% 
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40% 

Answers

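Without resorting to pandas, you could do it this way, which may be what you had in mind.

First go through each file, collecting only the dates into a separate list per file. Then find the intersection of those lists, treating them as sets. Finally go through each file again, writing out every record whose date is in the intersection.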

def get_dates(one_file):
    # Collect the distinct dates (the text before the first ';') in file order.
    one_file_dates = []
    with open(one_file) as the_file:
        for line in the_file.readlines():
            the_date = line[:line.find(';')]
            if the_date not in one_file_dates:
                one_file_dates.append(the_date)
    return one_file_dates

# Dates that occur in both files.
common_dates = set(get_dates('file_1.csv')).intersection(set(get_dates('file_2.csv')))

print('*** processing file_1')
with open('file_1.csv') as the_file:
    for line in the_file.readlines():
        if line[:line.find(';')] in common_dates:
            print(line.strip())

print('*** processing file_2')
with open('file_2.csv') as the_file:
    for line in the_file.readlines():
        if line[:line.find(';')] in common_dates:
            print(line.strip())

Result:

*** processing file_1 
May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18% 
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39% 
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35% 
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29% 
*** processing file_2 
May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28% 
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94% 
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69% 
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40% 

Edit: new code in response to the comment.

def get_dates(one_file):
    # Collect the distinct dates (the text before the first ';') in file order.
    one_file_dates = []
    with open(one_file) as the_file:
        for line in the_file.readlines():
            the_date = line[:line.find(';')]
            if the_date not in one_file_dates:
                one_file_dates.append(the_date)
    return one_file_dates

file_list = ['file_1.csv', 'file_2.csv']  # add more file names here

# Start with the dates of the first file and narrow them down file by file.
common_dates = set(get_dates(file_list[0]))
for file_name in file_list[1:]:
    common_dates = common_dates.intersection(set(get_dates(file_name)))

for file_name in file_list:
    print('*** processing ', file_name)
    with open(file_name) as the_file:
        for line in the_file.readlines():
            if line[:line.find(';')] in common_dates:
                print(line.strip())

It works for 2 files, but I need to search a list of files and match every file. –

See the edit. –
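
The code in the edit prints the matching records, while the question asks for them to be saved to files. Below is a minimal sketch of that last step using the csv module. It is an illustration under stated assumptions, not part of the original answer: the function names read_rows and write_common_rows, the out_dir parameter and the saved_<name> output naming are all hypothetical, and the date is assumed to be the first field of every row.

```python
import csv
import os


def read_rows(path):
    """Read one ';'-delimited file into a list of rows, skipping blank lines."""
    with open(path, newline='') as f:
        return [row for row in csv.reader(f, delimiter=';') if row]


def write_common_rows(file_list, out_dir):
    """For each input file, save only the rows whose date occurs in every file."""
    if not file_list:
        return []
    # Each file is read exactly once and kept in memory.
    rows_by_file = {path: read_rows(path) for path in file_list}
    # Assumption: the date is the first field of every row.
    date_sets = [{row[0] for row in rows} for rows in rows_by_file.values()]
    common_dates = set.intersection(*date_sets)
    saved = []
    for path, rows in rows_by_file.items():
        # Hypothetical naming scheme: saved_<original file name>.
        out_path = os.path.join(out_dir, 'saved_' + os.path.basename(path))
        with open(out_path, 'w', newline='') as out:
            writer = csv.writer(out, delimiter=';', lineterminator='\n')
            writer.writerows(row for row in rows if row[0] in common_dates)
        saved.append(out_path)
    return saved
```

Calling write_common_rows(['file_1.csv', 'file_2.csv'], '.') would then write saved_file_1.csv and saved_file_2.csv into the current directory. Because every file is read once and the dates are compared as sets, the cost grows roughly linearly with the total number of rows rather than with rows times files.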

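The performance problem arises because you have multiple file readers and writers working inside a single for loop.

I suggest you first use Pandas to import the data from File_1 and File_2 into data frames. You can do that like this: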

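import pandas as pd
# The sample data is ';'-separated and has no header row.
df1 = pd.read_csv("file_1.csv", sep=";", header=None)
df2 = pd.read_csv("file_2.csv", sep=";", header=None)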

Then you can apply your computation to the imported data and save the result to CSV again like this:

dfOut.to_csv(file_name, sep='\t') 

You need to take care to use the correct CSV separator here (the sample data uses ';').

Reading the files is really fast. How do I compare the rows in the two files and save the rows to a CSV file? –
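
For the comparison step asked about in the comment above, here is a minimal pandas sketch. It is one possible approach rather than the answerer's own method. The helper names keep_common_dates and match_csv_files are hypothetical, the date is assumed to sit in column 0 of files read with header=None, and dtype=str is used so that values such as 1.1200 are written back unchanged.

```python
import pandas as pd


def keep_common_dates(frames):
    """Return the frames restricted to rows whose date appears in every frame."""
    if not frames:
        return []
    # Assumption: column 0 holds the date, as in the sample data.
    common = set(frames[0][0])
    for df in frames[1:]:
        common &= set(df[0])
    return [df[df[0].isin(common)] for df in frames]


def match_csv_files(in_paths, out_paths):
    """Read each input file, filter it to the common dates, and save the result."""
    frames = [pd.read_csv(p, sep=';', header=None, dtype=str) for p in in_paths]
    for df, out_path in zip(keep_common_dates(frames), out_paths):
        df.to_csv(out_path, sep=';', header=False, index=False)
```

For example, match_csv_files(['file_1.csv', 'file_2.csv'], ['saved_file_1.csv', 'saved_file_2.csv']) would save one filtered file per input. Filtering with isin keeps each file's own columns intact, which suits files whose column counts differ, as they do in the sample data.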