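Comparing multiple CSV files with Python

I'm looking to compare multiple CSV files with Python and output a report. The number of CSV files to compare will vary, so I pull the list of files from a directory. Each CSV has 2 columns: the first is an area code and exchange, the second is a price. For example: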

1201007,0.006 
1201032,0.0119 
1201040,0.0106 
1201200,0.0052 
1201201,0.0345

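The files won't all contain the same area codes and exchanges, so rather than comparing line by line I need to use the first field as a key. Then I need to generate a report along these lines: file1 had 200 mismatches with file2, 371 lower prices than file2, and 562 higher prices than file2. I need this for every file against every other file, so the step is repeated for file1 against file3, file4, ..., then file2 against file3, and so on. I would consider myself a relative noob at Python. Below is the code I have so far; it only grabs the files in the directory, prints the prices from all files, and prints a total count.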

import csv
import os

count = 0
# dir containing CSV files
csvdir = "tariff_compare"
dirList = os.listdir(csvdir)
# index all files for later use
for idx, fname in enumerate(dirList):
    print(fname)
    # os.listdir returns bare file names, so join them with the directory
    dic_read = csv.reader(open(os.path.join(csvdir, fname)))
    for row in dic_read:
        key = row[0]
        price = row[1]
        print(price)
        count += 1
print(count)

Source

2012-06-25 user1480902

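This assumes that all of your data fits in memory; if it does not, you will have to load only a subset of the files at a time, or even just two files at a time.

It does the comparison and writes the output to a summary.csv file, one row per pair of files.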

import csv
import glob
import os
import itertools

def get_data(fname):
    """
    Load a .csv file
    Returns a dict of {'exchange':float(price)}
    """
    with open(fname, newline='') as inf:
        # csv.reader already splits each line into a list of fields
        rows = (row for row in csv.reader(inf) if row)
        return {row[0].strip(): float(row[1]) for row in rows}

def do_compare(a_name, a_data, b_name, b_data):
    """
    Compare two data files of {'key': float(value)}

    Returns a list of
     - the name of the first file
     - the name of the second file
     - the number of keys in A which are not in B
     - the number of keys in B which are not in A
     - the number of values in A less than the corresponding value in B
     - the number of values in A equal to the corresponding value in B
     - the number of values in A greater than the corresponding value in B
    """
    a_keys = set(a_data)
    b_keys = set(b_data)

    unique_to_a = len(a_keys - b_keys)
    unique_to_b = len(b_keys - a_keys)

    lt, eq, gt = 0, 0, 0
    pairs = ((a_data[key], b_data[key]) for key in a_keys & b_keys)
    for ai, bi in pairs:
        if ai < bi:
            lt += 1
        elif ai == bi:
            eq += 1
        else:
            gt += 1

    return [a_name, b_name, unique_to_a, unique_to_b, lt, eq, gt]

def main():
    os.chdir('d:/tariff_compare')

    # load data from csv files (skip the report left over from an earlier run)
    data = {}
    for fname in glob.glob("*.csv"):
        if fname != 'summary.csv':
            data[fname] = get_data(fname)

    # do comparison
    files = sorted(data)
    with open('summary.csv', 'w', newline='') as outf:
        outcsv = csv.writer(outf)
        outcsv.writerow(["File A", "File B", "Unique to A", "Unique to B", "A<B", "A==B", "A>B"])
        for a, b in itertools.combinations(files, 2):
            outcsv.writerow(do_compare(a, data[a], b, data[b]))

if __name__ == "__main__":
    main()

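Edit: user1277476 makes a good point; if you pre-sort your files by exchange (or they are already in sorted order), you can iterate through all of the files simultaneously, keeping nothing in memory except the current line of each file.

This would let you do a more in-depth comparison for each exchange entry: the number of files containing a value, the top or bottom N values, and so on.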
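A minimal sketch of that simultaneous walk, assuming Python 3.5 or later, that every file is already sorted by its first column, and that the keys sort the same way as strings (true for fixed-width codes like the sample data). The helper names `read_rows` and `merged_by_exchange` are made up for illustration; `heapq.merge` and `itertools.groupby` are the standard-library pieces doing the work.

```python
import csv
import heapq
import itertools

def read_rows(fname):
    """Yield (exchange, price, fname) tuples from one CSV sorted by exchange."""
    with open(fname, newline='') as inf:
        for row in csv.reader(inf):
            if row:  # ignore blank lines
                yield row[0].strip(), float(row[1]), fname

def merged_by_exchange(fnames):
    """
    Walk all files in step and yield (exchange, {fname: price}).

    heapq.merge pulls one row at a time from each file, so only the
    current row of every file is held in memory.
    """
    streams = [read_rows(fname) for fname in fnames]
    merged = heapq.merge(*streams, key=lambda item: item[0])
    for exchange, group in itertools.groupby(merged, key=lambda item: item[0]):
        yield exchange, {fname: price for _, price, fname in group}
```

Each yielded dict tells you which files carry that exchange and at what price, so `len(prices)` gives the number of files containing it and `min(prices.values())` the lowest price.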

Source

2012-06-25 20:27:34

I'll implement it as soon as I can, but it looks like exactly what I need. Thanks! – user1480902

If your files are small, you can do something basic like this:

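import csv
import os

csvDir = "tariff_compare"  # directory containing the CSV files

data = dict()
for fname in os.listdir(csvDir):
    with open(os.path.join(csvDir, fname), newline='') as fin:
        # key on the first column and convert the price to a float
        data[fname] = dict((row[0], float(row[1])) for row in csv.reader(fin) if row)
# All the data is now loaded into your data dictionary
# data -> {'file1.csv': {'1201007': 0.006, '1201032': 0.0119, '1201040': 0.0106}, 'file2.csv': ...}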

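Now everything is readily available for you to compare the keys in your data dictionary and their associated values. Otherwise, if you have larger data sets that may not fit in memory, you may want to work on only two files at a time, with one of them held in memory. You can use itertools.combinations to build the list of filename pairs: calling combinations(filenames, 2) yields every unique pair of filenames exactly once.

From there you can still optimize further, but this should get you going.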
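As a rough sketch of the two-files-at-a-time idea (Python 3, assuming no key is repeated within a file), the following holds one file in a dict and streams the other from disk. The names `load_prices`, `compare_streaming` and `compare_all` are invented for this example.

```python
import csv
import itertools

def load_prices(fname):
    """Read one two-column CSV into a {exchange: price} dict."""
    with open(fname, newline='') as inf:
        return {row[0].strip(): float(row[1]) for row in csv.reader(inf) if row}

def compare_streaming(a_data, b_name):
    """Compare in-memory file A against file B, read one row at a time."""
    matched = only_in_b = lower = higher = 0
    with open(b_name, newline='') as inf:
        for row in csv.reader(inf):
            if not row:
                continue
            key, b_price = row[0].strip(), float(row[1])
            if key not in a_data:
                only_in_b += 1
                continue
            matched += 1
            if a_data[key] < b_price:
                lower += 1
            elif a_data[key] > b_price:
                higher += 1
    only_in_a = len(a_data) - matched  # valid only if B has no repeated keys
    return only_in_a, only_in_b, lower, higher

def compare_all(filenames):
    """Yield (a, b, only_in_a, only_in_b, a_lower, a_higher) for every pair."""
    current_name, current_data = None, None
    for a_name, b_name in itertools.combinations(sorted(filenames), 2):
        # combinations keeps pairs with the same first file together,
        # so file A only needs reloading when it changes
        if a_name != current_name:
            current_name, current_data = a_name, load_prices(a_name)
        yield (a_name, b_name) + compare_streaming(current_data, b_name)
```

Each yielded tuple can be written straight to a report with `csv.writer(...).writerow(...)`.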

Source

2012-06-25 20:31:12

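I would sort the files before comparing them, then do the comparison with an algorithm similar to the merge step of mergesort.

You still need to think about what to do with duplicate records, e.g. what if file1 has 1234567,0.1 twice and file2 does as well? What if file1 has it 3 times and file2 has it 5 times, or vice versa?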

http://en.literateprograms.org/Merge_sort_%28Python%29 
http://stromberg.dnsalias.org/~strombrg/sort-comparison/ 
http://en.wikipedia.org/wiki/Merge_sort
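To make the merge-step idea concrete, here is one possible sketch in Python 3. It assumes both inputs are sorted by key and that no key repeats within an input, so it does not settle the duplicate question raised above. `merge_compare` is a hypothetical name, and the inputs are any iterables of (key, price) pairs, for example rows read lazily from two sorted CSV files.

```python
def merge_compare(a_rows, b_rows):
    """
    Compare two iterables of (key, price) pairs, each sorted by key.

    Returns (only_in_a, only_in_b, lower, equal, higher), where the last
    three count shared keys whose price in A is below, equal to, or
    above the price in B.
    """
    a_iter, b_iter = iter(a_rows), iter(b_rows)
    a, b = next(a_iter, None), next(b_iter, None)
    only_in_a = only_in_b = lower = equal = higher = 0

    while a is not None and b is not None:
        if a[0] < b[0]:        # key exists only in A; advance A
            only_in_a += 1
            a = next(a_iter, None)
        elif a[0] > b[0]:      # key exists only in B; advance B
            only_in_b += 1
            b = next(b_iter, None)
        else:                  # same key on both sides; compare prices
            if a[1] < b[1]:
                lower += 1
            elif a[1] > b[1]:
                higher += 1
            else:
                equal += 1
            a, b = next(a_iter, None), next(b_iter, None)

    # whatever is left on one side has no partner on the other
    while a is not None:
        only_in_a += 1
        a = next(a_iter, None)
    while b is not None:
        only_in_b += 1
        b = next(b_iter, None)

    return only_in_a, only_in_b, lower, equal, higher
```

Because it advances through both inputs in a single pass, memory use stays constant regardless of file size.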

Source

2012-06-25 20:55:11 user1277476

They are already pre-sorted. As for duplicates, given the type of data there are definitely no duplicates within a single file. – user1480902
