2015-07-12 58 views
-1

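I want to find duplicate values in the "reference" column and, among the rows found, keep only the row with the largest value in the "amount" column. How can I merge rows that have duplicate values and keep the maximum value of a different column in Python?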

Current:

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |      9 |      45 | ye      |
| test1     |    200 |      45 | agag    |
| test1     |      1 |      45 | aaa     |
| test2     |     99 |      45 | bbab    |
| test1     |     11 |      45 | value   |
+-----------+--------+---------+---------+

Expected:

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |    200 |      45 | agag    |
| test2     |     99 |      45 | bbab    |
+-----------+--------+---------+---------+

Please share any pointers for this situation.

+2

What format is your data in, and what have you tried so far? – Kasramvd

+0

Please tell us which data type you are using. You can basically use a group-by and take the maximum from each group. – vdkotian

+0

It is a CSV file. I am trying to find the duplicate rows. I will keep digging. – serte

Answers

0

Something like the following would be a good start:

import csv
import collections

with open("mydata.csv", "r", newline="") as f_input:
    csv_input = csv.reader(f_input)
    # Assuming the first row contains the heading names; otherwise remove this line.
    headings = next(csv_input)
    d_max_rows = collections.OrderedDict()

    for cols in csv_input:
        reference = cols[0]
        if reference in d_max_rows:
            cur_max = d_max_rows[reference]
            # Replace the stored row when this amount is at least as large.
            if int(cols[1]) >= int(cur_max[1]):
                d_max_rows[reference] = cols
        else:
            d_max_rows[reference] = cols

lrows = [headings] + list(d_max_rows.values())

for reference, amount, col3, col4 in lrows:
    print("%-15s %-10s %-10s %-10s" % (reference, amount, col3, col4))

This will give you the following output:

reference       amount     column3    column4
test1           200        45         agag
test2           99         45         bbab
+0

@Martin Evans It works. Thank you. – serte

+0

That's good news. Don't forget to upvote any helpful replies and to accept your preferred answer. –

0

Here is some code that does what you want:

from collections import namedtuple
import csv

Record = namedtuple('Record', 'reference amount column3 column4')

no_dups = {}
with open('references.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # Assumes the first row holds the headings; keep it out of the comparison.
    header = next(reader)
    for rec in map(Record._make, reader):
        # Keep the row if its reference is new or its amount is larger.
        if (rec.reference not in no_dups or
                int(no_dups[rec.reference].amount) < int(rec.amount)):
            no_dups[rec.reference] = rec

with open('references_out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(no_dups.values())
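
The same keep-the-largest-row idea can also be written with csv.DictReader, which addresses columns by name instead of by position. This is only a minimal sketch, assuming the column names from the question; the function name max_rows_by_reference and the inline SAMPLE string (standing in for a real file) are illustrative, not part of the original answer.

```python
import csv
import io


def max_rows_by_reference(lines, key="reference", value="amount"):
    """Return one row per `key`, keeping the row with the largest `value`."""
    best = {}
    for row in csv.DictReader(lines):
        current = best.get(row[key])
        # Strict '>' keeps the first row seen when amounts tie.
        if current is None or int(row[value]) > int(current[value]):
            best[row[key]] = row
    return list(best.values())


# Inline copy of the question's sample data (assumption: the file has a header row).
SAMPLE = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""

rows = max_rows_by_reference(io.StringIO(SAMPLE))
for row in rows:
    print(row["reference"], row["amount"], row["column3"], row["column4"])
```

To run it against a real file, pass an open file object (opened with newline='') in place of the io.StringIO wrapper.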
0

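pandas is a great Python module for working with tabular data. It is very much like the R language and provides a kind of in-memory database. For your example it is as simple as this: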

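import pandas as pd

df = pd.read_csv('test.csv')
# Largest amount per reference, with 'reference' kept as a regular column.
a = df.groupby('reference', as_index=False)['amount'].max()
# Match on both columns so an equal amount under another reference is not picked up.
answer = df.merge(a, on=['reference', 'amount'])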

And to save the result back to CSV:

answer.to_csv('out.csv', index=False) 

This assumes test.csv is your data file, like this:

reference,amount,column3,column4 
test1,9,45,ye 
test1,200,45,agag 
test1,1,45,aaa 
test2,99,45,bbab 
test1,11,45,value 
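
Merging on the per-group maximum can return more than one row for a reference when several rows tie for the largest amount. If exactly one row per reference is wanted, one possible alternative is groupby with idxmax, which returns the index label of the first row holding each group's maximum. The sketch below is a hedged illustration, not the answer's own method; it reads the sample data from an inline string (an assumption made so the example is self-contained) rather than from test.csv.

```python
import io

import pandas as pd

# Inline copy of the sample data shown above (stands in for test.csv).
SAMPLE = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# For each reference, find the index label of the row with the largest amount.
best_idx = df.groupby("reference")["amount"].idxmax()

# Select those rows and renumber the index from zero.
result = df.loc[best_idx].reset_index(drop=True)
print(result)
```

The result can be saved the same way as before, with result.to_csv('out.csv', index=False).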