How to merge rows with duplicate values and keep the max value of a different column in Python?


I want to find duplicate values in the "reference" column and, among the rows found, keep only the one with the largest value in the "amount" column.

Current:

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |      9 |      45 | ye      |
| test1     |    200 |      45 | agag    |
| test1     |      1 |      45 | aaa     |
| test2     |     99 |      45 | bbab    |
| test1     |     11 |      45 | value   |
+-----------+--------+---------+---------+

Expected:

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |    200 |      45 | agag    |
| test2     |     99 |      45 | bbab    |
+-----------+--------+---------+---------+

Please share any pointers for this case.

Source

2015-07-12 serte

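What is your data format, and what have you tried so far? – Kasramvd

Please tell us which data type you are using. You can basically use group by and find the maximum value in each group. – vdkotian

It is a CSV file. I have been trying to find the duplicate rows. I'll keep digging. – serte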
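
The group-by idea from the comment above can be tried with the standard library alone. This is only a minimal sketch, assuming the CSV has a header row and the four columns shown in the question; the inline `SAMPLE` data and the function name `max_rows_by_reference` are made up for illustration.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical inline sample mirroring the table in the question.
SAMPLE = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""


def max_rows_by_reference(csv_text):
    """Return one row per reference: the row with the largest amount."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # groupby only groups adjacent items, so sort by the key first.
    rows.sort(key=itemgetter("reference"))
    return [
        max(group, key=lambda row: int(row["amount"]))
        for _, group in groupby(rows, key=itemgetter("reference"))
    ]


result = max_rows_by_reference(SAMPLE)
for row in result:
    print(row["reference"], row["amount"], row["column3"], row["column4"])
```

Sorting first is what makes `groupby` usable here, at the cost of losing the original row order across references.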

Something like the following would be a good start:

import collections
import csv

with open("mydata.csv", "r", newline="") as f_input:
    csv_input = csv.reader(f_input)
    # Assuming the first row contains the heading names, otherwise remove.
    headings = next(csv_input)
    # Maps each reference to the row with the largest amount seen so far.
    d_max_rows = collections.OrderedDict()

    for cols in csv_input:
        reference = cols[0]
        if reference in d_max_rows:
            cur_max = d_max_rows[reference]
            # On a tie, the later row replaces the earlier one.
            if int(cols[1]) >= int(cur_max[1]):
                d_max_rows[reference] = cols
        else:
            d_max_rows[reference] = cols

lrows = [headings] + list(d_max_rows.values())

for reference, amount, col3, col4 in lrows:
    print("%-15s %-10s %-10s %-10s" % (reference, amount, col3, col4))

This will give you the following output:

reference       amount     column3    column4
test1           200        45         agag
test2           99         45         bbab
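
One detail of the loop above is worth spelling out: because the comparison is `>=`, a later row replaces an earlier one when two amounts tie. The sketch below illustrates that choice under stated assumptions; the helper `keep_max` and the rows are made up for the example and are not part of the answer.

```python
import operator


def keep_max(rows, compare):
    """Keep one row per reference (column 0), chosen by amount (column 1)."""
    best = {}
    for cols in rows:
        reference = cols[0]
        if reference not in best or compare(int(cols[1]), int(best[reference][1])):
            best[reference] = cols
    return list(best.values())


# Made-up rows with a tie on amount for "test1".
rows = [
    ["test1", "200", "45", "first"],
    ["test1", "200", "45", "second"],
    ["test2", "99", "45", "bbab"],
]

keeps_last = keep_max(rows, operator.ge)   # `>=`: the later tied row wins
keeps_first = keep_max(rows, operator.gt)  # `>`: the earlier tied row wins
print(keeps_last[0][3], keeps_first[0][3])
```

Which behavior is right depends on the data; if ties should keep every row, neither comparison is enough on its own.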

Source

2015-07-12 14:57:38

@Martin Evans It works. Thank you. – serte

That's good news. Don't forget to upvote any useful replies and accept your preferred answer. –

Here is some code that does what you want:

from collections import namedtuple
import csv

Record = namedtuple('Record', 'reference amount column3 column4')

# Maps each reference to the record with the largest amount seen so far.
no_dups = {}
with open('references.csv', 'r', newline='') as csvfile:
    for rec in map(Record._make, csv.reader(csvfile)):
        # A header row, if present, is stored on first sight and never
        # compared numerically, so it is written out as-is.
        if (rec.reference not in no_dups or
                int(no_dups[rec.reference].amount) < int(rec.amount)):
            no_dups[rec.reference] = rec

with open('references_out.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(no_dups.values())

Source

2015-07-12 15:20:01 martineau

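pandas is a great Python module for working with tabular data. It is a lot like the R language and gives you a kind of in-memory database. For your example it is as simple as: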

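import pandas as pd

df = pd.read_csv('test.csv')
# Largest amount for each reference, with 'reference' kept as a column.
a = df.groupby('reference', as_index=False)['amount'].max()
# Match on both columns so an equal amount under another reference is not picked up.
answer = df.merge(a, on=['reference', 'amount'])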

And to save the result back to CSV:

answer.to_csv('out.csv', index=False)

This assumes test.csv is your data file and looks like this:

reference,amount,column3,column4 
test1,9,45,ye 
test1,200,45,agag 
test1,1,45,aaa 
test2,99,45,bbab 
test1,11,45,value
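
As a possible alternative to the group-then-merge approach, the row with the largest amount per reference can also be selected directly with `idxmax`. This is a sketch rather than a drop-in replacement: it reads the same sample data from an inline string (the `SAMPLE` name is made up here) instead of from test.csv, and it keeps only the first row when amounts tie.

```python
import io

import pandas as pd

# Same sample data as the listing above, read from a string instead of a file.
SAMPLE = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# For each reference, idxmax gives the index label of the row holding the
# largest amount (the first such row on ties); .loc then pulls those rows.
answer = df.loc[df.groupby("reference")["amount"].idxmax()].reset_index(drop=True)
print(answer)
```

The merge-based version keeps every tied row, so the two approaches differ only when a reference has several rows sharing its maximum amount.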

Source

2015-07-12 15:39:30 fivetentaylor
