Find duplicate rows, max data

I have a CSV file like this:

Date of event  Name  Date of birth 
06.01.1986   John Smit 23.08.1996 
18.12.1996   Barbara D 01.08.1965 
12.12.2001   Barbara D 01.08.1965 
17.10.1994   John Snow 20.07.1965

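I need to keep one row per unique combination of "Name" and "Date of birth" (possibly together with some other columns), namely the row with the MAX date of event.

So the resulting CSV file should look like this: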

Date of event  Name  Date of birth 
06.01.1986   John Smit 23.08.1996 
12.12.2001   Barbara D 01.08.1965 
17.10.1994   John Snow 20.07.1965

How can I do this? I have no idea where to start.

Source

2017-08-30 Alexandr Lebedev

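"Find the unique rows" or "find a duplicated row"? –

Find the unique rows. I also need to combine this solution with the source columns... and write the result to a CSV –

What does "combine with the source" mean? The source is the only source; if you combine it with non-unique rows, the result is polluted. –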

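Formatting

Since the column names contain spaces, it is better to separate the columns with commas.

Algorithm

You can do this with the pandas library: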
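As a rough sketch of that conversion, the snippet below turns the whitespace-separated layout from the question into comma-delimited text. It assumes every data row starts and ends with a dd.mm.yyyy date and that everything in between is the name; the helper name `to_comma_delimited` and the pattern are illustrative, not part of the original answer.

```python
import csv
import io
import re

# Assumed row layout: dd.mm.yyyy, free-text name, dd.mm.yyyy
ROW_PATTERN = re.compile(
    r"^(\d{2}\.\d{2}\.\d{4})\s+(.+?)\s+(\d{2}\.\d{2}\.\d{4})$"
)


def to_comma_delimited(raw_text):
    """Convert the whitespace-separated sample layout to comma-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Date of event", "Name", "Date of birth"])
    for line in raw_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:  # the header and blank lines do not match and are skipped
            writer.writerow(match.groups())
    return out.getvalue()


raw = """Date of event  Name  Date of birth
06.01.1986   John Smit 23.08.1996
18.12.1996   Barbara D 01.08.1965
"""
converted = to_comma_delimited(raw)
print(converted)
```

If the real file has more columns or names that end in a date-like token, the pattern would need to be adapted.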

import tempfile 
import pandas 

# create a temporary csv file with your data (comma delimited) 
temp_file_name = None 
with tempfile.NamedTemporaryFile('w', delete=False) as f: 
    f.write("""Date of event,Name,Date of birth
06.01.1986,John Smit,23.08.1996
18.12.1996,Barbara D,01.08.1965
12.12.2001,Barbara D,01.08.1965
17.10.1994,John Snow,20.07.1965""")
    temp_file_name = f.name 

# read the csv data using the pandas library, specify columns with dates 
data_frame = pandas.read_csv(
    temp_file_name, 
    parse_dates=[0,2], 
    dayfirst=True, 
    delimiter=',' 
) 

# use groupby and max to do the magic 
unique_rows = data_frame.groupby(['Name','Date of birth']).max() 

# write the results 
result_csv_file_name = None 
with tempfile.NamedTemporaryFile('w', delete=False) as f: 
    result_csv_file_name = f.name 
    unique_rows.to_csv(f) 

# read and show the results 
with open(result_csv_file_name, 'r') as f: 
    print(f.read())

This produces:

Name,Date of birth,Date of event 
Barbara D,1965-08-01,2001-12-12 
John Smit,1996-08-23,1986-01-06 
John Snow,1965-07-20,1994-10-17
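The grouped result above moves the key columns to the front and takes the maximum of each remaining column independently, so with extra columns the values may come from different source rows. One possible way to keep whole source rows is to look up the row holding the latest event with `idxmax`. This is only a sketch: the `City` column and its values are invented here to stand in for "some other columns".

```python
import io

import pandas as pd

# "City" is a hypothetical extra column, added only for illustration
csv_text = """Date of event,Name,Date of birth,City
06.01.1986,John Smit,23.08.1996,Oslo
18.12.1996,Barbara D,01.08.1965,Rome
12.12.2001,Barbara D,01.08.1965,Kyiv
17.10.1994,John Snow,20.07.1965,Bonn
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0, 2], dayfirst=True)

# index label of the row with the latest event per (Name, Date of birth)
latest_idx = df.groupby(["Name", "Date of birth"])["Date of event"].idxmax()

# select those complete rows and restore the original row order
latest_rows = df.loc[latest_idx].sort_index()

# write the dates back in the source dd.mm.yyyy format, without the index
result_csv = latest_rows.to_csv(index=False, date_format="%d.%m.%Y")
print(result_csv)
```

Passing a file path as the first argument of `to_csv` would write the same text to disk instead of returning it.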

Source

2017-08-30 04:31:07

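But what should I do if I want to write this result out? I need the CSV grouped by max date, with all the columns of the source CSV. –

@AlexandrLebedev I updated my answer so that it also writes out the CSV. You really should just use Google to find the documentation. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html –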

import pandas as pd

# read the CSV with pandas; the dates are day-first (dd.mm.yyyy)
df = pd.read_csv('pathToCsv.csv', header=0, parse_dates=[0, 2], dayfirst=True)

# rename the columns to programming-friendly names, i.e. no whitespace
df.columns = ['dateOfEvent', 'name', 'DOB']  # and probably some other columns ..

# latest event date per (name, DOB) group; use the renamed column names here
yourwish = df.groupby(['name', 'DOB'])['dateOfEvent'].max()
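The groupby above yields only the maximum date per group, not the full rows. A hedged alternative that keeps every column is to sort by the event date and drop the earlier duplicates. The sketch below reuses the renamed columns and reads the sample data from an in-memory string rather than from `pathToCsv.csv`.

```python
import io

import pandas as pd

csv_text = """Date of event,Name,Date of birth
06.01.1986,John Smit,23.08.1996
18.12.1996,Barbara D,01.08.1965
12.12.2001,Barbara D,01.08.1965
17.10.1994,John Snow,20.07.1965
"""
df = pd.read_csv(io.StringIO(csv_text), header=0, parse_dates=[0, 2], dayfirst=True)
df.columns = ["dateOfEvent", "name", "DOB"]

# sort so the latest event per person comes last, keep only that last row,
# then restore the original row order
latest = (
    df.sort_values("dateOfEvent")
    .drop_duplicates(subset=["name", "DOB"], keep="last")
    .sort_index()
)
print(latest)
```

Because whole rows survive, any additional source columns would be carried along unchanged.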

Source

2017-08-30 03:25:56 yukclam9

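Thank you very much, it helped me find the row, but I also need the result to include the columns of the source CSV –

What does that mean? – yukclam9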
