2017-04-25 41 views
4

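Count duplicate values in a specific column of a CSV file and return the value to another column (Python 2)

I'm trying to count the duplicate values in a column of a CSV file and write that count to another CSV column, using Python.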

For example, my CSV file:

KeyID GeneralID 
145258 KL456 
145259 BG486 
145260 HJ789 
145261 KL456 

What I want to achieve is to count how many rows share the same GeneralID and insert that count into a new CSV column. For example:

KeyID Total_GeneralID 
145258 2 
145259 1 
145260 1 
145261 2 

I tried using the split method to separate each column, but it doesn't work well.

My code:

case_id_list_data = [] 

with open(file_path_1, "rU") as g:
    for line in g:
        case_id_list_data.append(line.split('\t'))
        #print case_id_list_data[0][0] #the result is dissatisfying
        #I'm stuck here..

Answers

1

And if you're averse to pandas and want to stay within the standard library:

Code:

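import csv
from collections import Counter

# Read the tab-delimited input; 'rU' is Python 2's universal-newline read mode.
with open('file1', 'rU') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    lines = [line for line in reader]
    # Count how often each GeneralID (second column) occurs.
    counts = Counter([l[1] for l in lines])

# Append each row's GeneralID count as a new last column.
new_lines = [l + [str(counts[l[1]])] for l in lines]
# In Python 2 the csv module expects the output file opened in binary mode.
with open('file2', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header + ['Total_GeneralID'])
    writer.writerows(new_lines)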

Result:

KeyID GeneralID Total_GeneralID 
145258 KL456 2 
145259 BG486 1 
145260 HJ789 1 
145261 KL456 2 
+0

Which Python version are you using to import the collections library? I'm using Python v2.6.6 and I get an error on 'from collections import Counter': 'ImportError: cannot import name Counter' – yunaranyancat

+1

Counter is 2.7+, but you can get the source here: http://code.activestate.com/recipes/576611-counter-class/ –
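
If pulling in the recipe is not an option on 2.6, the same tally can probably be kept with `collections.defaultdict(int)`, which predates `Counter`. A minimal sketch, assuming the rows have already been read into lists of `[KeyID, GeneralID]`; the names `count_column` and `rows` are made up for illustration and are not from the answer above:

```python
from collections import defaultdict

def count_column(rows, index):
    """Tally how often each value appears in column `index` of `rows`."""
    counts = defaultdict(int)  # missing keys start at 0
    for row in rows:
        counts[row[index]] += 1
    return counts

# Hypothetical sample rows matching the data in the question.
rows = [
    ['145258', 'KL456'],
    ['145259', 'BG486'],
    ['145260', 'HJ789'],
    ['145261', 'KL456'],
]

counts = count_column(rows, 1)
# Append each row's count as a string, ready for csv.writer.
new_rows = [row + [str(counts[row[1]])] for row in rows]
```

The resulting `new_rows` can be passed to `writer.writerows` exactly like `new_lines` in the answer.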

1
import pandas as pd
# read your csv into a dataframe
df = pd.read_csv('file_path_1')
# count the values in the GeneralID column once, then look up the occurrence count for each row
counts = df.GeneralID.value_counts()
df['Total_GeneralID'] = df.GeneralID.map(counts)
df = df[['KeyID','Total_GeneralID']]
Out[442]: 
    KeyID Total_GeneralID 
0 145258    2 
1 145259    1 
2 145260    1 
3 145261    2 
1

You can use the pandas library:


import pandas as pd 

df = pd.read_csv('file') 
s = df['GeneralID'].value_counts().rename('Total_GeneralID') 
df = df.join(s, on='GeneralID') 
print (df) 
    KeyID GeneralID Total_GeneralID 
0 145258  KL456    2 
1 145259  BG486    1 
2 145260  HJ789    1 
3 145261  KL456    2 
3

You can divide the task into three steps:
1. Read the CSV file
2. Generate the new column's values
3. Add the values back to the file

import csv
import fileinput
import sys

# 1. Read CSV file
# Open the CSV and read its two columns: xs holds KeyID, ys holds GeneralID.
with open("dev.csv") as filein:
    reader = csv.reader(filein, skipinitialspace = True)
    xs, ys = zip(*reader)

result=["Total_GeneralID"]

# 2. Generate new column's value
# Count how many times each "GeneralID" occurs, skipping the header at index 0.
for i in range(1,len(ys),1):
    result.append(ys.count(ys[i]))

# 3. Add value to the file back
# Rewrite the file in place, appending the new column to every line.
for ind,line in enumerate(fileinput.input("dev.csv",inplace=True)):
    sys.stdout.write("{}, {}\n".format(line.rstrip(),result[ind]))

I didn't use a temporary file or any high-level module such as pandas.

+0

Can you show another approach using csv.DictReader? Or is it the same thing? And what does 'xs, ys = zip(*reader)' do? – yunaranyancat

+1

zip() returns a list of tuples, where the i-th tuple contains the i-th element from each of the argument sequences. –
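
To make the `zip(*reader)` idiom concrete, here is a small sketch of the transposition it performs. The `rows` list is a hypothetical stand-in for what `csv.reader` would yield for a two-column file:

```python
# Hypothetical rows, as csv.reader would yield them for a two-column file.
rows = [
    ['KeyID', 'GeneralID'],
    ['145258', 'KL456'],
    ['145259', 'BG486'],
]

# zip(*rows) is the same as zip(rows[0], rows[1], rows[2]):
# it pairs up the first fields, then the second fields, which
# turns a list of rows into a sequence of columns.
xs, ys = zip(*rows)

print(xs)  # ('KeyID', '145258', '145259')
print(ys)  # ('GeneralID', 'KL456', 'BG486')
```

Note that unpacking into exactly two names only works when every row has exactly two fields; a file with more columns would need more names on the left-hand side.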

0

Use csv.reader instead of the split() method. It's easier.
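
One concrete difference, sketched on a hypothetical tab-delimited sample (the `sample` string stands in for the question's file, and the snippet is written for Python 3, where `io.StringIO` accepts a plain string):

```python
import csv
import io

# Hypothetical tab-delimited sample standing in for the question's file.
sample = "KeyID\tGeneralID\n145258\tKL456\n145259\tBG486\n"

# Manual splitting leaves the newline attached to the last field.
split_rows = [line.split('\t') for line in io.StringIO(sample)]

# csv.reader strips the line terminator and also handles quoted fields.
csv_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

print(split_rows[1])  # ['145258', 'KL456\n']
print(csv_rows[1])    # ['145258', 'KL456']
```

The stray `\n` is one likely reason the split-based attempt gave unsatisfying results: values in the last column would need an extra `rstrip()` before they could be compared or counted reliably.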

Thanks
