Efficiently combine similar CSV rows in Python

I want to combine similar rows in very large CSV files (close to 1 GB each) into one. I'm interested in doing something like this:

Before

First Name | Last Name | Phone Number | Email 

John  | Doe  | 1234   | [email protected] 
Jane  | Doe  | 4321   | [email protected] 
John  | Doe  | 6789   | [email protected] 
Jane  | Doe  | 9876   | [email protected]

After

First Name | Last Name | Phone Number | Email 

John  | Doe  | 1234, 6789 | [email protected], [email protected] 
Jane  | Doe  | 4321, 9876 | [email protected], [email protected]

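That is, rows that share the same first and last name should be combined, with their phone numbers and emails gathered into a "list".

Thanks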

Source

2017-08-25 Triple Nipple

If your question is tagged big data, you probably shouldn't be using itertools.

What should I use?

Any big-data or data-processing tool... numpy... pandas... Spark... Hadoop... etc.

To read in your CSV file, you need pd.read_csv:

# Pass only one of sep/delimiter (they are aliases); a regex sep needs engine='python'.
df = pd.read_csv('file.csv', sep=r'\s*\|\s*', engine='python')
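As a small sketch of how the padded `|` layout from the question could be parsed: the sample text, the name `raw`, and the example.com addresses below are made up for illustration, and the regex separator is one possible choice rather than the only one.

```python
import io
import pandas as pd

# Hypothetical sample in the question's padded, pipe-delimited layout.
raw = (
    "First Name | Last Name | Phone Number | Email\n"
    "John  | Doe  | 1234   | john@example.com\n"
    "Jane  | Doe  | 4321   | jane@example.com\n"
)

# A regex separator needs the Python parsing engine; it swallows the
# whitespace around each '|' so names and values arrive without padding.
df = pd.read_csv(io.StringIO(raw), sep=r"\s*\|\s*", engine="python")

print(df.columns.tolist())
print(df)
```

If the real file has no padding around the delimiter, a plain `sep='|'` with the default engine should be enough and is likely faster on large files.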

You then call df.groupby on First Name and Last Name, followed by dfGroupBy.agg with a join:

print(df) 

    First Name Last Name Phone Number   Email 
0 John   Doe     1234  [email protected] 
1 Jane   Doe     4321  [email protected] 
2 John   Doe     6789 [email protected] 
3 Jane   Doe     9876 [email protected] 


out = df.astype(str).groupby(['First Name', 'Last Name']).agg(', '.join) 
print(out) 

         Phone Number       Email 
First Name Last Name            
Jane   Doe   4321, 9876 [email protected], [email protected] 
John   Doe   1234, 6789 [email protected], [email protected]

If you want to reset the index, you can do so with df.reset_index:

out = out.reset_index() 
print(out) 

    First Name Last Name Phone Number       Email 
0 Jane   Doe   4321, 9876 [email protected], [email protected] 
1 John   Doe   1234, 6789 [email protected], [email protected]

Saving to CSV is simple: use out.to_csv('file.csv').
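One detail worth illustrating: by default `to_csv` also writes the row labels as an extra first column. The sketch below uses a made-up frame shaped like `out` and an in-memory buffer in place of a real path; `index=False` and `sep="|"` are optional choices, not requirements.

```python
import io
import pandas as pd

# Hypothetical combined result, shaped like `out` after reset_index().
out = pd.DataFrame({
    "First Name": ["Jane", "John"],
    "Last Name": ["Doe", "Doe"],
    "Phone Number": ["4321, 9876", "1234, 6789"],
})

buffer = io.StringIO()  # stands in for a file path such as 'file.csv'
# index=False leaves out the 0, 1, ... row labels; sep="|" keeps the
# input's delimiter, so the commas inside joined values need no quoting.
out.to_csv(buffer, sep="|", index=False)
print(buffer.getvalue())
```

With the default comma separator, pandas would instead wrap the joined values in quotes so that they still read back as a single field.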

Addendum: dropping duplicates

out = (df.astype(str).groupby(['First Name', 'Last Name'])
         .agg(lambda x: ', '.join(x.drop_duplicates().values)))
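To see the effect of dropping duplicates, here is a minimal self-contained sketch. The rows are invented for illustration, with one phone number deliberately repeated.

```python
import pandas as pd

# Hypothetical rows where John's phone number 1234 appears twice.
df = pd.DataFrame({
    "First Name": ["John", "John", "John", "Jane"],
    "Last Name": ["Doe", "Doe", "Doe", "Doe"],
    "Phone Number": [1234, 1234, 6789, 4321],
})

out = (
    df.astype(str)
      .groupby(["First Name", "Last Name"])
      # drop_duplicates() keeps the first occurrence of each value,
      # so the joined string preserves first-seen order.
      .agg(lambda x: ", ".join(x.drop_duplicates()))
      .reset_index()
)
print(out)
```

John's repeated 1234 is collapsed to a single entry, giving "1234, 6789" for his group.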

Source

2017-08-25 17:08:05

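Great job keeping the answer concise, easy to read, and simple (+1)

Thanks! It works! Do you have any idea how to extend the same code to drop duplicate entries within the "Phone" or "Email" columns? Instead of a value like "1234,1234,1234,6789" in the "Phone" column, I would have "1234,6789"? Thanks!

@TripleNipple The solution is to use 'drop_duplicates'. Check my edit.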

For a CSV file that looks like this (slightly reformatted to remove unnecessary spaces):

First Name|Last Name|Phone Number|Email 
John|Doe|1234|[email protected] 
Jane|Doe|4321|[email protected] 
John|Doe|6789|[email protected] 
Jane|Doe|9876|[email protected]

You can use pandas as follows to combine similar rows (based on first and last name):

import pandas as pd 

df = pd.read_csv("/tmp/test.csv", sep="|") 
# Join every value in a group into one comma-separated string.
def join_values(values):
    return ', '.join(str(v) for v in values)
df_combined = df.groupby(["First Name", "Last Name"], as_index=False).agg(
    {"Phone Number": join_values, "Email": join_values}
)
df_combined.to_csv("/tmp/combined_data.csv", sep="|", index=False)

The output file looks like this:

First Name|Last Name|Phone Number|Email 
Jane|Doe|4321, 9876|[email protected], [email protected] 
John|Doe|1234, 6789|[email protected], [email protected]
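For files near 1 GB, as in the question, loading everything into a DataFrame may use a lot of memory. As a rough alternative sketch using only the standard library, the rows can be streamed one at a time and grouped in a dictionary. The function name `combine_rows` and the sample data are hypothetical, and the grouped values themselves still have to fit in memory.

```python
import csv
import io
from collections import defaultdict


def combine_rows(source, key_fields=("First Name", "Last Name"), delimiter="|"):
    """Group rows by key_fields, joining the other columns with ', '."""
    reader = csv.DictReader(source, delimiter=delimiter)
    value_fields = [f for f in reader.fieldnames if f not in key_fields]
    # key tuple -> {column name -> list of values seen for that key}
    groups = defaultdict(lambda: defaultdict(list))
    for row in reader:  # only one parsed row is held at a time
        key = tuple(row[f] for f in key_fields)
        for field in value_fields:
            groups[key][field].append(row[field])
    for key, columns in groups.items():
        combined = dict(zip(key_fields, key))
        combined.update({f: ", ".join(v) for f, v in columns.items()})
        yield combined


# Hypothetical sample data; a real run would pass an open file instead.
sample = io.StringIO(
    "First Name|Last Name|Phone Number|Email\n"
    "John|Doe|1234|john1@example.com\n"
    "Jane|Doe|4321|jane1@example.com\n"
    "John|Doe|6789|john2@example.com\n"
)
result = list(combine_rows(sample))
for row in result:
    print(row)
```

The yielded dictionaries can be passed to `csv.DictWriter.writerows` to write the combined file. Groups come out in first-seen order rather than sorted, unlike the groupby result.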

Source

2017-08-25 17:09:22 MedAli

Thanks for your effort!
