How to transfer the "common" columns between two CSV files

I am fairly new to programming and would like to write a program that transfers the "common" columns between file1.csv and file2.csv.

Input:

file1.csv looks like this:

ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName 
1,J.,M,Dr.,Jason,,,Allan 
2,B.,M,Mr.,Brian,,,Welch

file2.csv looks like this:

nickname,gender,city,id,prefix_name,first_name,Whatever1B,last_name,Whatever2B,Whatever3B,Whatever4B

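Question:

How can I compare the headers of file1.csv and file2.csv, identify the "common" columns, and then transfer those columns between the files? "Common" columns either follow a similar naming convention (e.g. ID and id, Nickname and nickname) or do not share a naming convention but store the same data (e.g. SubjectPrefix and prefix_name, SubjectFirstName and first_name).

Output:

The output should look like this.

Note: the columns "id", "nickname", and "gender" are transferred because their names are similar in the file1.csv and file2.csv headers. The columns "prefix_name" and "first_name" correspond to "SubjectPrefix" and "SubjectFirstName" respectively.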
```
id,nickname,gender,prefix_name,first_name,last_name 
1,J.,M,Dr.,Jason,Allan 
2,B.,M,Mr.,Brian,Welch 
```

I tried this code:

import csv

csv_file1 = "file1.csv"
csv_file2 = "file2.csv"

# read every row of both files into memory
with open(csv_file1, newline="") as f1:
    data1 = list(csv.reader(f1))
with open(csv_file2, newline="") as f2:
    data2 = list(csv.reader(f2))

file1_header = data1[0][:]  # get the header from file1
file2_header = data2[0][:]  # get the header from file2
lowered_file1_header = [item.lower() for item in file1_header]  # lowercase file1 header
lowered_file2_header = [item.lower() for item in file2_header]  # lowercase file2 header anyway
col_index_dict = {}

for column in lowered_file1_header:
    if column == "subjectprefix":  # identify "subjectprefix" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)

    elif column == "subjectfirstname":  # identify "subjectfirstname" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)

    elif column in lowered_file2_header:  # identify the columns with the same name
        col_index_dict[column] = lowered_file1_header.index(column)

    else:
        col_index_dict[column] = -1  # mark the non-matching columns

# Build header
output = [list(col_index_dict.keys())]
is_header = True

for row in data1:
    if is_header is False:
        rowData = []
        for column in col_index_dict:
            column_index = col_index_dict[column]
            if column_index != -1:
                rowData.append(row[column_index])
            else:
                rowData.append('')
        output.append(rowData)
    else:
        is_header = False

print(output)

Any ideas on how to solve this?

Source

2016-07-03 MEhsan


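Thanks to Wboy for the contribution; your input was very useful.

I was able to find a solution to the problem using the pandas library. Here is the code: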

import pandas as pd 

# read the csv files 
df = pd.read_csv('file1.csv') 
df2 = pd.read_csv('file2.csv') 

# lowercase the headers 
df.columns = df.columns.str.lower() 
df2.columns = df2.columns.str.lower() 

df_columns = set(list(df.columns)) 
df2_columns = set(list(df2.columns))

Identify and transfer the "common" columns:

df3 = []  # names of the df2 columns that received data from df

for col in list(df_columns):
    for col2 in list(df2_columns):
        if col == "subjectprefix" and col2 == "prefix_name":
            # copy the data from df["subjectprefix"] to df2["prefix_name"]
            df2["prefix_name"] = df["subjectprefix"]
            df3.append(col2)

        elif col == "subjectfirstname" and col2 == "first_name":
            # copy the data from "subjectfirstname" column to "first_name" column
            df2["first_name"] = df["subjectfirstname"]
            df3.append(col2)

        elif col == "subjectlastname" and col2 == "last_name":
            # copy the data from "subjectlastname" column to "last_name" column
            df2["last_name"] = df["subjectlastname"]
            df3.append(col2)

        elif col == col2:
            # copy the exactly matching columns to df2
            df2[col2] = df[col]
            df3.append(col2)

Drop the "uncommon" columns from the df2 DataFrame:

for col2 in list(df2_columns):
    if col2 not in df3:
        del df2[col2]

# print the output
df2.set_index("id", inplace=True)
print(df2)

Save the output as a .csv file:

df2.to_csv('output.csv')

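I am sure this is not an optimal solution, and I hope the code that identifies and transfers the "common" columns can be improved. My code is full of if/elif statements, and I believe there must be a better way to do this.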
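One way to avoid the if/elif chain might be to put the name correspondences in a dictionary and let pandas rename the columns. The following is only a sketch under stated assumptions: the `ALIASES` table and the `transfer_common_columns` helper are hypothetical names introduced for illustration, and the sample data is inlined with `io.StringIO` instead of being read from the real files.

```python
import io
import pandas as pd

# Hypothetical alias table: lowercased file1 header -> file2 header.
ALIASES = {
    "subjectprefix": "prefix_name",
    "subjectfirstname": "first_name",
    "subjectlastname": "last_name",
}

def transfer_common_columns(df1, df2_columns, aliases=ALIASES):
    """Return df1 restricted to the columns file2 also has, using file2's names."""
    # Normalize file1 headers (strip whitespace, lowercase), then apply the aliases.
    def normalize(name):
        key = name.strip().lower()
        return aliases.get(key, key)

    renamed = df1.rename(columns=normalize)
    # Keep file1's column order, but only names that appear in file2's header.
    wanted = {c.strip().lower() for c in df2_columns}
    common = [c for c in renamed.columns if c in wanted]
    return renamed[common]

# Sample data from the question, inlined for the sketch.
file1 = io.StringIO(
    "ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName\n"
    "1,J.,M,Dr.,Jason,,,Allan\n"
    "2,B.,M,Mr.,Brian,,,Welch\n"
)
file2_header = ["nickname", "gender", "city", "id", "prefix_name", "first_name",
                "Whatever1B", "last_name", "Whatever2B", "Whatever3B", "Whatever4B"]

result = transfer_common_columns(pd.read_csv(file1), file2_header)
csv_text = result.to_csv(index=False)
print(csv_text)
```

Adding another correspondence then only means adding one entry to the dictionary, and `result.to_csv('output.csv', index=False)` would write the file.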

Source

2016-07-03 23:51:42 MEhsan

Take a look at this idea... set(list(x)) is not needed... and learn how filtering with "isin" works in pandas. https://people.duke.edu/~ccc14/sta-663/IntroductionToPythonSolutions.html – Merlin
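For readers unfamiliar with the "isin" filtering mentioned in this comment, here is a minimal sketch of what it could look like for column names. The two small DataFrames are made-up stand-ins for the real files, not data from the question.

```python
import pandas as pd

# Made-up stand-ins for the two files (headers already lowercased).
df = pd.DataFrame({"id": [1, 2], "nickname": ["J.", "B."], "whatever1a": [None, None]})
df2 = pd.DataFrame(columns=["nickname", "city", "id"])

# Boolean mask over df's columns: True where the name also appears in df2.
mask = df.columns.isin(df2.columns)

# Select only the shared columns; df's column order is preserved.
common_df = df.loc[:, mask]
print(list(common_df.columns))
```

This only catches identically named columns, so headers that differ in case or wording still need to be normalized or mapped first.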

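Welcome to programming. Let me introduce you to the amazing pandas library.

Off the top of my head, here is something that should solve your problem. (I am not saying it is efficient, so it could be a problem for large datasets!)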

import pandas as pd 

df = pd.read_csv('file1.csv') 
df2 = pd.read_csv('file2.csv')

df_columns = set(list(df.columns)) 
df2_columns = set(list(df2.columns)) 

common_columns = list(df_columns.intersection(df2_columns)) 

common_df = df[common_columns] 
common_df2 = df2[common_columns]

## At this point you have the common columns of both CSVs. If you want
## to combine them into one, use pd.concat; otherwise you can save each of them like this:

common_df.to_csv('common1.csv') 
common_df2.to_csv('common2.csv')
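Note that a plain set intersection only matches header strings that are exactly identical, so `ID` and `id` would not count as common. A possible workaround, sketched here with only the standard `csv` module, is to intersect normalized names while remembering each file's original spelling. The `normalized_header` helper and the shortened sample headers are assumptions made for this illustration.

```python
import csv
import io

def normalized_header(csv_text):
    """Map each normalized (stripped, lowercased) header name to its original spelling."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {name.strip().lower(): name for name in header}

# Shortened sample headers, inlined for the sketch.
file1_text = "ID,Nickname,Gender,SubjectPrefix\n1,J.,M,Dr.\n"
file2_text = "nickname,gender,city,id,prefix_name\n"

h1 = normalized_header(file1_text)
h2 = normalized_header(file2_text)

# Intersect on the normalized names; sort for a deterministic order.
common = sorted(h1.keys() & h2.keys())

# Pair the original spellings so each file can be indexed by its own header.
pairs = [(h1[name], h2[name]) for name in common]
print(pairs)
```

Columns such as SubjectPrefix and prefix_name still do not match this way; they need an explicit mapping.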

Source

2016-07-03 17:41:28 Wboy
