2016-07-03 34 views
0

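How to transfer the "common" columns between two CSV files

I am fairly new to programming and would like a program that transfers the common columns between file1.csv and file2.csv.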

Input:

file1.csv looks like this:

ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName 
1,J.,M,Dr.,Jason,,,Allan 
2,B.,M,Mr.,Brian,,,Welch 

file2.csv looks like this:

nickname,gender,city,id,prefix_name,first_name,Whatever1B,last_name,Whatever2B,Whatever3B,Whatever4B 

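Question:

How can I compare the headers of file1.csv and file2.csv to identify the "common" columns, and then transfer them between the two files? "Common" columns are those with similar naming conventions (e.g. ID and id, Nickname and nickname), or those that do not necessarily share a naming convention but store the same data (e.g. SubjectPrefix and prefix_name, SubjectFirstName and first_name).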

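Output:

The output should look like this.

  • Note: the transferred columns "id", "nickname" and "gender" are the ones named similarly in the file1.csv and file2.csv headers. The columns "prefix_name" and "first_name" correspond to "SubjectPrefix" and "SubjectFirstName" respectively.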

    id,nickname,gender,prefix_name,first_name,last_name 
    1,J.,M,Dr.,Jason,Allan 
    2,B.,M,Mr.,Brian,Welch 
    

I tried this code:

import csv 
import collections 

csv_file1 = "file1.csv" 
csv_file2 = "file2.csv" 

data1 = list(csv.reader(file(csv_file1,'r'))) 
data2 = list(csv.reader(file(csv_file2,'r'))) 

file1_header = data1[0][:] #get the header from file1 
file2_header = data2[0][:] #get the header from file2 
lowered_file1_header = [item.lower() for item in file1_header] #lowercase file1 header 
lowered_file2_header = [item.lower() for item in file2_header] #lowercase file2 header anyways 
col_index_dict = {} 

for column in lowered_file1_header:
    if column == "subjectprefix": # identify "subjectprefix" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)

    elif column == "subjectfirstname": # identify "subjectfirstname" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)

    elif column in file2_header: # identify the columns with same naming
        col_index_dict[column] = lowered_file1_header.index(column)

    else:
        col_index_dict[column] = -1 # mark the not matching columns

# Build header 
output = [col_index_dict.keys()] 
is_header = True 

for row in data1:
    if is_header is False:
        rowData = []
        for column in col_index_dict:
            column_index = col_index_dict[column]
            if column_index != -1:
                rowData.append(row[column_index])
            else:
                rowData.append('')
        output.append(rowData)
    else:
        is_header = False

print(output) 

Any ideas on how to solve this problem?

Answers

-1

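Thanks to Wboy for the contribution; your input was very useful.

I was able to find a solution to the problem using the pandas library. Here is the code: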

import pandas as pd 

# read the csv files 
df = pd.read_csv('file1.csv') 
df2 = pd.read_csv('file2.csv') 

# lowercase the headers 
df.columns = df.columns.str.lower() 
df2.columns = df2.columns.str.lower() 

df_columns = set(list(df.columns)) 
df2_columns = set(list(df2.columns)) 

Identify and transfer the "common" columns:

# names of the df2 columns that received data from df;
# initialised up front because set iteration order is not guaranteed
df3 = []

for col in list(df_columns):
    for col2 in list(df2_columns):
        if col == "subjectprefix" and col2 == "prefix_name":
            # copy the data from df["subjectprefix"] column to df2["prefix_name"] column in df2 dataframe
            df2["prefix_name"] = df["subjectprefix"]
            df3.append(col2)
        elif col == "subjectfirstname" and col2 == "first_name":
            # copy the data from "subjectfirstname" column to "first_name" column
            df2["first_name"] = df["subjectfirstname"]
            df3.append(col2)

        elif col == "subjectlastname" and col2 == "last_name":
            # copy the data from "subjectlastname" column to "last_name" column
            df2["last_name"] = df["subjectlastname"]
            df3.append(col2)

        elif col == col2:
            # copy the exactly matching columns to df2
            df2[col2] = df[col]
            df3.append(col2)

Remove the "uncommon" columns from the df2 dataframe:

for col2 in list(df2_columns):
    if col2 not in df3:
        del df2[col2]

# print the output 
df2.set_index("id",inplace=True) 
print(df2)

Save the output as a .csv file:

df2.to_csv('output.csv') 

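I believe this is not an optimal solution, and I hope the code for identifying and transferring the "common" columns can be improved. My code is full of if/elif statements, and I believe there must be a better way to do this.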
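A minimal sketch of one way to avoid the if/elif chain: keep the differently named pairs in a dictionary and let pandas rename the columns before comparing headers. This is an illustration under assumptions, not the code from the answer above. The `RENAMES` mapping and the `transfer_common_columns` helper are hypothetical names, and the sample data from the question is inlined with `io.StringIO` so the snippet runs without the files on disk.

```python
import io

import pandas as pd

# Sample data copied from the question, inlined so the sketch needs no files.
FILE1 = """ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName
1,J.,M,Dr.,Jason,,,Allan
2,B.,M,Mr.,Brian,,,Welch
"""
FILE2 = (
    "nickname,gender,city,id,prefix_name,first_name,"
    "Whatever1B,last_name,Whatever2B,Whatever3B,Whatever4B\n"
)

# Hypothetical mapping: lowercased file1 name -> file2 name, for columns that
# store the same data under different names.
RENAMES = {
    "subjectprefix": "prefix_name",
    "subjectfirstname": "first_name",
    "subjectlastname": "last_name",
}


def transfer_common_columns(df1, df2, renames):
    """Return df1's data restricted to the columns it shares with df2."""
    df1 = df1.copy()
    df1.columns = df1.columns.str.lower()
    df1 = df1.rename(columns=renames)
    target = set(df2.columns.str.lower())
    # Keep df1's column order, dropping anything df2 does not have.
    common = [col for col in df1.columns if col in target]
    return df1[common]


df1 = pd.read_csv(io.StringIO(FILE1))
df2 = pd.read_csv(io.StringIO(FILE2))
result = transfer_common_columns(df1, df2, RENAMES)
print(result.to_csv(index=False))
```

For the sample data, the resulting columns are id, nickname, gender, prefix_name, first_name and last_name. Adding another pair of differently named columns then only requires a new dictionary entry rather than another elif branch.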

+0

Take a look here for ideas... set(list(x)) is not needed... learn how to filter in pandas with "isin". https://people.duke.edu/~ccc14/sta-663/IntroductionToPythonSolutions.html – Merlin
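The "isin" filtering mentioned in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the two small dataframes are made-up stand-ins for the lowercased CSV data, not the real files.

```python
import pandas as pd

# Made-up stand-ins for the two lowercased dataframes.
df = pd.DataFrame({"id": [1, 2], "nickname": ["J.", "B."], "whatever1a": [None, None]})
df2 = pd.DataFrame(columns=["nickname", "city", "id"])

# Index.isin returns a boolean mask with one entry per column of df.
mask = df.columns.isin(df2.columns)

# Select only the columns of df whose names also appear in df2.
common_df = df.loc[:, mask]
print(list(common_df.columns))
```

Because the mask follows the column order of df, the selected columns keep their original order, which a set intersection does not guarantee.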

1

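Welcome to programming. Let me introduce you to the amazing pandas library.

Off the top of my head, here is something that addresses your problem. (I'm not saying it's efficient, so this may be a problem for large datasets.)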

import pandas as pd 

df = pd.read_csv('file1.csv') 
df2 = pd.read_csv('file2.csv')

df_columns = set(list(df.columns)) 
df2_columns = set(list(df2.columns)) 

common_columns = list(df_columns.intersection(df2_columns)) 

common_df = df[common_columns] 
common_df2 = df2[common_columns]

## At this point you have the common columns for both CSV's. if you want 
## to make them into one, just use df concatenate/append. else, you can save both of them like this: 

common_df.to_csv('common1.csv') 
common_df2.to_csv('common2.csv') 
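Combining the two results into one, as the code comment suggests, might be sketched as follows. This assumes both frames already share the same column names. The data here is made up for illustration, and `pd.concat` is used because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Made-up stand-ins for common_df and common_df2.
common_df = pd.DataFrame({"id": [1, 2], "gender": ["M", "M"]})
common_df2 = pd.DataFrame({"id": [3], "gender": ["F"]})

# Stack the rows of both frames; ignore_index renumbers the rows from 0.
combined = pd.concat([common_df, common_df2], ignore_index=True)

# index=False keeps the row numbers out of the CSV text.
print(combined.to_csv(index=False))
```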