2017-04-21 82 views
0

I need to perform pandas DataFrame operations on multiple files: operating on and merging several files with pandas

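import pandas as pd

df1 = pd.read_csv("~/pathtofile/sample1.csv")
some_df = pd.read_csv("~/pathtofile/metainfo.csv")
# sort_values returns a new frame, so assign the result back
df1 = df1.sort_values('col2')
# .copy() avoids SettingWithCopyWarning when adding new_col below
df1 = df1[df1.col5 != 'N'].copy()
df1['new_col'] = df1['col3'] - df1['col2'] + 1
# build a "col1:col2-col3(col4)" label for each row
f = lambda row: '{col1}:{col2}-{col3}({col4})'.format(**row)
df4 = df1.assign(Unique=df1.astype(str).apply(f, axis=1))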
# print(df4) 
##merge columns 
df44 = df4.merge(some_df, left_on='genes', right_on='name', suffixes=('','_1')) 
df44 = df44.rename(columns={'id':'id_new'}).drop(['name_1'], axis=1) 
# print(df44) 
# parentheses let the expression span two lines
df44['some_col'] = (df44['some_col'] + ':E' +
                    df44.groupby('some_col').cumcount().add(1).astype(str).str.zfill(3))
print(df44) 
##drop unwanted columns adapted from http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe 
df4 = df44 
df4.drop(df4.columns[[3,7,9,11,12,13]], axis=1, inplace=True) 

df4 = df4[['col0', 'col1', 'col2', 'col4', 'col5', 'col6', 'col8']] 
df4 
df4.to_csv('foo.csv', index=False) 

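The code above handles just one file. A few questions: 1) I have ~15 files that need this same set of operations; how can I apply it to all 15 files? 2) How do I write the results to 15 different CSVs? 3) How do I merge certain columns from all 15 DFs into a matrix? (The example below merges just 3 DFs.)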

sample1 = df4.set_index('col1')['col4']
sample2 = df5.set_index('col1')['col4']
sample3 = df6.set_index('col1')['col4']
concat = pd.concat([sample1,sample2,sample3], axis=1).fillna(0) 
# print(concat) 
concat.reset_index(level=0, inplace=True) 
concat.columns = ["newcol0", "col1", "col2", "col3"] 
concat.to_csv('bar.csv', index=False) 

Is there a better way to do this than copying and pasting 15 times?

+0

Yes, make a script and generalize your operations into functions –

+0

Hi @DmitryPolonskiy, could you please show a snippet of how to do this? – novicebioinforesearcher

+0

You don't know how to write a script? –

Answers

1

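Okay, I just quickly put this together for the code above. I would suggest learning how to write scripts and generalize things. I have not cleaned up the code or removed the redundancies; I'll leave that to you. If the code you posted works, this should work from the command line.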

import sys 
import pandas as pd 

def load_df(input_file):
    # read_csv already returns a DataFrame
    return pd.read_csv(input_file)

def perform_operations(df):
    # sort_values returns a new frame, so assign the result back
    df = df.sort_values('col2')
    # .copy() avoids SettingWithCopyWarning when adding new_col below
    df = df[df.col5 != 'N'].copy()
    df['new_col'] = df['col3'] - df['col2'] + 1
    f = lambda row: '{col1}:{col2}-{col3}({col4})'.format(**row)
    df4 = df.assign(Unique=df.astype(str).apply(f, axis=1))
    return df4

def merge_stuff(df, df1): 
    df44 = df.merge(df1, left_on='genes', right_on='name', suffixes=('','_1')) 
    df44 = df44.rename(columns={'id':'id_new'}).drop(['name_1'], axis=1) 
    return df44 


def group_and_drop(df): 
    # parentheses let the expression span two lines
    df['some_col'] = (df['some_col'] + ':E' +
                      df.groupby('some_col').cumcount().add(1).astype(str).str.zfill(3))
    df4 = df 
    df4.drop(df4.columns[[3,7,9,11,12,13]], axis=1, inplace=True) 
    return df4 

def write_out_csv(df): 
    df = df[['col0', 'col1', 'col2', 'col4', 'col5', 'col6', 'col8']] 
    df.to_csv('foo.csv', index=False) 


def main(): 
    file_1 = sys.argv[1] 
    file_2 = sys.argv[2] 
    df = load_df(file_1) 
    df1 = load_df(file_2) 
    df4 = perform_operations(df) 
    df44 = merge_stuff(df4, df1) 
    grouped = group_and_drop(df44) 
    write_out_csv(grouped) 

if __name__ == '__main__': 
    main() 
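
As written, the script handles one sample file per run and always writes to foo.csv, so running it 15 times would overwrite the output each time. One way to cover all files in a single run is sketched below. It assumes each input can be processed independently; `process_all`, its parameters, and the `_processed.csv` suffix are hypothetical names chosen for illustration, not part of the code above.

```python
from pathlib import Path

import pandas as pd


def process_all(input_paths, transform, out_dir):
    """Read each CSV, apply `transform`, and write one output CSV per input."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for path in map(Path, input_paths):
        df = transform(pd.read_csv(path))
        # sample1.csv -> sample1_processed.csv, so outputs never overwrite each other
        df.to_csv(out_dir / f"{path.stem}_processed.csv", index=False)
        # keep the processed frame, keyed by file stem, for any later merging
        results[path.stem] = df
    return results
```

Here `transform` would be a function that chains perform_operations, merge_stuff and group_and_drop for a single DataFrame, and `input_paths` could come from something like `sorted(Path("~/pathtofile").expanduser().glob("sample*.csv"))`, assuming the files follow that naming pattern.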
+0

Thanks for the help, I will work on this and learn... thank you very much – novicebioinforesearcher

+1

In case you don't know how it works, from the command line you would do something like 'python name_of_script.py location_of_first_csv location_of_second_csv' –
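
For the third part of the question (combining one column from every processed DataFrame into a matrix), the three-sample concat generalizes to any number of frames. This is a minimal sketch, assuming the values in the key column are unique within each frame; `build_matrix` and the sample labels are hypothetical names, and the `col1`/`col4` defaults are taken from the question's example.

```python
import pandas as pd


def build_matrix(frames, key="col1", value="col4"):
    """Combine one column from many DataFrames into a wide matrix.

    `frames` maps a sample label to its DataFrame; each label becomes a column.
    """
    series = {name: df.set_index(key)[value] for name, df in frames.items()}
    # passing a dict to concat uses its keys as the new column names;
    # rows missing from a sample are filled with 0, as in the question
    matrix = pd.concat(series, axis=1).fillna(0)
    return matrix.rename_axis(key).reset_index()
```

Calling `build_matrix({"sample1": df4, "sample2": df5, "sample3": df6}).to_csv('bar.csv', index=False)` would then replace the hand-written concat, and adding a file only means adding one more entry to the dict.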