Python pandas: join a dynamic number of columns while removing duplicates

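We have a use case where we need to concatenate all column values in a row while removing duplicates. The data is stored in a pandas DataFrame. For example, consider the DataFrame df below with columns A, B, C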

A B C 
X1 AX X1 
X2 X2 X1 
X3 X3 X3 
X4 XX XX

I want to add a new column that concatenates A, B and C and removes any duplicates found, while preserving the order. The output would be

A B C Newcol
X1 AX X1 X1_AX 
X2 X2 X1 X2_X1 
X3 X3 X3 X3 
X4 XX XX X4_XX

Note that the number of columns is dynamic. So far I am doing this with the command

df.apply(lambda x: '-'.join(x.dropna().astype(str).drop_duplicates()),axis=1)

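This works, but it is slow: it takes about 150 seconds on my data. Since more than 90% of the DataFrames usually have only 2 columns, I added an if statement to my code and, for the 2-column case, run the following command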

t1=pd.Series(np.where(df.iloc[:,0].dropna().astype(str) != df.iloc[:,1].dropna().astype(str), df.iloc[:,0].dropna().astype(str)+"-"+df.iloc[:,1].dropna().astype(str),df.iloc[:,1].dropna().astype(str)))

which takes about 55.3 ms,

or even

t1=df.iloc[:,0].dropna().astype(str).where(df.iloc[:,0].dropna().astype(str) == df.iloc[:,1].dropna().astype(str), df.iloc[:,0].dropna().astype(str)+"-"+df.iloc[:,1].dropna().astype(str))

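Both take almost the same time (about 55 ms, compared with 150 seconds), but the problem is that they only work for 2 columns. I want to create a generic statement that can handle n columns. I tried using reduce on top of this, but it gave an error when I tried it with 3 columns.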

reduce((lambda x,y:pd.Series(np.where(df.iloc[:,x].dropna().astype(str) != df.iloc[:,y].dropna().astype(str), df.iloc[:,x].dropna().astype(str)+"-"+df.iloc[:,y].dropna().astype(str),df.iloc[:,y].dropna().astype(str)))),list(range(df.shape[1])))

TypeError: '>=' not supported between instances of 'str' and 'int'
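
In the reduce call above, the value returned by the first step is a Series, and reduce passes it back in as x on the next step, so df.iloc[:, x] no longer receives a column position; that is a likely cause of the TypeError. Below is a rough sketch of one way to generalise the two-column comparison to n columns without a row-wise apply. The helper name join_unique_columns and its sep parameter are made up for illustration, and the approach assumes the number of columns is small, because each column is compared against every earlier one.

```python
import pandas as pd

def join_unique_columns(df, sep="_"):
    """Hypothetical helper: join each row's values left to right,
    skipping NaN and values already seen earlier in the same row."""
    cols = [df.iloc[:, i] for i in range(df.shape[1])]
    result = pd.Series("", index=df.index, dtype=object)
    for j, col in enumerate(cols):
        # keep a value only if it is present and differs from all earlier columns
        keep = col.notna()
        for prev in cols[:j]:
            keep &= ~(col == prev)
        text = col.astype(str).where(keep, "")
        # prepend the separator only where something was already collected
        has_prefix = result != ""
        addition = (sep + text).where(has_prefix, text)
        result = result.where(~keep, result + addition)
    return result

df = pd.DataFrame({
    "A": ["X1", "X2", "X3", "X4"],
    "B": ["AX", "X2", "X3", "XX"],
    "C": ["X1", "X1", "X3", "XX"],
})
print(join_unique_columns(df).tolist())
# ['X1_AX', 'X2_X1', 'X3', 'X4_XX']
```

All work here is done column by column, so the cost grows with the square of the column count rather than with a Python-level call per row.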

Note that df is actually one chunk of a multi-core parallel task, so it would be great if the suggestions do not involve parallelism.

Source

2017-06-05 niths4u

Try

df['new'] = df.astype('str').apply(lambda x: '_'.join(set(x)), axis = 1) 

    A B C new 
0 X1 AX X1 AX_X1 
1 X2 X2 X1 X1_X2 
2 X3 X3 X3 X3 
3 X4 XX XX X4_XX

Edit: to preserve the order of the column values

def my_append(x):
    l = []
    for elm in x:
        if elm not in l:
            l.append(elm)
    return '_'.join(l)


df['New col']=df.astype('str').apply(my_append, axis = 1) 

1000 loops, best of 3: 871 µs per loop

A B C New col 
0 X1 AX X1 X1_AX 
1 X2 X2 X1 X2_X1 
2 X3 X3 X3 X3 
3 X4 XX XX X4_XX
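
The order-preserving loop above can also be written with dict.fromkeys, which keeps the first occurrence of each key in insertion order. This is a minimal sketch, assuming Python 3.7 or later (where dict ordering is guaranteed); join_unique is a hypothetical name, not part of pandas.

```python
import pandas as pd

def join_unique(values, sep="_"):
    # dict.fromkeys drops repeats while keeping first-seen order
    return sep.join(dict.fromkeys(values))

df = pd.DataFrame({
    "A": ["X1", "X2", "X3", "X4"],
    "B": ["AX", "X2", "X3", "XX"],
    "C": ["X1", "X1", "X3", "XX"],
})
df["New col"] = df.astype(str).apply(join_unique, axis=1)
print(df["New col"].tolist())
# ['X1_AX', 'X2_X1', 'X3', 'X4_XX']
```

Unlike the `elm not in l` test, which scans the list each time, the dict lookup is a hash lookup, though with only a handful of columns the difference is probably small.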

Edit 1: If you have NaN in any column, like this

A B C 
0 X1 AX X1 
1 X2 X2 X1 
2 X3 X3 X3 
3 NaN XX XX

handle it in the function and then apply

def my_append(x):
    l = []
    for elm in x:
        if elm not in l:
            l.append(elm)
    # after astype('str'), missing values show up as the string 'nan'
    l = [x for x in l if str(x) != 'nan']
    return '_'.join(l)

df['New col']=df.astype('str').apply(my_append, axis = 1) 


    A B C New col 
0 X1 AX X1 X1_AX 
1 X2 X2 X1 X2_X1 
2 X3 X3 X3 X3 
3 NaN XX XX XX
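
Filtering on the string 'nan' would also discard a genuine text value "nan". A possible variant, sketched under the assumption that missing values are real NaN/None in the frame, tests for missing values with pd.notna before converting to str. The function name join_unique_skip_missing is hypothetical.

```python
import numpy as np
import pandas as pd

def join_unique_skip_missing(row, sep="_"):
    # drop missing values first, then convert the rest to str
    present = [str(v) for v in row if pd.notna(v)]
    # dict.fromkeys removes repeats and keeps first-seen order
    return sep.join(dict.fromkeys(present))

df = pd.DataFrame({
    "A": ["X1", "X2", "X3", np.nan],
    "B": ["AX", "X2", "X3", "XX"],
    "C": ["X1", "X1", "X3", "XX"],
})
df["New col"] = df.apply(join_unique_skip_missing, axis=1)
print(df["New col"].tolist())
# ['X1_AX', 'X2_X1', 'X3', 'XX']
```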

Source

2017-06-05 17:01:04 Vaishali

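Sorry, but as I said, I need to preserve the order, and set changes the order of the keys. Indexing into the set gives an error, and there is not much time gain either – niths4u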

Yes, I noticed that later; please see the edit – Vaishali

Wow. That did the trick: the new code takes only 2 seconds, versus 150 seconds. Thanks. One question: what about dropna()? Shouldn't that be added as well? – niths4u

pd.unique doesn't sort. Use it wrapped in a comprehension

df.assign(new_col=['_'.join(pd.unique(row)) for row in df.values]) 

    A B C new_col 
0 X1 AX X1 X1_AX 
1 X2 X2 X1 X2_X1 
2 X3 X3 X3 X3 
3 X4 XX XX X4_XX

Handling NaN

df.assign(new_col=[
    '_'.join(pd.unique([i for i in row if pd.notnull(i)])) for row in df.values
])
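
For reference, here is a self-contained sketch of the same pd.unique idea on the NaN example. It masks each row's NumPy array instead of building a list, since, as far as I know, newer pandas releases warn when pd.unique is given a plain Python list. The variable names rows and out are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["X1", "X2", "X3", np.nan],
    "B": ["AX", "X2", "X3", "XX"],
    "C": ["X1", "X1", "X3", "XX"],
})

# one object array per row; NaN entries are masked out before pd.unique
rows = df.to_numpy(dtype=object)
out = df.assign(
    new_col=["_".join(pd.unique(row[pd.notna(row)])) for row in rows]
)
print(out["new_col"].tolist())
# ['X1_AX', 'X2_X1', 'X3', 'XX']
```

This assumes every non-missing value is already a string; other types would need a str conversion before the join.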

Source

2017-06-05 18:15:45 piRSquared

It doesn't handle NaN. – niths4u

@niths4u updated – piRSquared

It works now, thanks. %timeit took about 2.71 seconds – niths4u
