2014-04-24 49 views

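Merge columns together if the other values are null

I routinely get data files that have many similar columns, but for each row only one of those columns actually holds any data. Sometimes, though, it only looks that way. Ideally, I'd like a function to which I can pass a list of columns to check: for any row in which exactly one of those columns contains a value, it would combine them into a single column and set the source columns to NaN, so that I can easily drop the redundant columns at the end. If more than one column has data, the row should not be merged or changed.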

So, for example, I have a DataFrame like this:

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3, 4, 5, 6, 7]),
    "a1": pd.Series(['a', np.nan, np.nan, 'c', 'd', np.nan, np.nan]),
    "a2": ([np.nan, 'b', 'c', np.nan, 'd', 'e', np.nan]),
    "a3": ([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'f'])
})

Code-wise, this is what I have right now:

import pandas as pd
import numpy as np

def test(row, index, combined):
    values = 0
    foundix = 0
    # check which, if any, column has data
    for ix in index:
        if not pd.isnull(row[ix]):
            values = values + 1
            foundix = ix
    # check that it found only 1 value; if so, clean up
    if values == 1:
        row[combined] = row[foundix]
        for ix in index:
            row[ix] = np.nan
    return row

df["a"] = np.nan
df.apply(lambda x: test(x, ["a1", "a2", "a3"], "a"), 1)
print(df)

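So the problems I have with my code are:

  1. I have a feeling this is the wrong way to go about solving my problem.
  2. I don't fully understand how to make my apply function actually apply to the rows and change them.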

My ideal output would be the following (mainly to help with cleaning up the data afterwards and handling the odd cases):

    a1   a2   a3  id    a
0  NaN  NaN  NaN   1    a
1  NaN  NaN  NaN   2    b
2  NaN  NaN  NaN   3    c
3  NaN  NaN  NaN   4    c
4    d    d  NaN   5  NaN
5  NaN  NaN  NaN   6    e
6  NaN  NaN  NaN   7    f

You can also upvote ;-) – EdChum

Answer


My approach seems to be slightly faster:

In [415]: 

df = pd.DataFrame({
    "id": pd.Series([1, 2, 3, 4, 5, 6, 7]),
    "a1": pd.Series(['a', np.nan, np.nan, 'c', 'd', np.nan, np.nan]),
    "a2": ([np.nan, 'b', 'c', np.nan, 'd', 'e', np.nan]),
    "a3": ([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'f'])
})
df 
Out[415]: 
    a1   a2   a3  id
0    a  NaN  NaN   1
1  NaN    b  NaN   2
2  NaN    c  NaN   3
3    c  NaN  NaN   4
4    d    d  NaN   5
5  NaN    e  NaN   6
6  NaN  NaN    f   7

[7 rows x 4 columns] 
In [416]: 

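def gen_col(x):
    # Drop the NaNs once, then decide based on how many values remain.
    values = x.dropna()
    if len(values) > 1:
        return np.nan
    else:
        # Note: raises ValueError for a row with no values at all.
        return values.values.max()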

import pandas as pd
import numpy as np

def test(row, index, combined):
    values = 0
    foundix = 0
    # check which, if any, column has data
    for ix in index:
        if not pd.isnull(row[ix]):
            values = values + 1
            foundix = ix
    # check that it found only 1 value; if so, clean up
    if values == 1:
        row[combined] = row[foundix]
        for ix in index:
            row[ix] = np.nan
    return row
%timeit df.apply(lambda x: test(x, ["a1", "a2", "a3"], "a"), 1) 
%timeit df['a'] = df[['a1','a2','a3']].apply(lambda row: gen_col(row), axis=1) 
df 
100 loops, best of 3: 7.08 ms per loop 
100 loops, best of 3: 3.24 ms per loop 
Out[416]: 
    a1   a2   a3  id    a
0    a  NaN  NaN   1    a
1  NaN    b  NaN   2    b
2  NaN    c  NaN   3    c
3    c  NaN  NaN   4    c
4    d    d  NaN   5  NaN
5  NaN    e  NaN   6    e
6  NaN  NaN    f   7    f

[7 rows x 5 columns] 

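The key thing I'm doing here is checking how many values remain after dropping all the NaN values, which seems to be faster than your code (3.24 ms versus 7.08 ms per loop on this 7-row frame).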
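For larger frames, the same "count the non-null values" idea can probably be expressed without a row-wise apply at all. The following is only a sketch under a few assumptions: the function name combine_single_valued is made up for illustration, it relies on DataFrame.notna and DataFrame.bfill from a reasonably recent pandas, and unlike the code above it also blanks the source columns in the merged rows, which is what the ideal output in the question shows.

```python
import numpy as np
import pandas as pd


def combine_single_valued(df, cols, combined):
    """Merge `cols` into `combined` for rows where exactly one of them is non-null.

    Hypothetical helper for illustration; returns a new DataFrame.
    """
    out = df.copy()
    # True for rows in which exactly one candidate column holds a value.
    single = out[cols].notna().sum(axis=1) == 1
    # Back-fill across the columns so the first column ends up carrying
    # the first non-null value of each row.
    first_value = out[cols].bfill(axis=1).iloc[:, 0]
    # Keep that value only for the single-valued rows; others become NaN.
    out[combined] = first_value.where(single)
    # Blank the source columns, but only in the rows that were merged.
    out.loc[single, cols] = np.nan
    return out


df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "a1": ["a", np.nan, np.nan, "c", "d", np.nan, np.nan],
    "a2": [np.nan, "b", "c", np.nan, "d", "e", np.nan],
    "a3": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "f"],
})
result = combine_single_valued(df, ["a1", "a2", "a3"], "a")
print(result)
```

Row 4, where both a1 and a2 hold 'd', is left untouched and gets NaN in the combined column. This also touches the second point in the question: apply returns a new object rather than modifying df, so its result has to be assigned back, as the df['a'] = ... line above does.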
