2017-05-04 25 views

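Get the valid value from each cell and update the column names using pandas in Python

I have the DataFrame below. The actual data is very large and contains many NaN values.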

 Date  ID  Code Value Value1 Value2  Value3 
0 1945-12-30 H0010603 ZZZ008-2 zzz=ID AAC=10 NaN  NaN 
1 1945-12-30 H0010603 ZZZ008-2 zzz=ID AAC=01 NaN  NaN 
2 1945-12-30 H0010603 ZZZ008-2 NaN  NaN VEC=1  NaN 
3 1945-12-30 H0010603 ZZZ008-2 NaN  NaN VEC=2 AAC= 1 A 
4 1945-12-30 H0010603 ZZZ008-2 NaN  NaN VEC=3 AAC= 1 A 

This is the final expected data.

 Date  ID  Code zzz  AAC VEC  AAC.1 
0 1945-12-30 H0010603 ZZZ008-2  ID  10 NaN  NaN 
1 1945-12-30 H0010603 ZZZ008-2  ID  01 NaN  NaN 
2 1945-12-30 H0010603 ZZZ008-2 NaN  NaN  1  NaN 
3 1945-12-30 H0010603 ZZZ008-2 NaN  NaN  2  1 A 
4 1945-12-30 H0010603 ZZZ008-2 NaN  NaN  3  1 A 

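I need to rename each column using the key found in its cells (the text before "="), and keep only the text after "=" as the cell value.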

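import pandas as pd

df = pd.read_excel(xlPath, 0)  # xlPath: path to the workbook; 0 = first sheet
writer = pd.ExcelWriter(xlPath,
                        engine='xlsxwriter',
                        date_format='mm/dd/yyyy',
                        datetime_format='mm/dd/yyyy')
df = df.fillna('')
# iteritems() and set_value() belong to older pandas releases
# (later replaced by items() and .at)
for ColumnName, values in df.iteritems():
    for index, value in enumerate(values):
        if '=' in str(value):
            # keep the text after '=' as the cell value
            df.set_value(index, ColumnName, str(value).split('=')[1])
            # use the text before '=' as the new column name
            NewColumnName = str(value).split('=')[0]
            df.rename(columns={ColumnName: NewColumnName}, inplace=True)

df.to_excel(writer, index=False)
writer.save()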

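But this raises an error, because one column name ends up duplicated. So I thought I could loop through the df, take the first valid value in each column, and put it in a list.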

AllColumns = list(df.columns.values)
NewColNameList = []
for ColumnName, values in df.iteritems():
    a = 0
    for index, value in enumerate(values):
        while a < len(values):
            if '=' in str(value):
                if value != '':
                    print(index, values)
                    NewColNameList.append(value)
                    break
                a += 1
print(NewColNameList)

But I am not as good with while loops as I thought. Any help getting the desired DataFrame is appreciated.
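For what it's worth, the idea described above (collect the first valid value of each column) does not need a while loop at all. In the attempt as posted, `a` is only incremented inside the `if '=' in str(value)` branch, so the loop appears to spin forever on a cell without "=". Below is a minimal sketch using `Series.first_valid_index()`; the small frame is a stand-in for the question's data, and `first_key` and `new_col_names` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's data (only the Value* columns matter here).
df = pd.DataFrame({
    "Value":  ["zzz=ID", "zzz=ID", np.nan, np.nan, np.nan],
    "Value1": ["AAC=10", "AAC=01", np.nan, np.nan, np.nan],
    "Value2": [np.nan, np.nan, "VEC=1", "VEC=2", "VEC=3"],
    "Value3": [np.nan, np.nan, np.nan, "AAC= 1 A", "AAC= 1 A"],
})

def first_key(col):
    """Return the text before '=' in the first non-null cell, or None."""
    idx = col.first_valid_index()  # label of the first non-NaN cell, None if all NaN
    if idx is None:
        return None
    cell = str(col.loc[idx])
    return cell.split("=", 1)[0] if "=" in cell else None

# One candidate name per column, in column order.
new_col_names = [first_key(df[c]) for c in df.columns]
print(new_col_names)  # ['zzz', 'AAC', 'VEC', 'AAC']
```

Note that the list contains "AAC" twice, which is exactly the duplicate-name situation the rename loop runs into.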

Answer

IIUC:

Dataset:

In [314]: df 
Out[314]: 
     Date  ID  Code Value Value1 Value2 Value3 
0 1945-12-30 H0010603 ZZZ008-2 zzz=ID AAC=10 NaN  NaN 
1 1945-12-30 H0010603 ZZZ008-2 zzz=ID AAC=01 NaN  NaN 
2 1945-12-30 H0010603 ZZZ008-2  NaN  NaN VEC=1  NaN 
3 1945-12-30 H0010603 ZZZ008-2  NaN  NaN VEC=2 AAC= 1 A 
4 1945-12-30 H0010603 ZZZ008-2  NaN  NaN VEC=3 AAC= 1 A 

Solution:

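import pandas as pd

def get_col_name(col):
    # non-string columns (e.g. dates) keep their name
    if col.dtype != object:
        return col.name
    # first cell that looks like "key=..."; na=False treats NaN as no match
    s = col.loc[col.str.contains(r'\w+\=', na=False).idxmax()]
    # the text before '=' becomes the column name
    if isinstance(s, str) and '=' in s:
        return s.split('=')[0]
    return col.name

df = (df.rename(columns=lambda x: get_col_name(df[x]) if x.startswith('Value') else x)
        .replace(r'\w+\=', '', regex=True))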

Result:

In [83]: %paste 
df = (df.rename(columns=lambda x: get_col_name(df[x]) if x.startswith('Value') else x) 
     .replace(r'\w+\=', '', regex=True)) 
## -- End pasted text -- 

In [84]: df 
Out[84]: 
     Date  ID  Code zzz AAC VEC AAC 
0 1945-12-30 H0010603 ZZZ008-2 ID 10 NaN NaN 
1 1945-12-30 H0010603 ZZZ008-2 ID 01 NaN NaN 
2 1945-12-30 H0010603 ZZZ008-2 NaN NaN 1 NaN 
3 1945-12-30 H0010603 ZZZ008-2 NaN NaN 2 1 A 
4 1945-12-30 H0010603 ZZZ008-2 NaN NaN 3 1 A 
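The result above ends up with two columns both labelled AAC, whereas the expected output in the question labels the second one AAC.1. If unique labels are wanted, one possible variation is sketched below. It is not part of the solution above: `split_key_value` and its `prefix` parameter are hypothetical names, the frame is a cut-down stand-in for the dataset, and the code assumes a column's key is the text before "=" in its first non-null cell.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for the dataset above.
df = pd.DataFrame({
    "Code":   ["ZZZ008-2"] * 5,
    "Value":  ["zzz=ID", "zzz=ID", np.nan, np.nan, np.nan],
    "Value1": ["AAC=10", "AAC=01", np.nan, np.nan, np.nan],
    "Value2": [np.nan, np.nan, "VEC=1", "VEC=2", "VEC=3"],
    "Value3": [np.nan, np.nan, np.nan, "AAC= 1 A", "AAC= 1 A"],
})

def split_key_value(df, prefix="Value"):
    """Rename `prefix*` columns after their key and strip 'key=' from the cells."""
    out = df.copy()
    new_names, seen = [], {}
    for name in out.columns:
        col = out[name]
        key = None
        if str(name).startswith(prefix):
            idx = col.first_valid_index()  # None when the column is all NaN
            if idx is not None and "=" in str(col.loc[idx]):
                key = str(col.loc[idx]).split("=", 1)[0]
                # drop the leading "key=" and surrounding whitespace; NaN stays NaN
                out[name] = col.str.replace(r"^\s*\w+=", "", regex=True).str.strip()
        label = key if key is not None else name
        # number repeated labels: AAC, AAC.1, AAC.2, ...
        count = seen.get(label, 0)
        seen[label] = count + 1
        new_names.append(label if count == 0 else f"{label}.{count}")
    out.columns = new_names
    return out

result = split_key_value(df)
print(result)
```

Assigning the whole list to `out.columns` at the end, instead of calling `rename` one column at a time, avoids ever having two columns with the same label while the cells are still being rewritten.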

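This only solves half of the problem. I also need to take the text that gets removed from the remaining cells and use it as the column names. Is there any way to do that? – Naveen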

@Naveen, please check my updated post – MaxU
