2016-01-20 139 views

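Edit column values based on other column values

I have a pandas.DataFrame in which some columns need to be updated based on the values in those same columns. The NAME column has a different name in my real data, since I know that naming it this way is bad practice; it is only here for the example.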

Here is a sample of what I am working with:

import re
import pandas as pd

def anydigit(text):
    # Position of the first run of digits in text; 0 if there is none
    find_digit = re.search(r'\d+', text)
    if find_digit:
        return find_digit.start()
    else:
        return 0

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'ADDR_1': ['123 MAIN ST', 'ATTN: JOHN DOE'], 'ADDR_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['ADDR_1'].apply(anydigit)
df['addr_ad2'] = df['ADDR_2'].apply(anydigit)
df['AUX_ADDR_LINE'] = ''

This is what needs to happen:

if addr_ad1 == 0 and addr_ad2 > 0:
    aux_addr_line = addr_1
    addr_1 = addr_2
    addr_2 = ''
elif addr_ad1 > 0 and re.sub(r'\s+', '', addr_2)[:5] == 'POBOX':
    aux_addr_line = ''
    addr_1 = addr_1
    addr_2 = ''
elif addr_ad2 > 0 and re.sub(r'\s+', '', addr_1)[:5] == 'POBOX':
    aux_addr_line = ''
    addr_1 = addr_2
    addr_2 = ''

I would think .apply() would work, but I am not sure how I would write it.

Answer


After adjusting some of the variable names:

import re
import pandas as pd

def anydigit(text):
    # Position of the first run of digits in text; 0 if there is none
    find_digit = re.search(r'\d+', text)
    if find_digit:
        return find_digit.start()
    else:
        return 0

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'], 'addr_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['addr_1'].apply(anydigit)
df['addr_ad2'] = df['addr_2'].apply(anydigit)
df['aux_addr_line'] = ''

To start with:

  DPID      NAME          addr_1          addr_2  addr_ad1  addr_ad2  aux_addr_line
0   A1  John Doe     123 MAIN ST                         0         0
1   A2  Jane Doe  ATTN: JOHN DOE  P O BOX 123456         0         8
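
One thing worth noticing in this starting table: addr_ad1 is 0 for '123 MAIN ST'. anydigit returns the position of the match, and here the first digit sits at index 0, which is the same value it returns when there is no digit at all. If that distinction ever matters, a sentinel such as -1 could mark "no digit". A minimal sketch, where the name first_digit_pos and the -1 sentinel are assumptions for illustration and not part of the original answer:

```python
import re

def first_digit_pos(text):
    """Return the index of the first digit in text, or -1 if there is none."""
    match = re.search(r'\d', text)
    return match.start() if match else -1

print(first_digit_pos('123 MAIN ST'))     # 0
print(first_digit_pos('ATTN: JOHN DOE'))  # -1
print(first_digit_pos('P O BOX 123456'))  # 8
print(first_digit_pos(''))                # -1
```

With a helper like this, a "has a digit" test would be written as `>= 0` rather than `> 0`; whether that is the intended rule depends on the data.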

Define a function, then apply it to every row:

def change_address(row):
    if row.addr_ad1 == 0 and row.addr_ad2 > 0:
        # Digits only in addr_2: move addr_1 to the auxiliary line and promote addr_2
        row.aux_addr_line = row.addr_1
        row.addr_1 = row.addr_2
        row.addr_2 = ''
    elif row.addr_ad1 > 0 and re.sub(r'\s+', '', row.addr_2)[:5] == 'POBOX':
        # addr_2 is a PO box: keep addr_1 and clear addr_2
        row.aux_addr_line = ''
        row.addr_1 = row.addr_1
        row.addr_2 = ''
    elif row.addr_ad2 > 0 and re.sub(r'\s+', '', row.addr_1)[:5] == 'POBOX':
        # addr_1 is a PO box: replace it with addr_2
        row.aux_addr_line = ''
        row.addr_1 = row.addr_2
        row.addr_2 = ''
    return row

df = df.apply(change_address, axis=1) 

to get:

  DPID      NAME          addr_1  addr_2  addr_ad1  addr_ad2   aux_addr_line
0   A1  John Doe     123 MAIN ST                 0         0
1   A2  Jane Doe  P O BOX 123456                 0         8  ATTN: JOHN DOE
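
For larger frames, the same three rules could also be expressed with boolean masks and .loc instead of a row-wise apply. This is a different technique from the one used in the answer, sketched here under the assumption that the rules are exactly the three branches in the question; the mask names (swap, drop_2, promote_2) are made up for illustration.

```python
import re
import pandas as pd

def anydigit(text):
    # Position of the first run of digits; 0 if there is none
    match = re.search(r'\d+', text)
    return match.start() if match else 0

df = pd.DataFrame({
    'DPID': ['A1', 'A2'],
    'NAME': ['John Doe', 'Jane Doe'],
    'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'],
    'addr_2': ['', 'P O BOX 123456'],
})
df['addr_ad1'] = df['addr_1'].apply(anydigit)
df['addr_ad2'] = df['addr_2'].apply(anydigit)
df['aux_addr_line'] = ''

# True where the address, with all whitespace removed, starts with 'POBOX'
pobox_1 = df['addr_1'].str.replace(r'\s+', '', regex=True).str.startswith('POBOX')
pobox_2 = df['addr_2'].str.replace(r'\s+', '', regex=True).str.startswith('POBOX')

# Build all masks before assigning anything, so the branches stay
# mutually exclusive in the same order as the if/elif chain
swap = (df['addr_ad1'] == 0) & (df['addr_ad2'] > 0)
drop_2 = ~swap & (df['addr_ad1'] > 0) & pobox_2
promote_2 = ~swap & ~drop_2 & (df['addr_ad2'] > 0) & pobox_1

df.loc[swap, 'aux_addr_line'] = df.loc[swap, 'addr_1']
df.loc[drop_2 | promote_2, 'aux_addr_line'] = ''
df.loc[swap | promote_2, 'addr_1'] = df.loc[swap | promote_2, 'addr_2']
df.loc[swap | drop_2 | promote_2, 'addr_2'] = ''

print(df[['DPID', 'addr_1', 'addr_2', 'aux_addr_line']])
```

On the two sample rows this should leave row 0 untouched and move 'ATTN: JOHN DOE' into aux_addr_line for row 1, the same result as the apply version.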

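This is exactly what I was looking for! I dropped the addr_ad1 and addr_ad2 columns from the DataFrame, since they are only used for the calculation, and compute them inside the function instead, like this: addr_ad1 = anydigit(row.addr_1). Thanks!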


You're welcome. – Stefan
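
The variant described in the comment, with the digit positions computed inside the function rather than stored as columns, might look roughly like the sketch below. The commenter's actual code is not shown in the thread, so the details here are assumptions: the row is copied before it is modified, bracket indexing is used instead of attribute access, and the prefix check is written with str.startswith.

```python
import re
import pandas as pd

def anydigit(text):
    # Position of the first run of digits; 0 if there is none
    match = re.search(r'\d+', text)
    return match.start() if match else 0

def is_pobox(text):
    # True if the text, with all whitespace removed, starts with 'POBOX'
    return re.sub(r'\s+', '', text).startswith('POBOX')

def change_address(row):
    row = row.copy()  # avoid mutating the object that apply passes in
    addr_ad1 = anydigit(row['addr_1'])
    addr_ad2 = anydigit(row['addr_2'])
    if addr_ad1 == 0 and addr_ad2 > 0:
        row['aux_addr_line'] = row['addr_1']
        row['addr_1'] = row['addr_2']
        row['addr_2'] = ''
    elif addr_ad1 > 0 and is_pobox(row['addr_2']):
        row['aux_addr_line'] = ''
        row['addr_2'] = ''
    elif addr_ad2 > 0 and is_pobox(row['addr_1']):
        row['aux_addr_line'] = ''
        row['addr_1'] = row['addr_2']
        row['addr_2'] = ''
    return row

df = pd.DataFrame({
    'DPID': ['A1', 'A2'],
    'NAME': ['John Doe', 'Jane Doe'],
    'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'],
    'addr_2': ['', 'P O BOX 123456'],
})
df['aux_addr_line'] = ''
df = df.apply(change_address, axis=1)
print(df)
```

This keeps the helper values local to each call, so no temporary columns need to be dropped afterwards.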
