2017-02-28 112 views

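Case-insensitive replace (map)

I posted a "part 1" question whose answer gave me the function I needed (here), but I think this warrants its own question. If not, I will delete it.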

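I want to apply a function to a DataFrame that replaces full state names with their abbreviations (New York -> NY). However, I noticed that in my dataset, if a state name is written in capital letters, it does not match the mapping. I tried to work around it but can't seem to crack the code: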

import pandas as pd 
import numpy as np 
# np.nan (lowercase) is used because the np.NaN alias was removed in NumPy 2.0
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,5,np.nan],
                    'B' : [1,0,3,5,0,0,np.nan,9,0,0],
                    'C' : ['Pharmacy of IDAHO','NY Pharma','NJ Pharmacy','Idaho Rx','CA Herbals','Florida Pharma','AK RX','Ohio Drugs','PA Rx','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn']})

import us 
statez = us.states.mapping('abbr', 'name') 
inv_map = {v: k for k, v in statez.items()} 

def replace_states(company): 
    # find all states that exist in the string 
    state_found = filter(lambda state: state.lower() in company.lower(), statez.values()) 

    # replace each state with its abbreviation 
    for state in state_found:
        print(state, inv_map[state])
        company = company.replace(state, inv_map[state])
        print("---", company)

    # return the modified string (or original if no states were found) 
    return company 

dfp['C'] = dfp['C'].map(replace_states) 

Output (note that "Pharmacy of IDAHO" is left unchanged):

Idaho ID 
--- Pharmacy of IDAHO 
Idaho ID 
--- ID Rx 
Florida FL 
--- FL Pharma 
Ohio OH 
--- OH Drugs

Is there a way to make this function case-insensitive?

Answers

0

To replace state names with their abbreviations (case-insensitive vectorized solution):

t1 = dfp.C.str.split(expand=True) 
t2 = t1.stack().str.title().map(inv_map).unstack() 
t1[t2.notnull()] = t2 
dfp['new'] = t1.stack().groupby(level=0).agg(' '.join) 

Result:

In [152]: x 
Out[152]: 
     A    B                  C            D           E             new
0  NaN  1.0  Pharmacy of IDAHO     123456.0      Assign  Pharmacy of ID
1  NaN  0.0          NY Pharma     123456.0    Unassign       NY Pharma
2  3.0  3.0        NJ Pharmacy    1234567.0      Assign     NJ Pharmacy
3  4.0  5.0           Idaho Rx   12345678.0        Ugly           ID Rx
4  5.0  0.0         CA Herbals      12345.0  Appreciate      CA Herbals
5  5.0  0.0     Florida Pharma      12345.0        Undo       FL Pharma
6  3.0  NaN              AK RX   12345678.0      Assign           AK RX
7  1.0  9.0         Ohio Drugs  123456789.0    Unicycle        OH Drugs
8  5.0  0.0              PA Rx    1234567.0      Assign           PA Rx
9  NaN  0.0         USA Pharma          NaN     Unicorn      USA Pharma

Explanation:

In [155]: t1 = dfp.C.str.split(expand=True) 

In [156]: t1 
Out[156]: 
          0         1      2
0  Pharmacy        of  IDAHO
1        NY    Pharma   None
2        NJ  Pharmacy   None
3     Idaho        Rx   None
4        CA   Herbals   None
5   Florida    Pharma   None
6        AK        RX   None
7      Ohio     Drugs   None
8        PA        Rx   None
9       USA    Pharma   None

In [157]: t2 = t1.stack().str.title().map(inv_map).unstack() 

In [158]: t2 
Out[158]: 
     0    1     2
0  NaN  NaN    ID
1  NaN  NaN  None
2  NaN  NaN  None
3   ID  NaN  None
4  NaN  NaN  None
5   FL  NaN  None
6  NaN  NaN  None
7   OH  NaN  None
8  NaN  NaN  None
9  NaN  NaN  None

In [159]: t1[t2.notnull()] = t2 

In [160]: t1 
Out[160]: 
          0         1     2
0  Pharmacy        of    ID
1        NY    Pharma  None
2        NJ  Pharmacy  None
3        ID        Rx  None
4        CA   Herbals  None
5        FL    Pharma  None
6        AK        RX  None
7        OH     Drugs  None
8        PA        Rx  None
9       USA    Pharma  None
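
The per-token logic of the steps above can be sketched in plain Python. `NAME_TO_ABBR` is a small hypothetical stand-in for `inv_map`, and `abbreviate_tokens` is an illustrative helper that is not part of the answer. The sketch also shows a limitation of splitting on whitespace: a multi-word name such as "New York" is never looked up as a whole, so it is left unchanged.

```python
# Hypothetical stand-in for inv_map (full state name -> abbreviation).
NAME_TO_ABBR = {"Idaho": "ID", "Florida": "FL", "Ohio": "OH", "New York": "NY"}


def abbreviate_tokens(company):
    """Title-case each whitespace-separated token and look it up in the mapping."""
    tokens = company.split()
    # Tokens that are not state names (or only part of one) are kept as written.
    return " ".join(NAME_TO_ABBR.get(tok.title(), tok) for tok in tokens)


print(abbreviate_tokens("Pharmacy of IDAHO"))  # Pharmacy of ID
print(abbreviate_tokens("New York Pharma"))    # New York Pharma (not abbreviated)
```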

To replace state abbreviations with their full names (case-insensitive vectorized solution):

In [88]: dfp['state'] = dfp.C.str.extract(r'\b([A-Z]{2})\b', expand=False) 

In [89]: dfp 
Out[89]: 
     A    B                  C            D           E  state
0  NaN  1.0  Pharmacy of IDAHO     123456.0      Assign    NaN
1  NaN  0.0          NY Pharma     123456.0    Unassign     NY
2  3.0  3.0        NJ Pharmacy    1234567.0      Assign     NJ
3  4.0  5.0           Idaho Rx   12345678.0        Ugly    NaN
4  5.0  0.0         CA Herbals      12345.0  Appreciate     CA
5  5.0  0.0     Florida Pharma      12345.0        Undo    NaN
6  3.0  NaN              AK RX   12345678.0      Assign     AK
7  1.0  9.0         Ohio Drugs  123456789.0    Unicycle    NaN
8  5.0  0.0              PA Rx    1234567.0      Assign     PA
9  NaN  0.0         USA Pharma          NaN     Unicorn    NaN

In [90]: dfp.C = dfp.C.replace(dfp.state.tolist(), 
           dfp.state.map(statez).tolist(), 
           regex=True) 

In [91]: dfp 
Out[91]: 
     A    B                    C            D           E  state
0  NaN  1.0    Pharmacy of IDAHO     123456.0      Assign    NaN
1  NaN  0.0      New York Pharma     123456.0    Unassign     NY
2  3.0  3.0  New Jersey Pharmacy    1234567.0      Assign     NJ
3  4.0  5.0             Idaho Rx   12345678.0        Ugly    NaN
4  5.0  0.0   California Herbals      12345.0  Appreciate     CA
5  5.0  0.0       Florida Pharma      12345.0        Undo    NaN
6  3.0  NaN            Alaska RX   12345678.0      Assign     AK
7  1.0  9.0           Ohio Drugs  123456789.0    Unicycle    NaN
8  5.0  0.0      Pennsylvania Rx    1234567.0      Assign     PA
9  NaN  0.0           USA Pharma          NaN     Unicorn    NaN
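
For comparison, the same abbreviation-to-name direction can be sketched in plain Python with `re.sub`. `ABBR_TO_NAME` is a small hypothetical stand-in for `statez`, and `expand_abbreviations` is an illustrative name. Note that the pattern `[A-Z]{2}` only matches uppercase tokens, and that this sketch replaces every matching token in a string rather than only the first one extracted.

```python
import re

# Hypothetical stand-in for statez (abbreviation -> full state name).
ABBR_TO_NAME = {
    "NY": "New York",
    "NJ": "New Jersey",
    "CA": "California",
    "AK": "Alaska",
    "PA": "Pennsylvania",
}

# A standalone two-letter uppercase token, as in the extract() call above.
_ABBR = re.compile(r"\b([A-Z]{2})\b")


def expand_abbreviations(company):
    """Replace known two-letter abbreviations; leave unknown tokens untouched."""
    return _ABBR.sub(lambda m: ABBR_TO_NAME.get(m.group(1), m.group(1)), company)


print(expand_abbreviations("AK RX"))       # Alaska RX
print(expand_abbreviations("USA Pharma"))  # USA Pharma
```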

I know it's a bit counterintuitive, but I actually want to go from the full state name to the abbreviated version. For example: 'Ohio -> OH' – MattR


@MattR, hmm..., that makes it more challenging. Let me try another solution... – MaxU


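There have been some edits, so I'm not sure what was posted before, but the first part does exactly what I need. I have no idea how you did it, though, but it's brilliant. Any explanation would be great, but out of respect for your time and help, it isn't necessary! – MattR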

0

I would find its index and then use that to replace it case-insensitively:

    # replace each state with its abbreviation
    for state in state_found:
        print(state, inv_map[state])
        # locate the state name regardless of case
        index = company.lower().find(state.lower())
        # replace the text exactly as it is spelled in the original string
        company = company.replace(company[index:index + len(state)], inv_map[state])
        print("---", company)

This preserves the case of all other parts of the string.
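
An alternative to the index arithmetic, offered here as a sketch rather than as the approach from this answer, is a single regular expression compiled with `re.IGNORECASE`. `NAME_TO_ABBR` is a small hypothetical stand-in for `inv_map`, and `_PATTERN` and `_LOOKUP` are illustrative names.

```python
import re

# Hypothetical stand-in for inv_map (full state name -> abbreviation).
NAME_TO_ABBR = {"Idaho": "ID", "Florida": "FL", "Ohio": "OH", "New York": "NY"}

# Longest names first, so a longer name wins over a shorter one it contains.
_NAMES = sorted(NAME_TO_ABBR, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    re.IGNORECASE,
)
# Lowercased keys let the matched text be looked up whatever its case.
_LOOKUP = {name.lower(): abbr for name, abbr in NAME_TO_ABBR.items()}


def replace_states(company):
    """Replace every state name in `company`, ignoring case."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(0).lower()], company)


print(replace_states("Pharmacy of IDAHO"))  # Pharmacy of ID
print(replace_states("new york Pharma"))    # NY Pharma
```

With the real mapping in place of `NAME_TO_ABBR`, the function could be applied as in the question, via `dfp['C'].map(replace_states)`.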


I apologize for my confusion, but could you explain where to put this code, and perhaps the reasoning behind it? When I put it in my loop, I get crazy output. – MattR


@MattR I've added the surrounding code to help you place it. If it's not right, please let me know the output you get. – TemporalWolf


I added some sample code so that posters can use my test DataFrame. But here is my current output: 'Idaho ID --- IDPIDhIDaIDrIDmIDaIDcIDyID IDoIDfID IDIIDDIDAIDHIDOID Idaho ID --- IDIIDdIDaIDhIDoID IDRIDxID' – MattR
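
Output of this shape is what `str.replace` produces when its first argument is the empty string: the replacement is inserted at every position. One way that can happen here is lowercasing only one side of the search, so that `find` returns -1 and the slice comes back empty. This is an assumption, since the thread does not show the code that was actually run. The sketch below reproduces the symptom with a hypothetical `replace_unchecked` and shows a guarded variant, `replace_checked`.

```python
def replace_unchecked(company, state, abbr):
    # Hypothetical slip: only the state name is lowercased, so a
    # capitalised name in `company` is not found and find() returns -1.
    index = company.find(state.lower())
    # With index == -1 the slice below is empty for these inputs.
    fragment = company[index:index + len(state)]
    # replace("", abbr) inserts abbr before every character and at the end.
    return company.replace(fragment, abbr)


def replace_checked(company, state, abbr):
    # Lowercase both sides and skip the replacement when nothing matches.
    index = company.lower().find(state.lower())
    if index == -1:
        return company
    # Splice by position so only the first matched span is replaced.
    return company[:index] + abbr + company[index + len(state):]


print(replace_unchecked("Idaho Rx", "Idaho", "ID"))         # IDIIDdIDaIDhIDoID IDRIDxID
print(replace_checked("Pharmacy of IDAHO", "Idaho", "ID"))  # Pharmacy of ID
```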