2016-01-13 22 views
7

Python Pandas: remove a substring using another column

I have searched around and cannot find a simple way to do this, so I am hoping your expertise can help.

I have a DataFrame with two columns:

import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

This gives me the following pandas DataFrame:

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6

What I want to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column, if it is there. So the function would return:

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME

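So far I have defined a function and used the apply method. It runs quite slowly on my large dataset, though, so I am hoping there is a more efficient way to do it. Thanks!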

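import re

def address_remove(x):
    # Remove the NAME value from FULL_NAME for a single row.
    try:
        # re.escape treats NAME as literal text rather than as a regex pattern.
        new_addr = re.sub(re.escape(x['NAME']), '', x['FULL_NAME'])
        return new_addr.strip()
    except TypeError:
        # NAME or FULL_NAME is NaN (a float), so return FULL_NAME unchanged.
        return x['FULL_NAME']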

Answers

4

Here is a solution that is quite a bit faster than your current one. I am not convinced there is nothing faster still, though.

In [13]: import numpy as np
         import pandas as pd
         n = 1000
         testing = pd.DataFrame({
             'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'] * n,
             'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                           'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'] * n,
         })

It is a rather long one-liner, but it should do what you need.

The fastest solution I can come up with uses replace, as mentioned in another answer:

In [37]: %timeit testing['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop 

Original answer:

In [14]: %timeit testing['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop 

Compared with the current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1) 
10 loops, best of 3: 166 ms per loop 

These give you the same answer as the current solution.
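One caveat about the list comprehensions above: because both columns are cast with astype('str'), the NaN row in this data comes back as an empty string rather than NaN, and leftover whitespace is not stripped. The following is a minimal sketch, not part of the original answer, that keeps the zip-based loop but preserves NaN and strips the result. The helper name remove_name is hypothetical.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})


def remove_name(full_name, name):
    # Hypothetical helper: drop `name` from `full_name` when both are strings.
    if isinstance(full_name, str) and isinstance(name, str):
        return full_name.replace(name, '').strip()
    # Leave NaN (or any other non-string value) untouched.
    return full_name


# Pair each FULL_NAME with the NAME from the same row and clean it.
testing['NEW'] = [remove_name(e, k)
                  for e, k in zip(testing['FULL_NAME'], testing['NAME'])]
```

Note that row 4 keeps a double space where NAME4 was removed from the middle of the string; collapsing inner whitespace would be a separate step.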

+0

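Awesome! I was trying to come up with the second solution, but the third one is even better! Would you mind telling me what the 'zip' command is doing? – Link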

+0

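Glad it works! 'zip' takes multiple iterables and returns an iterator that aggregates the elements from each of them. In plainer terms, it lets you loop over two or more iterables at the same time. https://docs.python.org/3/library/functions.html#zip – johnchase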
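As a small illustration of that explanation, here is a sketch with two plain lists standing in for the two columns (the variable names are made up for the example):

```python
full_names = ['FIRST LAST', 'FIRST NAME3', 'LAST NAME']
names = ['FIRST', 'NAME3', 'NAME6']

# zip pairs the first items together, then the second items, and so on.
pairs = list(zip(full_names, names))

# Looping over both lists at once, in the same way as the list
# comprehension in the answer (with an extra strip for tidiness).
cleaned = [full.replace(name, '').strip()
           for full, name in zip(full_names, names)]
```

Here pairs holds one (full name, name) tuple per position, and cleaned holds the full names with the matching name removed.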

0

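I think you want to use the string replace() method, which is much faster than using a regular expression, roughly an order of magnitude in a quick check I just ran in IPython: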

%timeit mystr.replace("ello", "") 
The slowest run took 7.64 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 250 ns per loop 

%timeit re.sub("ello","", "e") 
The slowest run took 21.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 4.7 µs per loop 
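The two timings above run on different inputs (mystr in one case, the literal "e" in the other), so they are not directly comparable. The following is a rough sketch for timing both calls on the same string with the standard timeit module. The value of mystr is an assumption, since the original does not show it, and the numbers will vary by machine.

```python
import re
import timeit

mystr = "hello"  # assumed sample value; the original does not show it

# Both approaches should produce the same string.
replace_result = mystr.replace("ello", "")
regex_result = re.sub("ello", "", mystr)

# Time each call on the same input.
replace_time = timeit.timeit(lambda: mystr.replace("ello", ""), number=100_000)
regex_time = timeit.timeit(lambda: re.sub("ello", "", mystr), number=100_000)

print(f"str.replace: {replace_time:.4f} s, re.sub: {regex_time:.4f} s")
```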

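If you need further speed improvements after that, you should look at numpy's vectorized functions (but I think the speedup from using replace instead of a regular expression should already be quite substantial).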
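One possible reading of "numpy's vectorized functions" is the numpy.char string routines; that is an assumption, since the answer does not name a specific function. A minimal sketch on string-only arrays (NaN values would need to be handled separately, and no speed comparison is made here):

```python
import numpy as np

full_names = np.array(['FIRST LAST', 'FIRST NAME3', 'LAST NAME'])
names = np.array(['FIRST', 'NAME3', 'NAME6'])

# np.char.replace works element by element, pairing each full name
# with the name at the same position.
removed = np.char.replace(full_names, names, '')

# Strip the whitespace left behind at either end.
new = np.char.strip(removed)
```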

2

You can do it with the replace method and its regex argument, followed by str.strip:

In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex=True).str.strip()
Out[605]:
0            LAST
1             NaN
2      FIRST LAST
3           FIRST
4      FIRST LAST
5    ANOTHER NAME
6       LAST NAME
Name: FULL_NAME, dtype: object

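You need to pass only the non-null values of testing.NAME (selected with notnull), because without that filter the NaN values would also be replaced with an empty string.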

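In benchmarks it is slower than the fastest solution from @johnchase (roughly ten times slower in the timings below), but I think it is more readable and it uses only pandas methods of DataFrame and Series: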

In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip() 
100 loops, best of 3: 4.56 ms per loop 

In [661]: %timeit testing['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop 
+0

A pure pandas solution. Nice work. It is easier to read, even if it is not the fastest. – floydn

+0

Should 'df' in your code be 'testing'? – johnchase

+0

@johnchase Yes, sorry. That was what I had in my console. –
