Python Pandas: remove a substring using another column

I've searched around and can't find a simple way to do this, so I'm hoping your expertise can help.

I have two columns:

import numpy as np 
import pandas as pd 

pd.options.display.width = 1000 
testing = pd.DataFrame({'NAME':[ 
    'FIRST', np.nan, 'NAME2', 'NAME3', 
    'NAME4', 'NAME5', 'NAME6'], 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']})

This gives me the following pandas DataFrame:

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6

What I want to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column if it is there. So the function would return:

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME

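So far I've defined a function and used the apply method. It runs very slowly on my large dataset, though, so I'm hoping there is a more efficient way to do it. Thanks!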

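import re

def address_remove(x):
    # Remove the NAME value (used as a regex pattern) from FULL_NAME.
    # x['FULL_NAME'] replaces the positional x[-1], which depends on column order.
    try:
        newADDR1 = re.sub(x['NAME'], '', x['FULL_NAME'])
        return newADDR1.strip()
    except (TypeError, re.error):
        # NaN values or invalid patterns: return the original value unchanged.
        return x['FULL_NAME']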


2016-01-13 Link

Here is a solution that is quite a bit faster than your current one. I'm not convinced there isn't something faster still, though.

In [13]: import numpy as np 
     import pandas as pd 
     n = 1000 
     testing = pd.DataFrame({'NAME':[ 
     'FIRST', np.nan, 'NAME2', 'NAME3', 
     'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})

It's kind of a long one-liner, but it should do what you need.

The fastest solution I can come up with, as mentioned in another answer, uses replace:

In [37]: %timeit testing['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop

Original answer:

In [14]: %timeit testing['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop

Compared with the current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1) 
10 loops, best of 3: 166 ms per loop

These give you the same answers as the current solution.
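One caveat with the list comprehensions above: astype('str') has traditionally turned NaN into the string 'nan', and nothing strips the leftover spaces, so the raw output can differ slightly from the table in the question. Below is a minimal sketch that keeps NaN rows and trims whitespace. The helper name `remove_name` is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

def remove_name(full_name, name):
    # Hypothetical helper: leave non-string values (such as NaN) untouched.
    if not isinstance(full_name, str) or not isinstance(name, str):
        return full_name
    # Plain substring removal, then trim spaces left at either end.
    return full_name.replace(name, '').strip()

# Iterate over both columns in parallel without converting them to str first.
testing['NEW'] = [remove_name(e, k)
                  for e, k in zip(testing['FULL_NAME'], testing['NAME'])]
print(testing)
```

Removing a name from the middle of a string (row 4) leaves a double space behind. If that matters, something like ' '.join(s.split()) would collapse it.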


2016-01-13 18:58:28 johnchase

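Awesome! I was trying to come up with the second solution, but the third one is even better! Would you mind telling me what the 'zip' command is doing? – Link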

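Glad it works! 'zip' takes multiple iterables and returns an iterator that aggregates elements from each of them. In plainer terms, it lets you loop over two or more iterables at the same time. https://docs.python.org/3/library/functions.html#zip – johnchase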
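To make the zip explanation concrete, here is a small sketch using plain lists. The list names are made up for illustration; the values come from the question's data.

```python
names = ['FIRST', 'NAME2', 'NAME3']
full_names = ['FIRST LAST', 'FIRST LAST', 'FIRST NAME3']

# zip pairs the i-th element of each iterable into a tuple.
pairs = list(zip(full_names, names))
print(pairs)

# Looping over both lists at once, as the list comprehension in the answer does.
cleaned = [full.replace(name, '').strip()
           for full, name in zip(full_names, names)]
print(cleaned)
```

zip stops at the shortest iterable, which does not matter here because both columns of a DataFrame have the same length.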

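I think you want to use the string replace() method, which is much faster than using regular expressions (roughly an order of magnitude in the quick IPython check below):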

%timeit mystr.replace("ello", "") 
The slowest run took 7.64 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 250 ns per loop 

%timeit re.sub("ello","", "e") 
The slowest run took 21.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 4.7 µs per loop

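If you need further speed improvements after that, you should look at numpy's vectorized functions (but I think the speedup from using replace instead of regular expressions should already be considerable).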
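Besides speed, the two approaches differ in meaning: str.replace removes a literal substring, while re.sub interprets the name as a pattern. The sketch below uses a made-up name containing a dot to show the difference. The values are illustrative and not from the question.

```python
import re

full_name = 'A.B AXB LTD'
name = 'A.B'

# str.replace removes only the literal substring 'A.B'.
literal = full_name.replace(name, '').strip()

# re.sub treats '.' as "any character", so 'AXB' is removed as well.
as_pattern = re.sub(name, '', full_name).strip()

# re.escape makes the regex version match the literal text only.
escaped = re.sub(re.escape(name), '', full_name).strip()

print(literal, '|', as_pattern, '|', escaped)
```

If the names can contain characters such as '.', '(' or '+', the literal version is usually the one that matches the intent described in the question.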


2016-01-13 19:02:02

You can do it with the replace method and its regex argument, followed by str.strip:

In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex=True).str.strip()
Out[605]:
0            LAST
1             NaN
2      FIRST LAST
3           FIRST
4      FIRST LAST
5    ANOTHER NAME
6       LAST NAME
Name: FULL_NAME, dtype: object

Note that you need to apply notnull to testing.NAME, because without it the NaN values would also be replaced with an empty string.
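As far as I know, how Series.replace handles a Series passed as to_replace has varied between pandas versions, so the call above may not behave the same everywhere. The sketch below illustrates the same notnull idea with a swapped-in technique: a boolean mask combined with plain str.replace, not the Series.replace call from this answer. The variable name `mask` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

# Rows where both columns hold a value; NaN rows are skipped entirely.
mask = testing['NAME'].notnull() & testing['FULL_NAME'].notnull()

# Start from a copy of FULL_NAME so that skipped rows keep their NaN.
testing['NEW'] = testing['FULL_NAME']
testing.loc[mask, 'NEW'] = [
    full.replace(name, '').strip()
    for full, name in zip(testing.loc[mask, 'FULL_NAME'],
                          testing.loc[mask, 'NAME'])
]
print(testing['NEW'])
```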

In the benchmark it is slower than @johnchase's fastest solution, but I think it is more readable and it uses only pandas DataFrame and Series methods:

In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip() 
100 loops, best of 3: 4.56 ms per loop 

In [661]: %timeit testing['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop


2016-01-13 19:32:48

A pure pandas solution. Nice work. Easier to read, even if it isn't as fast. – floydn

Should 'df' be 'testing' in your code? – johnchase

@johnchase Yes, sorry. That was what I had in the console. –
