2016-01-14 25 views

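I want to replace the long strings in my DataFrame with shorter strings. I have a short dictionary of the replacements I want to make. How can I replace the various long strings in my DataFrame with the shorter ones?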

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was: from StringIO import StringIO

replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
3;This is a long substring1 of text that has lots of words
4;This is substring2 and also contains more text than needed
5;This is substring2 and also contains more text than needed
6;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
8;This is a long substring1 of text that has lots of words
9;Within this string is a short substring that is unique
10;Within this string is a short substring that is unique
""")

df = pd.read_csv(exampledata, sep=";")
print(df)

for s in replacement_dict.keys(): 
    if df['Long String'].str.contains(s): 
     df['Long String'] = replacement_dict[df['Long String'].str.contains(s)] 

The expected DataFrame looks like this:

   id Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3

When I run the code above, I get this error:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    if df['Long String'].str.contains(s):
  File "h:\Anaconda\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I replace the various long strings in my DataFrame with shorter strings?

Answer


You can do this kind of thing with .replace(). However, you will need to modify your dictionary slightly to get the result you expect.

replacement_dict = { 
    ".*substring1.*": "substring1", 
    ".*substring2.*": "substring2", 
    ".*a short substring.*": "substring3", 
} 

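What I did was turn each key into a regular-expression string. It matches everything before and after the substring you want to match on. This will be important in a minute.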
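If the dictionary is long, the wildcard keys could be derived from the original one instead of being typed by hand. The following is only a sketch: `to_full_match_patterns` and `regex_dict` are hypothetical names, not part of pandas or of the question, and `re.escape` is added on the assumption that the keys are meant as literal text rather than as patterns.

```python
import re

# The original dictionary from the question: plain substrings mapped to short labels.
replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}


def to_full_match_patterns(mapping):
    """Hypothetical helper: wrap each key in .* so the pattern spans the whole cell."""
    # re.escape protects keys that contain regex metacharacters such as "." or "(".
    return {".*" + re.escape(key) + ".*": value for key, value in mapping.items()}


regex_dict = to_full_match_patterns(replacement_dict)
```

The resulting `regex_dict` can be passed to `.replace(..., regex=True)` in the same way as the hand-written dictionary.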

Next, replace your entire for loop with the following:

df['Long String'] = df['Long String'].replace(replacement_dict, regex=True) 

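.replace() can take a dictionary in which the keys are the strings to match and the values are the replacement text. The reason for changing the keys to capture everything before and after the substring is that we can now replace the entire value, rather than just the small matched string.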
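Putting the pieces together, a minimal end-to-end sketch might look like the following. It uses a three-row subset of the question's data and assumes Python 3; the names `regex_dict` and `result` are introduced here for illustration only.

```python
from io import StringIO

import pandas as pd

# Keys are regular expressions that span the whole cell; values are the short labels.
regex_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

# A three-row subset of the sample data from the question.
exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
""")

df = pd.read_csv(exampledata, sep=";")

# regex=True makes pandas treat each dictionary key as a pattern, not a literal value.
df["Long String"] = df["Long String"].replace(regex_dict, regex=True)

result = df["Long String"].tolist()
print(df)
```

Because each pattern covers the full cell, the whole string is swapped for the short label instead of only the matched fragment.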

For example, without the .* parts, the dictionary would transform the DataFrame like this:

   id                                        Long String
0   1  This is a long substring1 of text that has lot...
1   2  This is substring2 and also contains more text...
2   3  This is a long substring1 of text that has lot...
3   4  This is substring2 and also contains more text...
4   5  This is substring2 and also contains more text...
5   6  This is substring2 and also contains more text...
6   7    Within this string is substring3 that is unique
7   8  This is a long substring1 of text that has lot...
8   9    Within this string is substring3 that is unique
9  10    Within this string is substring3 that is unique

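Note that the only change you actually see is in the values containing "a short substring", because "substring1" and "substring2" are simply being replaced with themselves.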

Now, when we add the regex wildcards, we get this:

   id Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3
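As a side note on the error in the question: `str.contains` returns a boolean Series with one entry per row, so it cannot be used directly as the condition of an `if`. A loop in the spirit of the original can still work if that Series is used as a row mask instead. This is an alternative sketch rather than the approach described above; `mask` and `short_label` are illustrative names, and `regex=False` is an assumption that the dictionary keys are literal text.

```python
from io import StringIO

import pandas as pd

replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

# A three-row subset of the sample data from the question.
df = pd.read_csv(StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
"""), sep=";")

for substring, short_label in replacement_dict.items():
    # One True/False per row: True where the cell contains the literal substring.
    mask = df["Long String"].str.contains(substring, regex=False)
    # Overwrite only the matching rows with the short label.
    df.loc[mask, "Long String"] = short_label

print(df)
```

One caveat with this variant: replacements are applied in dictionary order, so a short label that itself contains a later key would be overwritten by a later iteration.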