2017-09-05 121 views
8

Pandas: generating columns from a single column of strings

I have a CSV file with a single column that looks like this:

index,labels 
1,created the tower 
2,destroyed the tower 
3,created the swimming pool 
4,destroyed the swimming pool 

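Now, suppose I pass a list of the columns I want in place of the labels column (the list does not contain every word found in the labels column):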

['created','tower','destroyed','swimming pool'] 

I want to get this dataframe:

index,created,destroyed,tower,swimming pool 
1,1,0,1,0 
2,0,1,1,0 
3,1,0,0,1 
4,0,1,0,1 

I looked into get_dummies, but it didn't help.

Answers

8

You can call str.contains in a loop:

print(df) 

                        labels
0            created the tower
1          destroyed the tower
2    created the swimming pool
3  destroyed the swimming pool

req = ['created', 'destroyed', 'tower', 'swimming pool'] 

# Pass axis by keyword; recent pandas versions no longer accept it positionally.
out = pd.concat([df['labels'].str.contains(x) for x in req], axis=1, keys=req).astype(int)
print(out)

   created  destroyed  tower  swimming pool
0        1          0      1              0
1        0          1      1              0
2        1          0      0              1
3        0          1      0              1
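One caveat worth sketching: str.contains treats its argument as a regular expression by default, so a keyword containing characters such as '.' or '(' could match more (or less) than intended. The variant below is a minimal sketch that assumes the keywords are meant as literal substrings; it passes regex=False and otherwise follows the approach above.

```python
import pandas as pd

df = pd.DataFrame({'labels': ['created the tower',
                              'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})
req = ['created', 'destroyed', 'tower', 'swimming pool']

# regex=False makes each keyword a literal substring rather than a regex pattern.
out = pd.concat(
    [df['labels'].str.contains(x, regex=False) for x in req],
    axis=1, keys=req,
).astype(int)
print(out)
```

For the sample data the result is the same as with the default regex matching, because none of the keywords contain regex metacharacters.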
+2

Nice one, man~ :) – Wen

+0

@Wen I'm not convinced this is the best... maybe you can think of something cooler? –

+0

I think this is good. What I am using is 'nltk', and personally I think the one I use is better than your solution. – Wen

4

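In your case, if the separator word is 'the', you can use the following to achieve it. (PS: you are better off using COLDSPEED's answer when the separator word is not only 'the'.)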

pd.get_dummies(df['labels'].str.split('the').apply(pd.Series)) 

Out[424]: 
   0_created   0_destroyed   1_ swimming pool  1_ tower
0          1             0                  0         1
1          0             1                  0         1
2          1             0                  1         0
3          0             1                  1         0
+1

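Interesting approach... I hadn't considered using ''. I think .apply(pd.Series) is slow, but you can use the constructor instead: 'pd.get_dummies(pd.DataFrame(df['labels'].str.split('the').values.tolist()))'. That should be about twice as fast. –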

+0

@cᴏʟᴅsᴘᴇᴇᴅ Yes, you are right, thank you! :) – Wen

+0

@Wen: Nice solution. But '' is not the only separator word – GKS

9
import re 
import pandas as pd 
df = pd.DataFrame({'index': [1, 2, 3, 4], 'labels': ['created the tower', 'destroyed the tower', 'created the swimming pool', 'destroyed the swimming pool']}) 

columns = ['created','destroyed','tower','swimming pool'] 
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns]) 
result = (df['labels'].str.extractall(pat)).groupby(level=0).count() 
result.columns = columns 
print(result) 

yields

   created  destroyed  tower  swimming pool
0        1          0      1              0
1        0          1      1              0
2        1          0      0              1
3        0          1      0              1

Most of the work is done by str.extractall:

In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)') 
Out[808]: 
               0          1      2              3
  match
0 0      created        NaN    NaN            NaN
  1          NaN        NaN  tower            NaN
1 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN  tower            NaN
2 0      created        NaN    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
3 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN    NaN  swimming pool

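Since each match is placed on its own row, the desired result can be obtained with a groupby/count operation, grouping by the first level of the index (the original index).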
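One consequence of grouping on the matches is that a row with no match at all produces no rows in the extractall output, so it is absent from the grouped result. The sketch below uses a made-up three-row Series (not from the question) to show this, and assumes that such rows should appear as all zeros; it restores them with reindex.

```python
import re
import pandas as pd

# Hypothetical data: the middle row contains none of the keywords.
ser = pd.Series(['created the tower',
                 'nothing relevant here',
                 'destroyed the swimming pool'])
columns = ['created', 'destroyed', 'tower', 'swimming pool']
pat = '|'.join('({})'.format(re.escape(c)) for c in columns)

counts = ser.str.extractall(pat).groupby(level=0).count()
counts.columns = columns

# Row 1 produced no matches, so it is missing from counts;
# reindexing against the original index adds it back filled with zeros.
result = counts.reindex(ser.index, fill_value=0)
print(result)
```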


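Note that Python's re module (in older versions, such as the Python 3.4 in the traceback below) has a hard-coded limit on the number of groups allowed in a pattern: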

/usr/lib/python3.4/sre_compile.py in compile(p, flags) 
    577  if p.pattern.groups > 100: 
    578   raise AssertionError(
--> 579    "sorry, but this version only supports 100 named groups" 
    580   ) 
    581 

AssertionError: sorry, but this version only supports 100 named groups 

This limits the extractall approach used above to a maximum of 100 keywords.
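If the group limit is a concern, one possible workaround is to wrap the whole alternation in a single capturing group and build the indicator columns from the matched text instead of from the group position. This is a sketch and not part of the original answers; the use of get_dummies followed by a groupby sum is an illustrative choice.

```python
import re
import pandas as pd

ser = pd.Series(['created the tower',
                 'destroyed the tower',
                 'created the swimming pool',
                 'destroyed the swimming pool'])
keywords = ['created', 'destroyed', 'tower', 'swimming pool']

# A single capturing group, however many keywords there are.
pat = '({})'.format('|'.join(re.escape(k) for k in keywords))

# Series of matched keywords, indexed by (original row, match number).
matches = ser.str.extractall(pat)[0]

result = (pd.get_dummies(matches, dtype=int)   # one column per matched keyword
            .groupby(level=0).sum()            # collapse back to one row per original row
            .reindex(index=ser.index, columns=keywords, fill_value=0))
print(result)
```

The final reindex puts the columns in the requested order and adds zero-filled rows or columns for anything that never matched.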


Here is a benchmark which suggests that cᴏʟᴅsᴘᴇᴇᴅ's solution may be the fastest (at least for a certain range of use cases):

In [76]: %timeit using_contains(ser, keywords) 
10 loops, best of 3: 63.4 ms per loop 

In [77]: %timeit using_defchararray(ser, keywords) 
10 loops, best of 3: 90.6 ms per loop 

In [78]: %timeit using_extractall(ser, keywords) 
10 loops, best of 3: 126 ms per loop 

Here is the setup I used:

import re  # needed by using_extractall
import string
import numpy as np
import pandas as pd

def using_defchararray(ser, keywords): 
    """ 
    https://stackoverflow.com/a/46046558/190597 (piRSquared) 
    """ 
    v = ser.values.astype(str) 
    # >>> (np.core.defchararray.find(v[:, None], columns) >= 0) 
    # array([[ True, False, True, False], 
    #  [False, True, True, False], 
    #  [ True, False, False, True], 
    #  [False, True, False, True]], dtype=bool) 

    result = pd.DataFrame(
     (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int), 
     index=ser.index, columns=keywords) 
    return result 

def using_extractall(ser, keywords): 
    """ 
    https://stackoverflow.com/a/46046417/190597 (unutbu) 
    """ 
    pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords]) 
    result = (ser.str.extractall(pat)).groupby(level=0).count() 
    result.columns = keywords 
    return result 

def using_contains(ser, keywords): 
    """ 
    https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ) 
    """ 
    return (pd.concat([ser.str.contains(x) for x in keywords], 
         axis=1, keys=keywords).astype(int)) 

def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100): 
    return (np.random.choice(list(letters), size*strlen) 
      .view('|U{}'.format(strlen))) 

keywords = make_random_str_array(size=99) 
arr = np.random.choice(keywords, size=(1000, 5),replace=True) 
ser = pd.Series([' '.join(row) for row in arr]) 

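Be sure to run the benchmarks on your own machine, with a setup similar to your use case. Results may vary with many factors, such as the size of the Series ser, the length of keywords, the hardware, the OS, the versions of NumPy, Pandas and Python, and how they were compiled.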

+0

Nice answer :-) – Wen

7

Using numpy.core.defchararray.find and numpy broadcasting:

from numpy.core.defchararray import find 

v = df['labels'].values.astype(str) 
l = ['created','tower','destroyed','swimming pool'] 

pd.DataFrame(
    (find(v[:, None], l) >= 0).astype(int), 
    df.index, l 
) 

       created  tower  destroyed  swimming pool
index
1            1      1          0              0
2            0      1          1              0
3            1      0          0              1
4            0      0          1              1

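find broadcasts the str.find function across the dimensions of the string arrays we provide. For each string in the first array, it returns the position at which the string from the second array is first found. If the string is not found, it returns -1. Because of this, we can tell whether a string was found by checking whether the return value of find is greater than or equal to 0.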
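To make the broadcasting concrete, here is a small sketch with two made-up strings and two keywords. It uses np.char.find, the public NumPy name for the same function; the variable names are illustrative only.

```python
import numpy as np

v = np.array(['created the tower', 'destroyed the swimming pool'])
keywords = ['tower', 'pool']

# v[:, None] has shape (2, 1); it broadcasts against the 2 keywords,
# giving one position per (string, keyword) pair in a (2, 2) array.
pos = np.char.find(v[:, None], keywords)
print(pos)
# [[12 -1]
#  [-1 23]]

# A position >= 0 means the keyword occurs in that string.
found = pos >= 0
print(found.astype(int))
# [[1 0]
#  [0 1]]
```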

+0

Upvoted for the speed!!! I will use this for my stacking model! :) – Wen

+1

Thanks @Wen! In my opinion, this is more intuitive and more elegant. It vectorizes the loop nicely. – piRSquared

+0

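@piRSquared: This is great. In my benchmarks, your solution appears to be the fastest. How would you do the reverse transformation? Multiple columns to a single column (ignoring '') – GKS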