熊貓生成字符串

我有一個CSV文件的單個列列這是這樣的：熊貓生成字符串

index,labels 
1,created the tower 
2,destroyed the tower 
3,created the swimming pool 
4,destroyed the swimming pool

現在，如果我通過我想到位標籤列的列的列表（不包含所有在標籤欄的話）

['created','tower','destroyed','swimming pool']

我想獲得的數據框：

index,created,destroyed,tower,swimming pool 
1,1,0,1,0 
2,0,1,1,0 
3,1,0,0,1 
4,0,1,0,1

我看着get_du mmies，但沒有幫助

來源

2017-09-05 GKS

您可以在循環中調用str.contains。

print(df) 

         labels 
0   created the tower 
1   destroyed the tower 
2 created the swimming pool 
3 destroyed the swimming pool 

req = ['created', 'destroyed', 'tower', 'swimming pool'] 

out = pd.concat([df['labels'].str.contains(x) for x in req], 1, keys=req).astype(int) 
print(out) 

    created destroyed tower swimming pool 
0  1   0  1    0 
1  0   1  1    0 
2  1   0  0    1 
3  0   1  0    1

來源

2017-09-05 02:23:31

不錯的一個傢伙〜:) – Wen

@Wen我不相信這是最好的...也許你可以想到更酷的東西？ –

我覺得這很好，我正在使用的是'nltk'，個人認爲我使用的那個比你的解決方案更好。 – Wen

在你的情況，如果打破的話是the你可以使用以下來實現它。（PS：這是您更好地使用COLDSPEED的答案當斷字不僅The）

pd.get_dummies(df['labels'].str.split('the').apply(pd.Series)) 

Out[424]: 
    0_created 0_destroyed 1_ swimming pool 1_ tower 
0   1    0     0   1 
1   0    1     0   1 
2   1    0     1   0 
3   0    1     1   0

來源

2017-09-05 03:00:49 Wen

有趣的方法...我沒有考慮使用''。我認爲.apply（pd.Series）很慢，但可以使用構造函數'pd.get_dummies（pd.DataFrame（df ['labels']。str.split（'the'）。values.tolist ）））'現在應該快兩倍。 –

@cᴏʟᴅsᴘᴇᴇᴅ是的，你是對的謝謝你！ :) – Wen

@文：很好的解決方案。但'''不是唯一的破字 – GKS

import re 
import pandas as pd 
df = pd.DataFrame({'index': [1, 2, 3, 4], 'labels': ['created the tower', 'destroyed the tower', 'created the swimming pool', 'destroyed the swimming pool']}) 

columns = ['created','destroyed','tower','swimming pool'] 
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns]) 
result = (df['labels'].str.extractall(pat)).groupby(level=0).count() 
result.columns = columns 
print(result)

產生

created destroyed tower swimming pool 
0  1   0  1    0 
1  0   1  1    0 
2  1   0  0    1 
3  0   1  0    1

大部分是由str.extractall所做的工作：

In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)') 
Out[808]: 
       0   1  2    3 
    match           
0 0  created  NaN NaN   NaN 
    1   NaN  NaN tower   NaN 
1 0   NaN destroyed NaN   NaN 
    1   NaN  NaN tower   NaN 
2 0  created  NaN NaN   NaN 
    1   NaN  NaN NaN swimming pool 
3 0   NaN destroyed NaN   NaN 
    1   NaN  NaN NaN swimming pool

由於每個匹配都放在其自己的行上，所以可以通過執行groupby/count操作來獲得期望的結果，其中我們按索引的第一級（原始索引）進行分組。

注意，Python的re模塊具有硬編碼限制到允許的命名基團的數目：

/usr/lib/python3.4/sre_compile.py in compile(p, flags) 
    577  if p.pattern.groups > 100: 
    578   raise AssertionError(
--> 579    "sorry, but this version only supports 100 named groups" 
    580   ) 
    581 

AssertionError: sorry, but this version only supports 100 named groups

這限制了上面使用到最大100個關鍵字的extractall方法。

這裏是這表明cᴏʟᴅsᴘᴇᴇᴅ的溶液（至少對於一定範圍內的使用的情況下）可能是最快的基準：

In [76]: %timeit using_contains(ser, keywords) 
10 loops, best of 3: 63.4 ms per loop 

In [77]: %timeit using_defchararray(ser, keywords) 
10 loops, best of 3: 90.6 ms per loop 

In [78]: %timeit using_extractall(ser, keywords) 
10 loops, best of 3: 126 ms per loop

這裏是我所使用的設置：

import string 
import numpy as np 
import pandas as pd 

def using_defchararray(ser, keywords): 
    """ 
    https://stackoverflow.com/a/46046558/190597 (piRSquared) 
    """ 
    v = ser.values.astype(str) 
    # >>> (np.core.defchararray.find(v[:, None], columns) >= 0) 
    # array([[ True, False, True, False], 
    #  [False, True, True, False], 
    #  [ True, False, False, True], 
    #  [False, True, False, True]], dtype=bool) 

    result = pd.DataFrame(
     (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int), 
     index=ser.index, columns=keywords) 
    return result 

def using_extractall(ser, keywords): 
    """ 
    https://stackoverflow.com/a/46046417/190597 (unutbu) 
    """ 
    pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords]) 
    result = (ser.str.extractall(pat)).groupby(level=0).count() 
    result.columns = keywords 
    return result 

def using_contains(ser, keywords): 
    """ 
    https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ) 
    """ 
    return (pd.concat([ser.str.contains(x) for x in keywords], 
         axis=1, keys=keywords).astype(int)) 

def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100): 
    return (np.random.choice(list(letters), size*strlen) 
      .view('|U{}'.format(strlen))) 

keywords = make_random_str_array(size=99) 
arr = np.random.choice(keywords, size=(1000, 5),replace=True) 
ser = pd.Series([' '.join(row) for row in arr])

請務必在自己的機器上檢查基準測試，並使用類似於您的使用案例的設置。結果可能因多種因素而異，如系列的大小，ser，長度爲keywords，硬件，操作系統，NumPy版本，Pandas和Python，以及它們是如何編譯的。

來源

2017-09-05 03:03:20 unutbu

不錯的答案:-) – Wen

使用numpy.core.defchararray.find和numpy braodcasting

from numpy.core.defchararray import find 

v = df['labels'].values.astype(str) 
l = ['created','tower','destroyed','swimming pool'] 

pd.DataFrame(
    (find(v[:, None], l) >= 0).astype(int), 
    df.index, l 
) 

     created tower destroyed swimming pool 
index           
1   1  1   0    0 
2   0  1   1    0 
3   1  0   0    1 
4   0  0   1    1

find將播出整個字符串數組，我們提供的尺寸str.find功能。 find從第一個數組返回字符串中第一個找到字符串的位置。如果未找到，則返回-1。因此，我們可以通過評估find的返回值是否大於或等於0來評估字符串是否被找到。

來源

2017-09-05 03:22:49 piRSquared

投票的速度!!!，我會用這個爲我的堆疊模型！ :) – Wen

謝謝@文！在我看來，這更直觀，更優雅。它很好地矢量化循環。 – piRSquared

@piRSquared：這太棒了。在我的基準測試中，您的解決方案似乎是最快的。如何逆向變換？多列到單列（忽略'''） – GKS

熊貓生成字符串

回答

相關問題