import re
import pandas as pd

df = pd.DataFrame({'index': [1, 2, 3, 4],
                   'labels': ['created the tower', 'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})
columns = ['created', 'destroyed', 'tower', 'swimming pool']
# Build one alternation pattern with a capture group per keyword
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns])
# Each match lands on its own row; counting non-null entries per
# original row index gives one count per keyword
result = df['labels'].str.extractall(pat).groupby(level=0).count()
result.columns = columns
print(result)
This produces:
   created  destroyed  tower  swimming pool
0        1          0      1              0
1        0          1      1              0
2        1          0      0              1
3        0          1      0              1
Most of the work is done by str.extractall:
In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)')
Out[808]: 
               0          1      2              3
  match                                          
0 0      created        NaN    NaN            NaN
  1          NaN        NaN  tower            NaN
1 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN  tower            NaN
2 0      created        NaN    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
3 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
Since each match is placed on its own row, the desired result can be obtained with a groupby/count operation, grouping by the first level of the index (the original row index).
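Applying the count step on its own makes the mechanics visible: counting non-null entries per capture group within each group of the first index level yields positional columns 0 through 3. A sketch of the intermediate result, before the final column rename:

df['labels'].str.extractall(pat).groupby(level=0).count()
#    0  1  2  3
# 0  1  0  1  0
# 1  0  1  1  0
# 2  1  0  0  1
# 3  0  1  0  1

The result.columns = columns assignment then simply relabels these positional columns with the keyword names.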
Note that Python's re module has a hard-coded limit on the number of named groups allowed:
/usr/lib/python3.4/sre_compile.py in compile(p, flags)
577 if p.pattern.groups > 100:
578 raise AssertionError(
--> 579 "sorry, but this version only supports 100 named groups"
580 )
581
AssertionError: sorry, but this version only supports 100 named groups
This limits the extractall approach above to a maximum of 100 keywords.
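One way around this limit (a sketch of my own, not part of the original answer; the function name and the chunk_size parameter are hypothetical) is to split the keywords into chunks of at most 100, run the extractall pipeline once per chunk, and concatenate the per-chunk counts:

import re
import pandas as pd

def using_extractall_chunked(ser, keywords, chunk_size=100):
    # Run the extractall/count pipeline once per chunk of <= 100 keywords,
    # then glue the per-chunk count frames together column-wise.
    pieces = []
    for i in range(0, len(keywords), chunk_size):
        chunk = keywords[i:i + chunk_size]
        pat = '|'.join('({})'.format(re.escape(c)) for c in chunk)
        counts = ser.str.extractall(pat).groupby(level=0).count()
        counts.columns = chunk
        pieces.append(counts)
    # Rows with no match in a chunk drop out of that chunk's groupby
    # result, so reindex against the original Series and fill with zeros.
    return (pd.concat(pieces, axis=1)
              .reindex(ser.index)
              .fillna(0)
              .astype(int))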
Here is a benchmark suggesting that cᴏʟᴅsᴘᴇᴇᴅ's solution (at least for a certain range of use cases) may be the fastest:
In [76]: %timeit using_contains(ser, keywords)
10 loops, best of 3: 63.4 ms per loop
In [77]: %timeit using_defchararray(ser, keywords)
10 loops, best of 3: 90.6 ms per loop
In [78]: %timeit using_extractall(ser, keywords)
10 loops, best of 3: 126 ms per loop
Here is the setup I used:
import re
import string
import numpy as np
import pandas as pd

def using_defchararray(ser, keywords):
    """
    https://stackoverflow.com/a/46046558/190597 (piRSquared)
    """
    v = ser.values.astype(str)
    # >>> (np.core.defchararray.find(v[:, None], keywords) >= 0)
    # array([[ True, False,  True, False],
    #        [False,  True,  True, False],
    #        [ True, False, False,  True],
    #        [False,  True, False,  True]], dtype=bool)
    result = pd.DataFrame(
        (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int),
        index=ser.index, columns=keywords)
    return result

def using_extractall(ser, keywords):
    """
    https://stackoverflow.com/a/46046417/190597 (unutbu)
    """
    pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords])
    result = ser.str.extractall(pat).groupby(level=0).count()
    result.columns = keywords
    return result

def using_contains(ser, keywords):
    """
    https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ)
    """
    return (pd.concat([ser.str.contains(x) for x in keywords],
                      axis=1, keys=keywords).astype(int))

def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

keywords = make_random_str_array(size=99)
arr = np.random.choice(keywords, size=(1000, 5), replace=True)
ser = pd.Series([' '.join(row) for row in arr])
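As a quick sanity check (my addition, not part of the original benchmark), the two indicator-based implementations can be verified to agree on the generated data:

res_contains = using_contains(ser, keywords)
res_defchar = using_defchararray(ser, keywords)
# Both return 0/1 indicator frames, so their values should match exactly.
# using_extractall returns occurrence *counts*, which can exceed 1 when a
# keyword appears more than once in a row, so it is not compared here.
assert np.array_equal(res_contains.values, res_defchar.values)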
Be sure to run the benchmarks on your own machine, with a setup similar to your use case. Results may vary due to many factors, such as the size of the Series ser, the length of keywords, the hardware, the OS, the versions of NumPy, Pandas, and Python, and how they were compiled.
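If you are not working in IPython, a rough plain-Python equivalent of the %timeit runs above (a minimal sketch; the repeat and number settings are arbitrary choices) is:

import timeit

for func in (using_contains, using_defchararray, using_extractall):
    # Best of 3 repeats of 10 calls each, reported in ms per call
    best = min(timeit.repeat(lambda: func(ser, keywords), repeat=3, number=10))
    print('{}: {:.1f} ms per loop'.format(func.__name__, best / 10 * 1000))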
Nice one, man ~ :) – Wen
@Wen I don't believe this is the best... maybe you can think of something cooler? –
I think this is fine; I'm using 'nltk', and personally I think the one I'm using is better than your solution. – Wen