Trying to remove punctuation from a column in pandas

This is the function I use to remove punctuation from a column in pandas.

import re

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

This is how I apply it.

review_without_punctuation = products['review'].apply(remove_punctuation)

Here, products is a pandas DataFrame.

This is the error message I get.

TypeError         Traceback (most recent call last) 
<ipython-input-19-196c188dfb67> in <module>() 
----> 1 review_without_punctuation = products['review'].apply(remove_punctuation) 

/Users/username/Dropbox/workspace/private/pydev/ml/classification/.env/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds) 
    2292    else: 
    2293     values = self.asobject 
-> 2294     mapped = lib.map_infer(values, f, convert=convert_dtype) 
    2295 
    2296   if len(mapped) and isinstance(mapped[0], Series): 

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)() 

<ipython-input-18-0950dc65d8b8> in remove_punctuation(text) 
     1 def remove_punctuation(text): 
----> 2  return re.sub(r'[^\w\s]','',text) 

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags) 
    189  a callable, it's passed the match object and must return 
    190  a replacement string to be used.""" 
--> 191  return _compile(pattern, flags).sub(repl, string, count) 
    192 
    193 def subn(pattern, repl, string, count=0, flags=0): 

TypeError: expected string or bytes-like object

What am I doing wrong?

2017-03-19 Melissa Stewart

Please give us a small example DataFrame. – Denziloe

Can you check whether any row of the 'review' column contains 'nan' or another non-string value? – Ali
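One way to run the check suggested in the comment above is sketched here. The `products` frame is a made-up stand-in (the asker's data is not shown), and the idea is simply to flag every row of 'review' that is not a Python `str`, since those are the values `re.sub` would reject.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame; the real data is not shown.
products = pd.DataFrame({"review": ["Great!", np.nan, "Not bad...", 42]})

# True for every row whose value is not a Python str (NaN is a float).
is_not_str = products["review"].map(lambda v: not isinstance(v, str))

# These are the rows that would make re.sub raise
# "expected string or bytes-like object".
bad_rows = products[is_not_str]
print(bad_rows)
```

If `bad_rows` comes back empty, the column is all strings and the cause lies elsewhere.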

You should try to avoid running pure Python code through apply() in Pandas. It is slow. Instead, use the special str property that exists on every Pandas string Series:

In [9]: s = pd.Series(['hello', 'a,b,c', 'hmm...'])
In [10]: s.str.replace(r'[^\w\s]', '', regex=True)  # regex=True is needed on pandas 2.0+, where the pattern is literal by default
Out[10]:
0    hello
1      abc
2      hmm
dtype: object
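A side benefit worth illustrating: as far as I know, the str accessor passes missing values through rather than raising, which sidesteps the TypeError in the question when the column contains NaN. The sketch below uses made-up data and assumes a pandas version that accepts the `regex` keyword.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in the middle.
s = pd.Series(["hello", "a,b,c", np.nan, "hmm..."])

# regex=True states explicitly that the pattern is a regular expression.
cleaned = s.str.replace(r"[^\w\s]", "", regex=True)
print(cleaned)
```

The missing entry stays missing in `cleaned`; the other rows lose their punctuation.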

2017-03-19 01:50:54

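Could you elaborate on "avoid running pure Python code through apply() in Pandas"? I always thought it was the most appropriate way, since it is vectorized and the most direct. –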

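@SergeyBushmanov: To me, "vectorized" means "without a loop that executes Python code for each row individually." All apply() does is run Python code on each row separately, and that is slow. It is just a nice way of doing something you shouldn't be doing. Only if you pass a ufunc (a special kind of function that is not a regular Python function) to apply() is it truly "vectorized" (i.e. fast). –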
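To make the distinction in this comment concrete, here is a small sketch with made-up numbers. It contrasts a per-element Python function with a NumPy ufunc. My understanding is that pandas hands a ufunc the whole Series at once instead of calling it per element, but that is an implementation detail and may vary between versions.

```python
import math

import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# A plain Python function: apply() calls it once per element.
per_element = s.apply(lambda x: math.sqrt(x))

# A NumPy ufunc: works on the whole array in compiled code.
whole_array = np.sqrt(s)

# Passing the ufunc itself to apply() gives the same values.
via_apply = s.apply(np.sqrt)

print(per_element.tolist(), whole_array.tolist(), via_apply.tolist())
```

All three give the same values; the difference only shows up in run time on long Series.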

It doesn't work because your apply is applied incorrectly.

The correct way to do it is:

import re
import pandas as pd
s = pd.Series(['hello', 'a,b,c', 'hmm...'])
s.apply(lambda x: re.sub(r'[^\w\s]', '', x))
0    hello
1      abc
2      hmm
dtype: object

(Hat tip to @John Zwinck for the regex.)
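If the column may contain NaN or other non-string values, as the comments on the question suggest, an apply-based version needs a guard. The following is a minimal sketch on made-up data, not the asker's actual fix: it leaves anything that is not a `str` untouched.

```python
import re

import numpy as np
import pandas as pd

def remove_punctuation(text):
    # Leave NaN and other non-string values untouched instead of raising.
    if not isinstance(text, str):
        return text
    return re.sub(r"[^\w\s]", "", text)

# Made-up data; dtype=object keeps NaN as a plain float in the column.
s = pd.Series(["hello", "a,b,c", np.nan, "hmm..."], dtype=object)
cleaned = s.apply(remove_punctuation)
print(cleaned)
```

Whether to skip, drop, or stringify the non-string rows depends on the data; skipping is just the least invasive choice for a sketch.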

Here is how it compares with the other solution:

%timeit s.apply(lambda x: re.sub(r'[^\w\s]', '',x)) 
%timeit s.str.replace(r'[^\w\s]', '') 
1000 loops, best of 3: 275 µs per loop 
1000 loops, best of 3: 310 µs per loop

2017-03-19 03:05:18

When the Series is long, the str.replace() method is faster than apply(). But not by much, because the function being applied is trivial. –
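To try this on a longer Series outside IPython, something like the sketch below could be used. The data size, repeat count, and helper names are arbitrary choices for illustration, and the timings will vary with machine and pandas version, so no numbers are claimed here.

```python
import re
import timeit

import pandas as pd

# Made-up test data: the three sample strings repeated 10,000 times.
s = pd.Series(["hello", "a,b,c", "hmm..."] * 10_000)
pattern = r"[^\w\s]"

def with_apply():
    # Calls re.sub from Python once per element.
    return s.apply(lambda x: re.sub(pattern, "", x))

def with_str_replace():
    # Uses the str accessor on the whole Series.
    return s.str.replace(pattern, "", regex=True)

# Sanity check: both approaches produce the same values.
assert with_apply().tolist() == with_str_replace().tolist()

runs = 10
for fn in (with_apply, with_str_replace):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```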
