考慮以下演示:
來源DF:
In [2]: df
Out[2]:
text
0 is it good movie
1 wooow is it very goode
2 bad movie
解決方案:讓我們創建一個SparseDataFrame了TFIDF稀疏矩陣:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
sdf = pd.SparseDataFrame(vect.fit_transform(df['text']),
columns=vect.get_feature_names(),
default_fill_value=0)
sdf['text'] = df['text']
結果:
In [13]: sdf
Out[13]:
bad good goode wooow text
0 0.0 1.0 0.000000 0.000000 is it good movie
1 0.0 0.0 0.707107 0.707107 wooow is it very goode
2 1.0 0.0 0.000000 0.000000 bad movie
In [14]: sdf.memory_usage()
Out[14]:
Index 80
bad 8
good 8
goode 8
wooow 8
text 24
dtype: int64
P. S在.memory_usage()
注意 - 我們沒有失去「空閒」。如果我們將使用pd.concat
,join
,等 - 我們將失去「稀疏性」,因爲所有這些方法都會生成一個新的常規(未稀疏)合併的DataFrame的副本