How do I compute tf-idf for a list of dictionaries?

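I have a list of texts, where each text is stored in a dictionary with its ID as the key and the text data as the value. How do I compute tf-idf for this data? For example: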

{1: 'This is cat', 2: 'Is this the first document?', 3: 'And the third one.'}

Source

2015-05-03 Charul

Can you tell us what you have tried and what went wrong?

First, convert your dictionary into a list of strings:

X_all = list(d.values())

Build the TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=3 drops terms found in fewer than 3 documents; lower it for a very small corpus
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words='english')
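To make the use_idf, smooth_idf and sublinear_tf flags less opaque, here is a rough pure-Python sketch of the weighting they select, following the formulas in the scikit-learn documentation: idf = ln((1 + n) / (1 + df)) + 1, tf replaced by 1 + ln(tf), and each document vector scaled to unit L2 length. The helper name tfidf_weights is hypothetical, and the sketch assumes the documents are already tokenized and lowercased. It is an illustration, not the library's implementation.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, as selected by smooth_idf=True
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    result = []
    for doc in docs:
        counts = Counter(doc)
        # Sublinear tf, as selected by sublinear_tf=True
        raw = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2-normalize the document vector; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        result.append({t: w / norm for t, w in raw.items()})
    return result

# Hand-tokenized version of the sample data from the question
docs = [['this', 'is', 'cat'],
        ['is', 'this', 'the', 'first', 'document'],
        ['and', 'the', 'third', 'one']]
weights = tfidf_weights(docs)
```

In the first document, 'cat' occurs in only one document, so it ends up with a larger weight than 'this' or 'is', which each occur in two.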

Then you can build your model:

X_all = tfv.fit_transform(X_all)  # fit the vocabulary and idf weights, then transform

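Here the X_all passed in is the list of text documents; the result is a sparse document-term matrix of tf-idf weights.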
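Putting the steps together on the sample dictionary from the question might look like the sketch below. It assumes scikit-learn is installed and uses the vectorizer's default settings with min_df=1 rather than the parameters above: no term in the sample appears in all three documents, so min_df=3 would prune the entire vocabulary. The names doc_ids, texts and scores are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data from the question: document ID -> text
d = {1: 'This is cat', 2: 'Is this the first document?', 3: 'And the third one.'}

# Dicts preserve insertion order (Python 3.7+), so IDs and texts stay aligned
doc_ids = list(d.keys())
texts = list(d.values())

# min_df=1 keeps every term; all other settings are the library defaults
vectorizer = TfidfVectorizer(min_df=1)
tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per document

# Map column indices back to terms, then collect the non-zero weights per ID
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
scores = {doc_id: {} for doc_id in doc_ids}
coo = tfidf.tocoo()
for r, c, w in zip(coo.row, coo.col, coo.data):
    scores[doc_ids[int(r)]][index_to_term[int(c)]] = float(w)
```

After this, scores[1] holds the tf-idf weight of each term in document 1, keyed by term, so the result can be looked up by the original dictionary IDs.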

Source

2015-05-03 11:25:38 Ayusek
