Using TfidfVectorizer on a dictionary of lists

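I have a large corpus stored as a dictionary of 25 lists, which I'd like to analyze with SKLearn's TfidfVectorizer. Each list contains many strings. I care about both the overall term frequency (tf) across the whole corpus and the most distinctive words in each of the 25 lists of strings (idf). The problem is that I haven't found a way to pass this kind of object to TfidfVectorizer. Passing the dict only vectorizes the keys, and passing the values yields AttributeError: 'list' object has no attribute 'lower' (I guess it expects strings).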

Thanks in advance.

Update: I'm now including my preprocessing step, which uses a dict of area: ID pairs called buckets:

import operator
import os

from sklearn.feature_extraction.text import TfidfVectorizer

# buckets maps each area to a list of IDs; replace the IDs with file contents
for area in buckets:
    area_docs = []
    for value in buckets[area]:
        if 'filename_%s.txt' % value in os.listdir(directory):
            fin = open(directory+'/filename_%s.txt' % value, 'r').read()
            area_docs.append(fin)
            buckets[area] = area_docs

corpus = buckets.values()  # one list of strings per area
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(corpus)  # this call raises the AttributeError
idf = vectorizer.idf_
d = dict(zip(vectorizer.get_feature_names(), idf))
sorted_d = sorted(d.items(), key=operator.itemgetter(1))
sorted_d[:50]


2017-05-31 6Bacon

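TfidfVectorizer is used to convert a collection of raw documents into a matrix of TF-IDF features. It wants a sequence of documents. Your dictionary seems to have been processed in some way, so it isn't clear what you want `TfidfVectorizer` to do.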

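Thanks @juanpa.arrivillaga. Edited to reflect that the list items are multi-word strings (~2000 words each in my actual case). The lists are essentially sub-corpora. In effect, I want to know the most distinctive words within a given subgroup (list). – 6Bacon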

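TfidfVectorizer wants a list of strings, where each string is one document. Your area_docs variable is already a list of strings, so when you call buckets.values() you get a list of lists of strings, which is one level of nesting too many for TfidfVectorizer. You need to flatten this list. The code below is in Python 3; relative to your code, it changes only one line and adds one new line: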

import operator
import os

from sklearn.feature_extraction.text import TfidfVectorizer

for area in buckets:
    area_docs = []
    for value in buckets[area]:
        if 'filename_%s.txt' % value in os.listdir(directory):
            fin = open(directory+'/filename_%s.txt' % value, 'r').read()
            area_docs.append(fin)
            buckets[area] = area_docs

corpus = list(buckets.values())  # Get your list of lists of strings
corpus = sum(corpus, [])  # Good trick for flattening 2D lists to 1D
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
d = dict(zip(vectorizer.get_feature_names(), idf))
sorted_d = sorted(d.items(), key=operator.itemgetter(1))
sorted_d[:50]
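
As a side note on the flattening line above: `sum(corpus, [])` builds a new list at every step, which can get slow when there are many sublists. Below is a minimal sketch of an equivalent flattening with `itertools.chain.from_iterable`. The `buckets` dict here is a small made-up stand-in (the area names and strings are placeholders, not the asker's data):

```python
from itertools import chain

# Hypothetical stand-in for the asker's dict: area name -> list of document strings.
buckets = {
    "north": ["the quick brown fox", "jumps over the lazy dog"],
    "south": ["a completely different text"],
}

# One flat list of strings, in the same order that sum(list(buckets.values()), []) gives.
corpus = list(chain.from_iterable(buckets.values()))

print(len(corpus))
print(all(isinstance(doc, str) for doc in corpus))
```

Either form hands TfidfVectorizer a flat sequence of strings, which is the shape it expects.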

That should do it!


2017-05-31 15:59:17
