I am currently trying to process the lingspam dataset by counting word occurrences across 600 files (400 emails and 200 spam messages). I have already made each word generic with the Porter Stemmer Algorithm, and I would also like my results to be standardised across every file for further processing. But I am unsure how to do this.. How can I add items to a collections.Counter, and then sort them into ascending order?
Resources so far:
- 8.3. collections — Container datatypes
- How to count co-ocurrences with collections.Counter() in python?
- Bag of Words model
To get the output below, I need to be able to add, in ascending order, items that may not exist within the file.
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0, 'univers', 0, 'sales', 1)]
I then intend to convert these into vectors using numpy.
[0,0,0]
[2,0,0]
[0,0,0]
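One way to get those fixed-length rows (a sketch, relying on the fact that indexing a Counter with a missing key returns 0 rather than raising KeyError) is to look up every word in search_list explicitly; the bare counts then form the vector for each file:

```python
from collections import Counter

search_list = ['money', 'univers', 'sales']

# Hypothetical tokens from one file, already stemmed.
tokens = ['money', 'money', 'univers']

counts = Counter(tokens)

# A Counter returns 0 for words it has never seen, so every file
# produces the same words in the same order.
pairs = [(w, counts[w]) for w in search_list]
print(pairs)    # [('money', 2), ('univers', 1), ('sales', 0)]

# The bare counts are the row for this file; numpy.array(vector)
# would give the numpy form.
vector = [counts[w] for w in search_list]
print(vector)   # [2, 1, 0]
```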
instead of..
printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]
How can I normalise my results from the Counter module into ascending order (while also adding to the count results items from my search_list that may not exist in the file)? I have already tried something like the below, which simply reads each text file and creates a list based on search_list.
import os
from collections import Counter

def parse_bag(directory, search_list):
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            path = os.path.join(dirpath, f)
            count_words(path, search_list)

def count_words(filename, search_list):
    # Read the file, keep only the words being searched for,
    # and print the five most common of them.
    with open(filename, 'r') as fh:
        textwords = fh.read().split()
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print("printing from " + filename)
    print(wordfreq)

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)
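For reference, a sketch of how count_words above could be changed to give the desired fixed output: replace most_common (which drops words that never occur) with an explicit lookup of each search term.

```python
from collections import Counter

def count_words(filename, search_list):
    with open(filename, 'r') as fh:
        textwords = fh.read().split()
    counts = Counter(textwords)
    # Look up every search term explicitly: a Counter returns 0 for
    # missing keys, so each file yields the same terms in the same order.
    wordfreq = [(w, counts[w]) for w in search_list]
    print("printing from " + filename)
    print(wordfreq)
```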
Thanks
What exactly do you mean by "ascending order"? You're not talking about the alphabetical order of the words in your 'search_list', are you? –
Or do you want each file's items sorted by their overall frequency across all files? –
I am saying that the result of 'wordfreq = Counter(filteredwords).most_common(5)' should be in 'ascending order', rather than ordered by which word occurs the most. – Killrawr
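If "ascending" here just means sorted from smallest count to largest, one option (a sketch) is to sort the Counter's items by count instead of calling most_common, which sorts descending:

```python
from collections import Counter

counts = Counter(['money', 'money', 'money', 'sale', 'univers', 'univers'])

# sorted() with the count as key gives ascending order;
# most_common()[::-1] would reverse the descending order instead.
ascending = sorted(counts.items(), key=lambda kv: kv[1])
print(ascending)   # [('sale', 1), ('univers', 2), ('money', 3)]
```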