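Converting a list with duplicate values into a frequency-count dictionary in Python

I have a list of strings that contains duplicate values. I want to build a dictionary of words in which each key is a word and its value is that word's frequency count, and then write the words and their counts to a CSV file.
This is how I am doing it: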
#!/usr/bin/env python
# encoding: utf-8
# -*- coding: utf8 -*-
import csv
from nltk.tokenize import TweetTokenizer
import numpy as np
tknzr = TweetTokenizer()
#print tknzr.tokenize(s0)
with open("dispn.csv","r") as file1,\
     open("dispn_tokenized.csv","w") as file2,\
     open("dispn_tokenized_count.csv","w") as file3:
    mycsv = list(csv.reader(file1))
    words = []
    words_set = []
    tokenize_count = {}
    for row in mycsv:
        lst = tknzr.tokenize(row[2])
        for l in lst:
            file2.write("\""+str(row[2])+"\""+","+"\""+str(l.encode('utf-8'))+"\""+"\n")
            l = l.lower()
            words.append(l)
    words_set = list(set(words))
    print "len of words_set : " + str(len(words_set))
    for word in words_set:
        tokenize_count[word] = 1
    for word in words:
        tokenize_count[word] = tokenize_count[word]+1
    print "len of tokenized words_set : " + str(len(tokenize_count))
    #print "Tokenized_words count : "
    #print tokenize_count
    #print "================================================================="
    i = 0
    for wrd in words_set:
        #i = i+1
        print "i : " +str(i)
        file3.write("\""+str(i)+"\""+","+"\""+str(wrd.encode('utf-8'))+"\""+","+"\""+str(tokenize_count[wrd])+"\""+"\n")
But in the CSV I still find some repeated values, such as 1, 5, 4, 7, 9.
Some information about the approach:
- dispn.csv contains the usernames of the users,
which I am tokenizing with the help of the nltk module.
- After tokenizing them, I store the resulting words in the list 'words'
and write each word, alongside its username, to a CSV.
- I create a set from 'words' to get its unique values
and store it in 'words_set'.
- Then I create the dictionary 'tokenize_count', with each word as a key and
its frequency count as the value, and write it to a CSV.
Why are only some of the values repeated? Is there a better way to do the same thing? Please help.
[`from collections import Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) –
[How to count the frequency of the elements in a list?](http://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-a-list) – alfasin
@RNar: Could you post your comment as an answer so that I can accept it? Thanks, it solved my problem –
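
For reference, here is a minimal sketch of the `collections.Counter` approach suggested in the comments. It is not the accepted answer itself. It assumes Python 3, and the helper names `count_tokens` and `write_counts` and the sample tokens are made up for illustration; in the question's script the tokens would come from `tknzr.tokenize(row[2])`. Two things in the posted code are worth noting: `tokenize_count[word]` starts at 1 and is then incremented once per occurrence, so every stored count is one too high, and `i` stays at 0 because `i = i+1` is commented out. If the repeated values in the CSV are the counts, that may simply be different words sharing the same frequency.

```python
import csv
import io
from collections import Counter


def count_tokens(tokens):
    """Return a Counter mapping each lowercased token to its frequency."""
    return Counter(token.lower() for token in tokens)


def write_counts(counts, out_file):
    """Write one (index, word, count) row per distinct word, most frequent first."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    for i, (word, count) in enumerate(counts.most_common(), start=1):
        writer.writerow([i, word, count])


# Hypothetical tokens standing in for the tokenizer output in the question.
tokens = ["John", "doe", "john", "Smith", "JOHN", "doe"]
counts = count_tokens(tokens)

# An in-memory buffer is used here; a real file object works the same way.
buffer = io.StringIO()
write_counts(counts, buffer)
print(buffer.getvalue())
```

`Counter` starts every key at zero, so no separate set or initialisation pass is needed, and `csv.writer` takes care of the quoting that the original builds by hand with string concatenation.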