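Python and n-grams

Aster user here, trying to move completely to Python for basic text analytics. I want to replicate the output of the Aster ngram function in Python using nltk or some other module. I need to be able to do this for n-grams of 1 through 4 words, with the output written to CSV.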

Data:

Unique_ID, Text_Narrative

Required output:

Unique_id, ngram(token), ngram(frequency)

Example output:

023345 "I" 1
023345 "love" 1
023345 "Python" 1

Source

2017-08-14 Josh Chilton

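Hi, welcome to SO. Can you include some of the code you have tried? What is the main problem?

We are not a coding service. Please show us what you have done and where you are stuck.

You will need 'open' or 'csv.writer' for the file writing, and then I would recommend 'Counter'; that is about it. Do you want the frequencies within each unique_ID string, or across all of them together?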

I wrote this simple version using only the Python standard library, for educational purposes.

Production code should use spacy and pandas.

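import collections
from operator import itemgetter as at

# Each input line is "Unique_ID,Text_Narrative" (no header row is assumed).
# Split on the first comma only, so commas inside the narrative stay in the text.
with open("input.csv", 'r') as f:
    data = [l.rstrip('\n').split(',', 1) for l in f if ',' in l]

# Join the words of a window into one n-gram, but only when every word in the
# window belongs to the same Unique_ID; otherwise return an empty list.
spaced = lambda t: (t[0][0], ' '.join(map(at(1), t))) if all(x[0] == t[0][0] for x in t) else []

unigrams = [(i, w) for i, d in data for w in d.split()]
bigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:])))
trigrams = filter(any, map(spaced, zip(unigrams, unigrams[1:], unigrams[2:])))

# Count each (Unique_ID, n-gram) pair and write one CSV row per pair.
with open("output.csv", 'w') as f:
    for ngram in [unigrams, bigrams, trigrams]:
        counts = collections.Counter(ngram)
        for t, count in counts.items():
            f.write("{i},{w},{c}\n".format(c=count, i=t[0], w=t[1]))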
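The question asks for n-grams of 1 through 4 words, while the code above stops at trigrams. The following is a minimal sketch of one way to generalize it, not a replica of the Aster function: the helper names (ngrams, count_ngrams, write_counts) and the sample row are made up for illustration, and tokens are assumed to be whitespace-separated words.

```python
import csv
from collections import Counter


def ngrams(words, n):
    """Return the n-grams of a word list as space-joined strings."""
    return [" ".join(words[k:k + n]) for k in range(len(words) - n + 1)]


def count_ngrams(rows, max_n=4):
    """Count n-grams (n = 1..max_n) separately for each (unique_id, text) row."""
    counts = Counter()
    for unique_id, text in rows:
        words = text.split()
        for n in range(1, max_n + 1):
            for gram in ngrams(words, n):
                counts[(unique_id, gram)] += 1
    return counts


def write_counts(counts, out_file):
    """Write one CSV row per (unique_id, n-gram) pair with its frequency."""
    writer = csv.writer(out_file)
    writer.writerow(["Unique_id", "ngram", "frequency"])
    for (unique_id, gram), freq in sorted(counts.items()):
        writer.writerow([unique_id, gram, freq])


# Hypothetical sample row standing in for the Unique_ID, Text_Narrative data.
sample_rows = [("023345", "I love Python love Python Python")]
sample_counts = count_ngrams(sample_rows)
```

To run this on real files, the rows could come from csv.reader over a file opened with newline="", and write_counts could be given a file opened the same way for writing. Because each narrative is split on its own, no n-gram ever spans two IDs.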

Source

2017-08-14 15:23:58

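Thanks Uri, this code got me halfway there. Could you share how to adjust it so that I can run n-grams of 2 words, 3 words, and so on?

I have added the bigram and trigram computations; if this helps, please accept the answer. If you have any other requirements, please ask a new question.

As others have said, the actual question is vague, but since you are new here, this is a long-form walkthrough. :-)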

from collections import Counter

# Your starting input: a phrase with an ID.
# I added some extra words to show the count.
dict1 = {'023345': 'I love Python love Python Python'}

# Split the dict value into a list for counting.
dict1['023345'] = dict1['023345'].split()

# Use Counter to count.
countlist = Counter(dict1['023345'])

# countlist is now "Counter({'Python': 3, 'love': 2, 'I': 1})"

# If you want to output it like you requested, iterate over the dict.
# (Python 3 syntax: items() and print() replace iteritems() and the print statement.)
for id1 in dict1:
    for token, count in countlist.items():
        print(id1, token, count)
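The walkthrough above handles a single ID. As a rough sketch of how the same Counter approach might extend to several IDs, the example below keeps one Counter per ID; the narratives dict and its second entry are invented for illustration.

```python
from collections import Counter

# Hypothetical input: several IDs mapped to their narrative text.
narratives = {
    "023345": "I love Python love Python Python",
    "023346": "Python love",
}

# One Counter per ID keeps the frequencies of different IDs separate.
counts_by_id = {uid: Counter(text.split()) for uid, text in narratives.items()}

for uid, counter in counts_by_id.items():
    # most_common() yields (token, count) pairs, highest count first.
    for token, freq in counter.most_common():
        print(uid, token, freq)
```

Keeping the counters in a dict keyed by ID avoids reusing one countlist for every ID, which would only be correct when there is exactly one entry.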

Source

2017-11-10 23:32:24
