Create a dictionary of the words in a string, whose values are the words that follow each word

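I want to create a dictionary from a text file, using each unique word as a key and, as its value, a dictionary of the words that follow it together with their counts. For example, something that looks like this: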

>>> string = 'This is a string'
>>> word_counts(string)
{'this': {'is': 1}, 'is': {'a': 1}, 'a': {'string': 1}}

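Creating the dictionary of unique words is no problem; what I'm stuck on is building the dictionary of following words for each value. Since words can be repeated, I can't use list.index(). Beyond that, I'm a bit at a loss.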

2016-08-15 Grr

Actually, the collections.Counter class isn't always the best choice for counting things. You can use collections.defaultdict:

from collections import defaultdict

def bigrams(text):
    # Normalise case and split on whitespace.
    words = text.strip().lower().split()
    # counter[prev][current] counts how often `current` follows `prev`.
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in zip(words[:-1], words[1:]):
        counter[prev][current] += 1
    return counter

Note that if your text also contains punctuation, the line words = text.strip().lower().split() should be replaced with words = re.findall(r'\w+', text.lower()) (which requires import re).
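As a minimal sketch of that substitution (the function name bigrams_with_punctuation is hypothetical, chosen only for this illustration), the regex-based tokenisation could look like this:

```python
import re
from collections import defaultdict

def bigrams_with_punctuation(text):
    # Hypothetical variant of bigrams(): tokenise with a regex so that
    # punctuation such as "." or "?" is not glued onto the words.
    words = re.findall(r'\w+', text.lower())
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in zip(words, words[1:]):
        counter[prev][current] += 1
    return counter

counts = bigrams_with_punctuation('This is a string. Is this a string?')
print(dict(counts['is']))      # the words seen after "is", with counts
print(dict(counts['string']))  # "string." and "string?" are both counted as "string"
```

Keep in mind that \w+ also splits on apostrophes, so a word like "don't" becomes the two tokens "don" and "t"; a different pattern may be needed if that matters for your text.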

If your text is large enough that performance becomes an issue, consider the pairwise recipe from the itertools docs or, if you are on Python 2, use itertools.izip instead of zip.

2016-08-15 03:55:09 gukoff

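A nice solution, but note that changing 'zip' to 'izip' significantly reduces the running time for large collections of words (a reduction of ~45% for a list of 1M entries) – FujiApple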

Sure, yes, in the case of Python 2. – gukoff

You can use Counter to achieve what you want:

from collections import Counter, defaultdict

def get_tokens(string):
    return string.split()  # put whatever token-parsing algorithm you want here

def word_counts(string):
    tokens = get_tokens(string)
    following_words = defaultdict(list)
    for i, token in enumerate(tokens):
        if i:  # skip the first token, which has no predecessor
            following_words[tokens[i - 1]].append(token)
    # .items() works on both Python 2 and Python 3
    return {token: Counter(words) for token, words in following_words.items()}

string = 'this is a string'
print(word_counts(string))  # {'this': Counter({'is': 1}), 'is': Counter({'a': 1}), 'a': Counter({'string': 1})}
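One thing Counter values give you beyond plain dictionaries is the most_common() method. The following is only a sketch, assuming Python 3 and a plain split() tokenizer, of how the most frequent follower of a word could be looked up:

```python
from collections import Counter, defaultdict

def word_counts(string):
    # Same idea as the word_counts above; tokens come from a plain split().
    tokens = string.split()
    following_words = defaultdict(list)
    for prev, current in zip(tokens, tokens[1:]):
        following_words[prev].append(current)
    return {token: Counter(words) for token, words in following_words.items()}

counts = word_counts('is a is a is not')
# most_common(1) returns a list holding the single most frequent (word, count) pair.
most_likely = counts['is'].most_common(1)
print(most_likely)
# A Counter also returns 0 for a follower that was never seen.
print(counts['is']['missing'])
```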

2016-08-15 03:48:51 Karin

Just to offer an alternative (I think the other answers suit your needs better), you could use the pairwise recipe from itertools:

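from itertools import tee

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)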

The function can then be written as:

from collections import defaultdict

def word_counts(string):
    words = string.split()
    result = defaultdict(lambda: defaultdict(int))
    for word1, word2 in pairwise(words):
        result[word1][word2] += 1
    return result

Test:

string = 'This is a string is not an int is a string'
print(word_counts(string))

Which essentially produces:

{'a': {'string': 2}, 'string': {'is': 1}, 'This': {'is': 1}, 'is': {'a': 2, 'not': 1}, 'an': {'int': 1}, 'int': {'is': 1}, 'not': {'an': 1}}
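Printing the nested defaultdict directly shows the defaultdict wrappers around that content. As a sketch, assuming Python 3 (itertools.pairwise exists from Python 3.10; the fallback below is the same recipe), the result can be converted to plain dictionaries before returning it:

```python
from collections import defaultdict

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # Fallback: the itertools recipe, s -> (s0,s1), (s1,s2), ...
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

def word_counts(string):
    result = defaultdict(lambda: defaultdict(int))
    for word1, word2 in pairwise(string.split()):
        result[word1][word2] += 1
    # Convert the nested defaultdicts to plain dicts so the result prints cleanly.
    return {word: dict(followers) for word, followers in result.items()}

counts = word_counts('This is a string is not an int is a string')
print(counts)
```

The conversion step is optional; it only matters if you want the printed form, or want lookups of unknown words to raise KeyError instead of silently creating empty entries.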

2016-08-15 04:02:04 FujiApple
