NLTK makes it easy to compute bigrams of words. What about letters?

I've seen plenty of documentation around the web on how Python's NLTK makes it easy to compute bigrams of words. What about letters?

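What I want to do is feed in a dictionary (a word list) and have it tell me the relative frequencies of the different letter pairs.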

Ultimately I'd like to build some kind of Markov process to generate plausible-looking (but fake) words.

2013-01-05 isthmuses

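What you can do is simply take your string of words, have your tokenizer tokenize it by letter rather than by word, and then run your bigram model on that set of letter tokens. – jdotjdot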
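A minimal sketch of that idea in plain Python, assuming the input is a list of words rather than running text. Each word is split into letter tokens and adjacent tokens are paired, so no pair spans two words. The name `letter_bigrams` and the sample words are made up for illustration.

```python
from collections import Counter

def letter_bigrams(words):
    """Count adjacent letter pairs inside each word (hypothetical helper)."""
    counts = Counter()
    for word in words:
        letters = list(word.lower())              # tokenize by letter, not by word
        counts.update(zip(letters, letters[1:]))  # pair each letter with the next one
    return counts

counts = letter_bigrams(["banana", "bandana"])
print(counts.most_common(3))
```

Keys here are `(letter, letter)` tuples; joining them with `''.join(pair)` gives two-character strings if that is more convenient.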

Here is an example using Counter from the collections module (it gives raw counts rather than a relative frequency distribution):

#!/usr/bin/env python

import sys
from collections import Counter
from itertools import islice
from pprint import pprint

def split_every(n, iterable):
    """Yield consecutive, non-overlapping chunks of n characters."""
    i = iter(iterable)
    piece = ''.join(islice(i, n))
    while piece:
        yield piece
        piece = ''.join(islice(i, n))

def main(text):
    """Return a Counter of the n-character chunks in text."""
    freqs = Counter()
    for pair in split_every(2, text):  # adjust n here
        freqs[pair] += 1
    return freqs

if __name__ == '__main__':
    with open(sys.argv[1]) as handle:
        freqs = main(handle.read())
    pprint(freqs.most_common(10))

Usage:

$ python 14168601.py lorem.txt 
[('t ', 32), 
(' e', 20), 
('or', 18), 
('at', 16), 
(' a', 14), 
(' i', 14), 
('re', 14), 
('e ', 14), 
('in', 14), 
(' c', 12)]
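The script above reports raw counts. If relative frequencies are wanted, as the question asks, one option is to divide each count by the total. A small sketch, where `relative_freqs` is a hypothetical helper and not part of the script above:

```python
from collections import Counter

def relative_freqs(counts):
    """Turn raw counts into fractions that sum to 1 (hypothetical helper)."""
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing counted, so there is nothing to normalize
    return {key: n / total for key, n in counts.items()}

freqs = relative_freqs(Counter({"th": 3, "he": 1}))
print(freqs)
```

This assumes Python 3, where `/` is true division; on Python 2 the division would need `float(total)`.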


2013-01-05 04:44:07 miku

If bigrams are all you need, you don't need NLTK. You can simply do the following:

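from collections import Counter
text = "This is some text"
# zip the text with itself shifted by one character to get overlapping pairs
bigrams = Counter(x+y for x, y in zip(*[text[i:] for i in range(2)]))
for bigram, count in bigrams.most_common():
    print(bigram, count)  # order among equal counts may vary by Python version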

Output:

is 2
s  2
me 1
om 1
te 1
 t 1
 i 1
e  1
 s 1
hi 1
so 1
ex 1
Th 1
xt 1
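For the Markov-process goal in the question, letter-pair counts like these can serve as transition weights. A rough sketch under some assumptions: each word is padded with `^` and `$` as start and end markers (my choice, not something from the question), and `build_transitions` and `generate_word` are hypothetical names. The sample word list is made up.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed boundary markers; must not occur in the words

def build_transitions(words):
    """Map each letter to a Counter of the letters that follow it."""
    transitions = defaultdict(Counter)
    for word in words:
        padded = START + word.lower() + END
        for current, following in zip(padded, padded[1:]):
            transitions[current][following] += 1
    return transitions

def generate_word(transitions, rng=random, max_len=20):
    """Walk the chain from START, sampling each next letter by its count."""
    letters = []
    current = START
    while len(letters) < max_len:
        choices = transitions[current]
        if not choices:  # no training data for this state
            break
        current = rng.choices(list(choices), weights=list(choices.values()))[0]
        if current == END:
            break
        letters.append(current)
    return "".join(letters)

transitions = build_transitions(["isthmus", "markov", "letter", "bigram"])
print(generate_word(transitions, random.Random(0)))
```

This is a first-order chain, so each letter depends only on the one before it. Conditioning on the previous two letters tends to give more word-like output, at the cost of needing a larger word list.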


2013-01-06 14:21:17 vpekar
