I tested this in a virtualenv and it works:
In [20]: from nltk import bigrams
In [21]: list(bigrams('This is a test'))
Out[21]:
[('T', 'h'),
('h', 'i'),
('i', 's'),
('s', ' '),
(' ', 'i'),
('i', 's'),
('s', ' '),
(' ', 'a'),
('a', ' '),
(' ', 't'),
('t', 'e'),
('e', 's'),
('s', 't')]
Is that the only error you get?
By the way, as for your second question:
from collections import Counter
In [44]: b = bigrams('This is a test')
In [45]: Counter(b)
Out[45]: Counter({('i', 's'): 2, ('s', ' '): 2, ('a', ' '): 1, (' ', 't'): 1, ('e', 's'): 1, ('h', 'i'): 1, ('t', 'e'): 1, ('T', 'h'): 1, (' ', 'i'): 1, (' ', 'a'): 1, ('s', 't'): 1})
For words:
In [49]: b = list(bigrams("This is a test".split(' ')))
In [50]: b
Out[50]: [('This', 'is'), ('is', 'a'), ('a', 'test')]
In [51]: Counter(b)
Out[51]: Counter({('is', 'a'): 1, ('a', 'test'): 1, ('This', 'is'): 1})
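Splitting into words this way is obviously quite crude, but depending on your application it may be enough. You can of course use nltk's tokenization instead, which is far more sophisticated.
To get to your end goal, you can do something like this: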
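To show what I mean by crude, here is a minimal sketch using only the standard library. The helper names `simple_tokenize` and `word_bigrams` and the regular expression are my own assumptions for illustration; this is not nltk's tokenizer, just a stand-in that shows how punctuation stuck to a word changes the bigram counts.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def word_bigrams(tokens):
    # Pair each token with the token that follows it.
    return list(zip(tokens, tokens[1:]))

text = "This is a test. This is only a test."

# Plain split keeps the trailing period glued to "test."
naive = word_bigrams(text.split(' '))
# The regex tokenizer drops punctuation and normalizes case
cleaned = word_bigrams(simple_tokenize(text))

naive_counts = Counter(naive)
cleaned_counts = Counter(cleaned)

print(naive_counts[('a', 'test.')], naive_counts[('a', 'test')])
print(cleaned_counts[('a', 'test')], cleaned_counts[('this', 'is')])
```

With plain `split(' ')` the pair is counted under `('a', 'test.')`, so a lookup for `('a', 'test')` finds nothing. Whether that matters depends on what you do with the counts.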
In [56]: d = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
In [56]: from nltk import trigrams
In [57]: tri = trigrams(d.split(' '))
In [60]: counter = Counter(tri)
In [61]: import random
In [62]: random.sample(list(counter), 5)
Out[62]:
[('Ipsum', 'has', 'been'),
('industry.', 'Lorem', 'Ipsum'),
('Ipsum', 'passages,', 'and'),
('was', 'popularised', 'in'),
('galley', 'of', 'type')]
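I trimmed the output because it was unnecessarily large, but you get the idea.
Thanks for the response. I don't know what I did, but it's importing now.. hmm... The problem now is that I need bigrams per word, not per letter, so that I can compute things per word.. how can I do that? I also still need to figure out how to generate random text from the ngrams that resembles the original corpus, and then compute based on keyword bigrams (as well as tri and quad). – user1378618
Updated... please see my answer. –
I see it now. Unfortunately, the "news" used in the program above isn't a type I can call .split on. The error I get: AttributeError: 'ConcatenatedCorpusView' object has no attribute 'split'. How can I use my altered version of news with the comments and separate it into bigrams, tri etc.? Edit: OK, let me look at this for a second – user1378618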
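If you want to see what is going on underneath, the same counting can be sketched with the standard library alone. The `ngrams` helper below is a hypothetical stand-in I wrote for illustration, not nltk's implementation, and the text is a shortened piece of the sample above. Note that on Python 3 `random.sample` wants a sequence, which is why the keys are wrapped in `list(...)`.

```python
import random
from collections import Counter

def ngrams(tokens, n):
    # Hypothetical helper: slide a window of size n over the token list.
    return list(zip(*(tokens[i:] for i in range(n))))

text = ("Lorem Ipsum is simply dummy text of the printing and typesetting "
        "industry. Lorem Ipsum has been the industry's standard dummy text")
tokens = text.split(' ')

bi_counts = Counter(ngrams(tokens, 2))
tri_counts = Counter(ngrams(tokens, 3))

# The two most frequent bigrams with their counts
top = bi_counts.most_common(2)

# A few distinct trigrams picked at random; list() turns the keys into a sequence
picked = random.sample(list(tri_counts), 3)

print(top)
print(picked)
```

The same `ngrams(tokens, 4)` call would give you the quad case; if the token list is shorter than `n`, the result is simply empty.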