I'm trying to find the k most common n-grams in a large corpus. I've seen the naive approach suggested in many places - simply scan through the whole corpus and keep a dictionary of the counts of all n-grams. Is there a better way to do this? Is there a more efficient way to find the most common n-grams?
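For concreteness, the naive approach mentioned above might look roughly like this (a minimal sketch assuming whitespace-tokenised word n-grams and a corpus that fits in memory; big.txt is just a placeholder file name):

from collections import Counter

def top_k_ngrams(tokens, n, k):
    # Slide a window of length n over the token list and count every n-gram.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # most_common(k) returns the k highest-count n-grams with their counts.
    return counts.most_common(k)

tokens = open('big.txt').read().split()
print(top_k_ngrams(tokens, 2, 10))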
In Python, using NLTK:
$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
In Python, natively (see Fast/Optimize N-gram implementations in python):
>>> def ngrams(text, n=2):
...     return zip(*[text[i:] for i in range(n)])
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
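As an aside, this same ngrams helper works for character n-grams as well as word n-grams, since zip accepts any sequence:

>>> list(ngrams('hello', 2))                  # character bigrams
[('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
>>> list(ngrams('to be or not'.split(), 2))   # word bigrams
[('to', 'be'), ('be', 'or'), ('or', 'not')]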
In Julia, see Generate ngrams with Julia:
import StatsBase: countmap
import Iterators: partition
bigtxt = readstring(open("big.txt"))
ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))
Rough timings:
$ time python ngram-test.py # With NLTK.
real 0m3.166s
user 0m2.274s
sys 0m0.528s
$ time python ngram-native-test.py
real 0m1.521s
user 0m1.317s
sys 0m0.145s
$ time julia ngram-test.jl
real 0m3.573s
user 0m3.188s
sys 0m0.306s
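(The timing scripts themselves are not shown; ngram-test.py is assumed to be just the NLTK snippet above saved as a script, roughly:)

# ngram-test.py -- assumed contents: the NLTK snippet above, run end to end.
from collections import Counter
from nltk import ngrams

bigtxt = open('big.txt').read()
ngram_counts = Counter(ngrams(bigtxt.split(), 2))
print(ngram_counts.most_common(10))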
Possible duplicate of http://cs.stackexchange.com/questions/8972/optimal-algorithm-for-finding-all-ngrams-from-a-pre-defined-set-in-a-text – pltrdy
What are you comparing against? How large is the corpus? I think you can count the ngrams of a huge corpus quite quickly without resorting to C++; it's reasonably fast even in Python =) – alvas
Do you mean character ngrams or word ngrams? – alvas