Training and evaluating bigram/trigram distributions with NgramModel in nltk, using Witten Bell Smoothing

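I would like to train an NgramModel on one set of sentences, using Witten-Bell smoothing to estimate the unseen ngrams, and then use it to get the log likelihood of a test set generated by that distribution. I want to do almost the same thing as the documentation example here: http://nltk.org/_modules/nltk/model/ngram.html, but with Witten-Bell smoothing instead. Here is some toy code that attempts what I want to do: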

from nltk.probability import WittenBellProbDist 
from nltk import NgramModel 

est = lambda fdist, bins: WittenBellProbDist(fdist) 
fake_train = [str(t) for t in range(3000)] 
fake_test = [str(t) for t in range(2900, 3010)] 

lm = NgramModel(2, fake_train, estimator = est) 

print lm.entropy(fake_test)

Unfortunately, when I try to run it, I get the following error:

Traceback (most recent call last): 
    File "ngram.py", line 8, in <module> 
    lm = NgramModel(2, fake_train, estimator = est) 
    File "/usr/lib/python2.7/dist-packages/nltk/model/ngram.py", line 63, in __init__ 
    self._model = ConditionalProbDist(cfd, estimator, len(cfd)) 
    File "/usr/lib/python2.7/dist-packages/nltk/probability.py", line 2016, in __init__ 
    **factory_kw_args) 
    File "ngram.py", line 4, in <lambda> 
    est = lambda fdist, bins: WittenBellProbDist(fdist) 
    File "/usr/lib/python2.7/dist-packages/nltk/probability.py", line 1210, in __init__ 
    self._P0 = self._T/float(self._Z * (self._N + self._T)) 
ZeroDivisionError: float division by zero

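What is causing this error? As far as I can tell, I am using everything correctly according to the documentation, and the same approach works fine when I use Lidstone instead of Witten-Bell.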

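As a second question, my data is a collection of disjoint sentences. How can I use the sentences the way I would use a list of strings, or do something equivalent that produces the same distribution? (That is, I could of course use a single list that holds all the sentences with a dummy token separating consecutive sentences, but that would not produce the same distribution.) The documentation says in one place that a list of lists of strings is allowed, but I then found a bug report in which the documentation was supposedly edited to reflect that this is not allowed (and when I simply try a list of lists of strings, I get an error).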


2013-03-29 DJLamar

Thanks for the answers, everyone. I ended up going with SRILM, since that code is actually complete and appears to be correct... – DJLamar

This has apparently been a known issue for almost 3 years. The ZeroDivisionError is caused by the following lines in __init__,

if bins == None: 
    bins = freqdist.B() 
self._freqdist = freqdist 
self._T = self._freqdist.B() 
self._Z = bins - self._freqdist.B()

Whenever the bins argument is not specified, it defaults to None, so self._Z is really just freqdist.B() - freqdist.B(), and

self._P0 = self._T/float(self._Z * (self._N + self._T))

reduces to,

self._P0 = freqdist.B()/0.0
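The arithmetic can be reproduced without NLTK. The following is a minimal sketch, not NLTK's code: `witten_bell_p0` is a hypothetical helper that mirrors the quoted lines, a `Counter` stands in for the frequency distribution, and `n` is assumed to be the total token count (the role of `self._N`).

```python
from collections import Counter

def witten_bell_p0(counts, bins=None):
    """Mirror the quoted arithmetic: T / (Z * (N + T)). Hypothetical helper."""
    seen_types = len(counts)      # plays the role of freqdist.B()
    if bins is None:
        bins = seen_types         # the default that causes the trouble
    t = seen_types                # self._T: number of observed types
    z = bins - seen_types         # self._Z: number of unseen bins
    n = sum(counts.values())      # self._N: assumed to be the total token count
    return t / float(z * (n + t))

counts = Counter(["a", "b", "a", "c"])  # 3 types, 4 tokens

# With bins omitted, Z is 0 and the division fails.
try:
    witten_bell_p0(counts)
    default_bins_failed = False
except ZeroDivisionError:
    default_bins_failed = True

# With bins larger than the number of seen types, Z is positive.
p0_with_room = witten_bell_p0(counts, bins=5)  # Z = 2, so 3 / (2 * 7)
```

The division only succeeds when bins is strictly larger than the number of observed types, which matches the behaviour described here.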

In addition, if you specify bins as any value larger than freqdist.B(), then on executing this line of code,

print lm.entropy(fake_test)

you will receive a NotImplementedError, because the WittenBellProbDist class contains,

def discount(self): 
    raise NotImplementedError()

The discount method is apparently also used by the prob and logprob methods of the NgramModel class, so you will not be able to call those either.

One way to work around these problems without modifying NLTK would be to inherit from WittenBellProbDist and override the relevant methods.
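As a rough illustration of what such an override would need to do, here is a standalone sketch that runs without NLTK. `SafeWittenBell` is a hypothetical stand-in, not a real subclass of WittenBellProbDist. It makes two changes: it rejects the case where Z would be 0, and it defines `discount` instead of raising. Returning the mass reserved for unseen samples, T / (N + T), is an assumption about what a sensible discount is; whether NgramModel expects exactly that value is not confirmed here.

```python
from collections import Counter

class SafeWittenBell(object):
    """Hypothetical stand-in for a WittenBellProbDist subclass (not NLTK code)."""

    def __init__(self, counts, bins):
        seen = len(counts)
        if bins <= seen:
            # Refuse Z == 0 up front instead of dividing by zero later.
            raise ValueError("bins must exceed the number of seen types")
        self._counts = counts
        self._T = seen                   # number of observed types
        self._Z = bins - seen            # number of unseen bins
        self._N = sum(counts.values())   # total number of observations
        self._P0 = self._T / float(self._Z * (self._N + self._T))

    def prob(self, sample):
        c = self._counts.get(sample, 0)
        if c == 0:
            return self._P0                    # share given to each unseen sample
        return c / float(self._N + self._T)    # discounted estimate for seen samples

    def discount(self):
        # Assumption: report the total mass reserved for unseen samples.
        return self._T / float(self._N + self._T)

dist = SafeWittenBell(Counter("abac"), bins=5)   # seen: a, b, c; unseen: 2 bins
total = sum(dist.prob(s) for s in "abcde")       # should be very close to 1.0
```

In an actual subclass the same guard would go in `__init__` before calling the parent constructor, and `discount` would be the overridden method.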


2013-04-02 08:13:27 Jared

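I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n > 1. This applies to all estimators, including WittenBellProbDist and even LidstoneProbDist. I think this bug has been around for a few years, which suggests that this part of NLTK is not well tested.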

See: https://github.com/nltk/nltk/issues/367
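A quick way to check any n-gram model for this kind of problem is to sum its conditional probabilities over the whole vocabulary for a fixed context; a properly normalized model gives a total close to 1. The sketch below assumes nothing about NLTK: `total_mass` is a hypothetical helper, and the add-one bigram estimator exists only to exercise it.

```python
from collections import Counter

def total_mass(cond_prob, context, vocabulary):
    """Sum P(word | context) over the vocabulary; should be close to 1.0."""
    return sum(cond_prob(word, context) for word in vocabulary)

# Toy add-one bigram estimator, used only to exercise the check.
tokens = "a b a c a b".split()
vocabulary = sorted(set(tokens))
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def add_one_prob(word, context):
    numerator = bigram_counts[(context, word)] + 1.0
    denominator = context_counts[context] + len(vocabulary)
    return numerator / denominator

mass_after_a = total_mass(add_one_prob, "a", vocabulary)
```

To test a real model, pass a function that wraps its probability method in place of `add_one_prob`; a total well above 1 for some context would be consistent with the overestimation described here.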


2013-04-09 17:27:58 afourney

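That makes sense - I had already noticed that the conditional probabilities given a token do not add up to 1 when summed over all possible tokens. I was a little worried before I realized this, because all of a sudden the baseline was beating my more sophisticated methods, heh. – DJLamar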

A number of people have made some progress on this issue: the language modeling package is on its way back into NLTK, but we need folks to test it!

I wrote up how to use it in this answer.


2016-07-06 05:16:17
