2016-03-05

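Discount value in stupid backoff

I am following the NLP tutorial here (6'58''), in the part about the stupid backoff smoothing algorithm. Both the video tutorial and the implementation of bi-gram level stupid-backoff use a discount value of 0.4.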

Tutorial Slide

The bigram-level backoff implementation:

import math

def score(self, sentence):
    # Sum of log-probabilities over all bigrams in the sentence.
    score = 0.0
    previous = sentence[0]
    for token in sentence[1:]:
        bicount = self.bigramCounts[(previous, token)]
        bi_unicount = self.unigramCounts[previous]
        unicount = self.unigramCounts[token]
        if bicount > 0:
            # Bigram seen: count(previous, token) / count(previous)
            score += math.log(bicount)
            score -= math.log(bi_unicount)
        else:
            # Bigram unseen: back off to the add-one unigram estimate
            score += math.log(0.4)  # discount here
            score += math.log(unicount + 1)
            score -= math.log(self.total + self.vocab_size)
        previous = token
    return score
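
Written as a formula, the bigram code above appears to compute the following per-token score and then sum its logarithm over the sentence. This is only a restatement of the code, not taken from the tutorial: the notation c(·) for counts, N for self.total, and V for self.vocab_size is introduced here for illustration.

```latex
S(w_i \mid w_{i-1}) =
\begin{cases}
  \dfrac{c(w_{i-1}\, w_i)}{c(w_{i-1})} & \text{if } c(w_{i-1}\, w_i) > 0, \\[2ex]
  0.4 \cdot \dfrac{c(w_i) + 1}{N + V} & \text{otherwise,}
\end{cases}
\qquad
\text{score} = \sum_{i \ge 2} \log S(w_i \mid w_{i-1})
```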

But in the trigram-level implementation, the discount value is 1:

import math

def score(self, sentence):
    # Sum of log-probabilities over all trigrams in the sentence.
    score = 0.0
    fst = sentence[0]
    snd = sentence[1]
    for token in sentence[2:]:
        tricount = self.trigramCounts[(fst, snd, token)]
        tri_bicount = self.bigramCounts[(fst, snd)]
        bicount = self.bigramCounts[(snd, token)]
        bi_unicount = self.unigramCounts[snd]
        unicount = self.unigramCounts[token]
        if tricount > 0:
            # Trigram seen: count(fst, snd, token) / count(fst, snd)
            score += math.log(tricount)
            score -= math.log(tri_bicount)
        elif bicount > 0:
            # Back off to the bigram estimate
            score += math.log(bicount)  # no discount here
            score -= math.log(bi_unicount)
        else:
            # Back off to the add-one unigram estimate
            score += math.log(unicount + 1)  # no discount here
            score -= math.log(self.total + self.vocab_size)
        fst, snd = snd, token
    return score

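When I run the project with the trigram-level discount set to 0.4 and to 1, the scores are ordered as follows:

tri-gram with discount = 0.4 < bi-gram with discount = 0.4 < tri-gram with discount = 1

The reason is easy to see: every log(0.4) term is negative, and with discount = 0.4 the final else branch of the trigram version becomes: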

else:
    score += math.log(0.4)  # -> -0.9163 (math.log is the natural log; -0.3979 is log10(0.4))
    score += math.log(0.4)  # -> -0.9163
    score += math.log(unicount + 1)
    score -= math.log(self.total + self.vocab_size)

So I am really confused: where does the value 0.4 come from?
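
The ordering above can be reproduced with a small self-contained sketch. It is not the project's code: the function names `build_counts` and `trigram_score`, the `alpha` parameter, and the toy corpus are all hypothetical, and it assumes the discount is multiplied in once per backoff step (once in the bigram branch, twice in the unigram branch), as in the modified else branch shown above.

```python
import math
from collections import Counter


def build_counts(corpus):
    """Count unigrams, bigrams and trigrams in a list of token lists."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
        tri.update(zip(sent, sent[1:], sent[2:]))
    return uni, bi, tri


def trigram_score(sentence, uni, bi, tri, alpha=0.4):
    """Log-score of a sentence; alpha is applied once per backoff step."""
    total = sum(uni.values())   # plays the role of self.total
    vocab_size = len(uni)       # plays the role of self.vocab_size
    score = 0.0
    fst, snd = sentence[0], sentence[1]
    for token in sentence[2:]:
        if tri[(fst, snd, token)] > 0:
            score += math.log(tri[(fst, snd, token)] / bi[(fst, snd)])
        elif bi[(snd, token)] > 0:
            # One backoff step: trigram -> bigram
            score += math.log(alpha)
            score += math.log(bi[(snd, token)] / uni[snd])
        else:
            # Two backoff steps: trigram -> bigram -> unigram
            score += 2 * math.log(alpha)
            score += math.log((uni[token] + 1) / (total + vocab_size))
        fst, snd = snd, token
    return score


# Toy data (made up for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi, tri = build_counts(corpus)
test_sentence = ["the", "cat", "ran"]  # ("cat", "ran") and "ran" are unseen

discounted = trigram_score(test_sentence, uni, bi, tri, alpha=0.4)
undiscounted = trigram_score(test_sentence, uni, bi, tri, alpha=1.0)
print(discounted, undiscounted)
```

In this toy case the only trigram falls through to the unigram branch, so the two scores differ by exactly 2 * math.log(0.4). Since that term is negative, any sentence that triggers a backoff scores lower with alpha = 0.4 than with alpha = 1.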

The 0.4 in stupid backoff? – user3639557

@user3639557 Yes, but I don't know why it is 0.4, or why the trigram example does not use this discount. – user3448806

It is quite arbitrary, which is why they call it stupid backoff. Read the paper cited in the answer below. – user3639557

Answers