
Moses is the software used to build machine translation models, and KenLM is the de facto language model software that Moses uses. How do you tune a machine translation model that has a huge language model?

I have a 16GB text file that I use to build a language model like this:

bin/lmplz -o 5 <text > text.arpa 
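(As an aside, lmplz also has options for limiting memory use and pruning rare n-grams that can help at this scale. A rough sketch, assuming the -S, -T and --prune flags are available in your KenLM build; the sizes quoted below come from the plain command above, without pruning:)

# Hedged sketch: cap RAM use, keep temporary files on a large disk, and drop
# rare higher-order n-grams (check lmplz's own help for the exact threshold
# semantics) to shrink the resulting ARPA file.
bin/lmplz -o 5 -S 80% -T /tmp --prune 0 0 1 1 1 < text > text.arpa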

The resulting file (text.arpa) is 38GB. Then I binarize the language model like this:

bin/build_binary text.arpa text.binary 

and the binarized language model (text.binary) grows to 71GB.
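(For what it's worth, build_binary can also be told to use the trie data structure with quantization, which usually gives a much smaller binary than the default probing layout. A sketch, assuming these options exist in your build; run build_binary with no arguments to check:)

# Hedged sketch: trie layout with 8-bit quantized probabilities and backoffs.
bin/build_binary -q 8 -b 8 trie text.arpa text.binary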

In Moses, after training the translation model, you are supposed to tune the model weights using the MERT algorithm. This can be done simply with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.

MERT works fine with a small language model, but with a large language model it takes quite a long time to finish.
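(For reference, a typical mert-moses.pl invocation looks roughly like the sketch below; dev.fr, dev.en and model/moses.ini are hypothetical placeholders for the tuning set and decoder config:)

# Hedged sketch: tune feature weights on a held-out dev set. Each MERT
# iteration re-decodes the whole dev set, which is where a huge LM hurts.
~/moses/scripts/training/mert-moses.pl dev.fr dev.en \
    ~/moses/bin/moses model/moses.ini \
    --mertdir ~/moses/bin/ --working-dir mert-work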

I did a Google search and found KenLM's filter, which filters the language model down to a smaller size: https://kheafield.com/code/kenlm/filter/

But I have no clue how to make it work. The command's help output says:

$ ~/moses/bin/filter 
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file 

copy mode just copies, but makes the format nicer for e.g. irstlm's broken 
    parser. 
single mode treats the entire input as a single sentence. 
multiple mode filters to multiple sentences in parallel. Each sentence is on 
    a separate line. A separate file is created for each sentence by appending 
    the 0-indexed line number to the output file name. 
union mode produces one filtered model that is the union of models created by 
    multiple mode. 

context means only the context (all but last word) has to pass the filter, but 
    the entire n-gram is output. 

phrase means that the vocabulary is actually tab-delimited phrases and that the 
    phrases can generate the n-gram when assembled in arbitrary order and 
    clipped. Currently works with multiple or union mode. 

The file format is set by [raw|arpa] with default arpa: 
raw means space-separated tokens, optionally followed by a tab and arbitrary 
    text. This is useful for ngram count files. 
arpa means the ARPA file format for n-gram language models. 

threads:m sets m threads (default: conccurrency detected by boost) 
batch_size:m sets the batch size for threading. Expect memory usage from this 
    of 2*threads*batch_size n-grams. 

There are two inputs: vocabulary and model. Either may be given as a file 
    while the other is on stdin. Specify the type given as a file using 
    vocab: or model: before the file name. 

For ARPA format, the output must be seekable. For raw format, it can be a 
    stream i.e. /dev/stdout 

But when I do the following, I get stuck and don't know what to do next:

$ ~/moses/bin/filter union lm.en.binary lm.filter.binary 
Assuming that lm.en.binary is a model file 
Reading lm.en.binary 
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100 

What should I be doing with the binarized language model? Are there other steps to manipulate large language models so as to reduce the computational load when tuning?

What is the usual way to tune with large LM files?

How do I use KenLM's filter?

(More details at https://www.mail-archive.com/[email protected]/msg12089.html)


Are you sure it is the language model that is making MERT slow? I am quite new to SMT, but for some reason I would have expected the translation model to be the bigger one. That can be fixed with 'training/filter-model-given-input.pl'. – scozy


Yes, it is the big language model that makes MERT slow. I have already tried LMs of various sizes. – alvas

Answer


To answer how to use KenLM's filter:

cat small_vocabulary_one_word_per_line.txt \ 
    | filter single \ 
     "model:LM_large_vocab.arpa" \ 
      output_LM_small_vocab. 

In the filter command, single can be replaced with union or copy. If you run the filter binary without arguments, read the help that it prints.
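Putting that together for the setup in the question: the filter's help lists raw and ARPA formats, so it is easiest to point it at text.arpa rather than the binarized model, filter it against the tuning set, and binarize the smaller result just for MERT. A sketch with hypothetical file names (dev.en stands in for the target side of the tuning set):

# Hedged sketch: build a vocabulary from the tuning data, filter the ARPA
# model to that vocabulary, then binarize the much smaller result for tuning.
tr ' ' '\n' < dev.en | sort -u > dev.vocab
~/moses/bin/filter single model:text.arpa text.filtered.arpa < dev.vocab
~/moses/bin/build_binary text.filtered.arpa text.filtered.binary

# Alternatively, union mode reads the tuning sentences directly (one per line)
# and outputs a single filtered model.
~/moses/bin/filter union model:text.arpa text.union.arpa < dev.en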