background_corpus = TextCorpus('wiki.en.text')
This is a 10 GB file, and while building the corpus and adding its documents to the dictionary, it prints this:
adding document #820000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'billycorgan', u'olmsville']...)
discarding 31072 tokens: [(u'vnsas', 1), (u'ezequeel', 1), (u'trapeztafel', 1), (u'pubsub', 1), (u'gyvenimas', 1), (u'gilibrand', 1), (u'catfaced', 1), (u'beuningan', 1), (u'moodadi', 1), (u'nocaster', 1)]...
keeping 2000000 tokens which were in no less than 0 and no more than 830000 (=100.0%) documents
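So it is discarding new tokens while building the corpus, because the dictionary's maximum size is 2000000. Is there any way to raise or remove this limit on the dictionary size?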