How can I increase the gensim dictionary size while building a corpus?

I build the corpus with:

from gensim.corpora.textcorpus import TextCorpus
background_corpus = TextCorpus('wiki.en.text')

The input is a 10 GB file. While the corpus is being built and its documents are added to the dictionary, gensim logs this:

adding document #820000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'billycorgan', u'olmsville']...) 

discarding 31072 tokens: [(u'vnsas', 1), (u'ezequeel', 1), (u'trapeztafel', 1), (u'pubsub', 1), (u'gyvenimas', 1), (u'gilibrand', 1), (u'catfaced', 1), (u'beuningan', 1), (u'moodadi', 1), (u'nocaster', 1)]... 

keeping 2000000 tokens which were in no less than 0 and no more than 830000 (=100.0%) documents

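So it discards new tokens while building the corpus, because the dictionary's maximum size is 2000000. Is there any way to remove the limit on the dictionary size?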


2016-05-31 user3481478

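This is explained at https://radimrehurek.com/gensim/corpora/dictionary.html. The prune_at parameter defaults to 2000000; depending on which function you use, you can pass None instead so that no tokens are discarded. Edit: in gensim/corpora/dictionary.py (in the __init__ function, line 45 in the current version), you can set prune_at = None or set your own limit (for example, prune_at = 5000000 for a cap of 5,000,000 tokens).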


2017-05-10 11:31:18
