Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I am just getting started with some Python tasks, and I ran into a problem while using gensim. I want to load files from my hard disk and process them (split them and lowercase them with lower()).

My code is as follows:

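import glob
import os
from gensim import corpora

dictionary_arr = []  # ends up as one flat list of words from all files
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)  # this line raises the TypeError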

The list (dictionary_arr) contains all the words from all the files, and I use gensim's corpora.Dictionary to process that list. However, I get an error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I can't work out what the problem is. Any guidance would be much appreciated.


2015-10-20 Sam

In dictionary.py, the initialization function is:

def __init__(self, documents=None):
    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

    if documents is not None:
        self.add_documents(documents)

The add_documents function builds the dictionary from a collection of documents. Each document is a list of tokens:

def add_documents(self, documents):

    for docno, document in enumerate(documents):
        if docno % 10000 == 0:
            logger.info("adding document #%i to %s" % (docno, self))
        _ = self.doc2bow(document, allow_update=True)  # ignore the result, here we only care about updating token ids
    logger.info("built %s from %i documents (total %i corpus positions)" %
                (self, self.num_docs, self.num_pos))
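To make the failure concrete, here is a minimal sketch of what happens when add_documents iterates over a flat list of words. The function doc2bow_sketch is a hypothetical, simplified stand-in written for illustration; it is not gensim's actual implementation and only mimics the input check that produces the error message from the question.

```python
def doc2bow_sketch(document):
    """Simplified stand-in for the input check; not gensim's real code."""
    if isinstance(document, str):
        raise TypeError(
            "doc2bow expects an array of unicode tokens on input, "
            "not a single string"
        )
    # Count how often each token occurs in this one document.
    counts = {}
    for token in document:
        counts[token] = counts.get(token, 0) + 1
    return sorted(counts.items())


flat_words = ["hello", "world"]     # shape used in the question: a flat list of words
token_lists = [["hello", "world"]]  # expected shape: a list of documents

for document in token_lists:
    doc2bow_sketch(document)        # fine: each document is a list of tokens

try:
    for document in flat_words:
        doc2bow_sketch(document)    # each "document" is a single string
except TypeError as error:
    print(error)
```

Iterating over the flat list hands the function one string at a time, which is the case the check rejects.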

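So if you initialize the Dictionary this way, you must pass in a collection of documents (a list of token lists), not a single document. For example,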

dic = corpora.Dictionary([a.split()])

is OK.
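Applied to the code in the question, one possible fix is to keep one token list per file instead of appending every word to a single flat list. The sketch below assumes each .txt file should count as one document; the helper names read_tokenized_documents and build_dictionary are hypothetical, and the utf-8 encoding is an assumption about the files.

```python
import glob
import os


def read_tokenized_documents(path):
    """Return one list of lowercased tokens per .txt file under `path`."""
    documents = []
    for file_path in sorted(glob.glob(os.path.join(path, "*.txt"))):
        # Assumption: the files are utf-8 encoded text.
        with open(file_path, "r", encoding="utf-8") as handle:
            documents.append(handle.read().lower().split())
    return documents


def build_dictionary(path):
    """Build a gensim Dictionary from the per-file token lists."""
    # Imported here so the helper above also works without gensim installed.
    from gensim import corpora
    return corpora.Dictionary(read_tokenized_documents(path))
```

Calling build_dictionary(path) then passes a list of lists to corpora.Dictionary, which is the shape add_documents iterates over.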


2015-10-20 07:37:13 wyq10

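Hi wyq10, I tried this method and it seems to work, but there is one small problem. The count (frequency) of every token in the dictionary stays the same, i.e. 1, even though many tokens occur more than once – Sam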
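A likely explanation, going by the comment in the __init__ source above, is that dfs stores document frequencies: the number of documents a token appears in, not how many times it occurs. If all the text is passed as a single document, every token appears in exactly one document, so every count is 1. The sketch below uses only collections.Counter to show the difference between the two kinds of count; count_frequencies is a hypothetical helper, not part of gensim.

```python
from collections import Counter


def count_frequencies(documents):
    """Return (term_counts, document_counts) for a list of token lists."""
    term_counts = Counter()
    document_counts = Counter()
    for tokens in documents:
        term_counts.update(tokens)           # every occurrence counts
        document_counts.update(set(tokens))  # at most once per document
    return term_counts, document_counts


documents = [["a", "b", "a"], ["a", "c"]]
term_counts, document_counts = count_frequencies(documents)
print(term_counts["a"], document_counts["a"])  # prints: 3 2
```

Within gensim itself, dictionary.doc2bow(tokens) returns (token_id, count) pairs for one document, so summing those counts over all documents should give per-token occurrence totals.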
