doc2vec - Input format for doc2vec training and infer_vector() in Python
In gensim, when I give a string as input for training a doc2vec model, I get this error:
TypeError('don\'t know how to handle uri %s' % repr(uri))
I referred to this question, Doc2vec : TaggedLineDocument(), but I still have a doubt about the input format.
documents = TaggedLineDocument('myfile.txt')
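Should myfile.txt contain the tokens as a list of lists, as a separate list on each line for each document, or as a plain string?
For example, I have 2 documents.
Doc 1: Machine learning is a subfield of computer science that evolved from the study of pattern recognition.
Doc 2: Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn".
So, what should myfile.txt look like?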
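Case 1: simple text of each document on each line
Machine learning is a subfield of computer science that evolved from the study of pattern recognition
Arthur Samuel defined machine learning as a Field of study that gives computers the ability to learn
Case 2: a list of lists holding the tokens of each document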
[["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"],
 ["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers", "the", "ability", "to", "learn"]]
Case 3: the list of tokens of each document on a separate line
["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers", "the", "ability", "to", "learn"]
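To make the three candidate layouts concrete, here is a minimal sketch of how a line-per-document reader treats the file. It assumes, as the gensim documentation describes for TaggedLineDocument, that each line is one document, that tokens are separated by whitespace, and that the tag is the line number. `read_line_documents` and `TaggedDoc` are hypothetical stand-ins written in plain Python so the sketch runs without gensim; they are not gensim's own code.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, so the sketch runs without gensim.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def read_line_documents(path):
    """Yield one TaggedDoc per line: whitespace-split tokens, tagged with the line number."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            yield TaggedDoc(words=line.split(), tags=[line_no])


raw_docs = [
    "Machine learning is a subfield of computer science that evolved from the study of pattern recognition",
    "Arthur Samuel defined machine learning as a Field of study that gives computers the ability to learn",
]

# Write the Case 1 layout: one plain-text document per line.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w", encoding="utf-8") as handle:
    for doc in raw_docs:
        handle.write(doc + "\n")

docs = list(read_line_documents(path))
print(docs[0].words[:3], docs[0].tags)

# Case 3 layout: splitting on whitespace leaves brackets, quotes and commas in the tokens.
case3_tokens = '["Machine", "learning"]'.split()
print(case3_tokens)
```

Under that assumption, the Case 1 layout yields clean tokens, while a line written as a Python-style list (Case 3) would be split into tokens that still carry brackets, quotes, and commas.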
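And when I run it on the test data, what should be the format of the sentence whose doc vector I want to predict? Should it be like case 1 or case 2 below, or something else?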
model.infer_vector(testSentence, alpha=start_alpha, steps=infer_epoch)
Should testSentence be:
Case 1: a string
testSentence = "Machine learning is an evolving field"
Case 2: a list of tokens
testSentence = ["Machine", "learning", "is", "an", "evolving", "field"]
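The difference between the two cases can be sketched in plain Python. This assumes that infer_vector treats its first argument as a sequence of word tokens, which is how the gensim documentation describes the `doc_words` parameter (a list of str). `to_tokens` is a hypothetical helper using simple whitespace splitting; it is not part of gensim.

```python
def to_tokens(sentence):
    """Split a raw sentence into whitespace-separated tokens (hypothetical helper)."""
    return sentence.split()


test_sentence = "Machine learning is an evolving field"

# Iterating over a raw string yields single characters, not words.
as_characters = list(test_sentence)
print(as_characters[:7])

# Splitting first yields the word tokens a word-based model would expect.
as_tokens = to_tokens(test_sentence)
print(as_tokens)
```

With gensim installed, the token list (Case 2) is what would be passed as the first argument, for example `model.infer_vector(as_tokens)`. The tokenization used here should match whatever was applied to the training file.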