Applying a similarity function in Gensim.Doc2Vec

I am trying to get the doc2vec functions to work in Python 3. I have the following code:

tekstdata = [[ index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()] 
def prep (x): 
    low = x.lower() 
    return word_tokenize(low) 

def cleanMuch(data, clean): 
    output = [] 
    for x, y in data: 
        z = clean(y)
        output.append([str(x), z])
    return output 

tekstdata = cleanMuch(tekstdata, prep) 

def tagdocs(docs): 
    output = []  
    for x,y in docs: 
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output 
    tekstdata = tagdocs(tekstdata) 

    print(tekstdata[100]) 

vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size = 100, window = 4,min_count = 3, iter = 2) 


ranks = [] 
second_ranks = [] 
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)

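Everything works, as far as I can tell, up to the rank computation. The error I get says a value is missing from my list; for example, for the document I pass in, '10' is not in the list: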

File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module> 
rank = [docid for docid, sim in sims].index(y) 

ValueError: '10' is not in list

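It seems to me that it is the similarity function that does not work. The model trains on my data (1000 documents) and builds a vocabulary, which is tagged. The documentation I have mainly used is this: Gensim documentation, Tutorial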

I hope someone can help. If any additional information is needed, please let me know. Best, Niels

Source

2017-10-04 Niels Helsø

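If you get ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list to see what is there, and whether it matches what you expect?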
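One way to do that inspection is sketched below. The helper rank_of and the example_sims data are hypothetical and only illustrate the (doc_id, similarity) shape that most_similar() returns; they are not taken from the question's data.

```python
def rank_of(tag, sims, preview=5):
    """Return the position of `tag` among the ids in `sims`, or None if absent."""
    doc_ids = [doc_id for doc_id, _ in sims]
    if tag not in doc_ids:
        # Show what the list really holds instead of raising ValueError.
        print(f"{tag!r} not found; first ids returned: {doc_ids[:preview]}")
        return None
    return doc_ids.index(tag)


# Hypothetical result in the (doc_id, similarity) shape of most_similar().
example_sims = [("1", 0.93), ("0", 0.91), ("7", 0.88)]

found = rank_of("0", example_sims)     # "0" is the second entry
missing = rank_of("10", example_sims)  # prints a preview and returns None
```

Printing the ids that actually came back makes it obvious when the model only knows tags that differ from the ones you expected.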

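It's not clear from your code excerpts that tagdocs() is ever called, and thus it is unclear what form tekstdata is in when it is passed to Doc2Vec. The intent is a bit convoluted, and nothing shows what the data looks like in its raw, original form.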

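But perhaps the tags you are supplying to TaggedDocument are not the required list of tags, but rather a simple string, which will be interpreted as a list of characters. As a result, even if you supply a tags value of '10', it will be seen as ['1', '0'], and len(vectorModel.docvecs.doctags) will be just 10 (one entry for each of the 10 single-digit strings).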
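The effect can be reproduced without training a model. The sketch below uses a plain namedtuple as a stand-in for gensim's TaggedDocument (which is itself a namedtuple with the fields words and tags); collect_tags is a hypothetical helper that only mimics iterating over each document's tags, not gensim's actual vocabulary-building code.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (fields: words, tags).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def collect_tags(docs):
    """Gather every distinct tag by iterating over each document's tags."""
    seen = set()
    for doc in docs:
        for tag in doc.tags:  # iterating a str yields single characters
            seen.add(tag)
    return seen


ids = [str(i) for i in range(1000)]
words = ["some", "tokens"]

wrong = [TaggedDocument(words, i) for i in ids]    # tags is a str
right = [TaggedDocument(words, [i]) for i in ids]  # tags is a list of one str

wrong_tags = collect_tags(wrong)
right_tags = collect_tags(right)

print(len(wrong_tags), "10" in wrong_tags)
print(len(right_tags), "10" in right_tags)
```

If this is indeed the cause, the corresponding change in the question's tagdocs() would be to pass the tag wrapped in a list, TaggedDocument(y, [x]).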

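Independent observations on your setup:

1000 documents is very small for Doc2Vec, where most published results use tens of thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might help with smaller datasets)
infer_vector() often works better with non-default values for its optional parameters, especially a much larger steps (20-200) or a starting alpha that is closer to the bulk-training default (0.025)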

Source

2017-10-04 21:56:56 gojomo

Thank you, gojomo. Your hint worked. Best
