2016-11-07 58 views
from spacy.en import English 
from numpy import dot 
from numpy.linalg import norm 

parser = English() 

# you can access known words from the parser's vocabulary 
nasa = parser.vocab['NASA'] 

# cosine similarity 
cosine = lambda v1, v2: dot(v1, v2)/(norm(v1) * norm(v2)) 

# gather all known words, take only the lowercased versions 
allWords = list({w for w in parser.vocab if w.has_repvec and w.orth_.islower() and w.lower_ != "nasa"}) 

# sort by similarity to NASA 
allWords.sort(key=lambda w: cosine(w.repvec, nasa.repvec)) 
allWords.reverse() 
print("Top 10 most similar words to NASA:") 
for word in allWords[:10]: 
    print(word.orth_) 

I am trying to run the above example, which uses word vectors in spaCy, but I get the following error:

Traceback (most recent call last):
  File "C:\Users\bulusu.kiran\Documents\WORK\nlp\wordVectors1.py", line 8, in <module>
    nasa = parser.vocab['NASA']
  File "spacy/vocab.pyx", line 330, in spacy.vocab.Vocab.__getitem__ (spacy/vocab.cpp:7708)
    orth = id_or_string
TypeError: an integer is required

The example is taken from: Intro to NLP with spaCy

What is causing this error?


Nice example you posted. If only they had things like this in their documentation. – cardamom

Answer


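What version of Python are you using? This could be the result of a Unicode error: in Python 2, a plain 'NASA' literal is a byte string, not a unicode string. I got it to work in Python 2.7 by replacing

nasa = parser.vocab['NASA']

with

nasa = parser.vocab[u'NASA']

You will then get this error: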

AttributeError: 'spacy.lexeme.Lexeme' object has no attribute 'has_repvec' 

There is a similar issue on the SpaCy repo, but both errors can be fixed by replacing has_repvec with has_vector and repvec with vector. I will also comment on that GitHub thread.
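The two fixes can also be sketched as small compatibility helpers, for code that has to run against either attribute naming. This is only a rough sketch and not part of spaCy: `as_text`, `lexeme_vector`, `OldLexeme` and `NewLexeme` are hypothetical names made up for illustration. Only the attribute names `has_repvec`/`repvec` and `has_vector`/`vector` come from this thread.

```python
def as_text(key, encoding="utf-8"):
    """Return key as a text (unicode) string, decoding byte strings.

    In Python 2 a plain 'NASA' literal is a byte string, so it gets decoded;
    in Python 3 it is already text and is returned unchanged.
    """
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


def lexeme_vector(lexeme):
    """Return the word vector under whichever attribute name exists, else None."""
    for flag, attr in (("has_vector", "vector"), ("has_repvec", "repvec")):
        if getattr(lexeme, flag, False):
            return getattr(lexeme, attr)
    return None


# Hypothetical stand-ins for lexemes with the old and new attribute names.
class OldLexeme(object):
    has_repvec = True
    repvec = [1.0, 2.0]


class NewLexeme(object):
    has_vector = True
    vector = [3.0, 4.0]


print(as_text(b"NASA"))
print(lexeme_vector(OldLexeme()), lexeme_vector(NewLexeme()))
```

With helpers like these, the lookup would read `parser.vocab[as_text('NASA')]`, assuming `parser` is loaded as in the code in this thread.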

The complete, updated code I used:

import spacy 

from numpy import dot 
from numpy.linalg import norm 

parser = spacy.load('en') 
nasa = parser.vocab[u'NASA'] 

# cosine similarity 
cosine = lambda v1, v2: dot(v1, v2)/(norm(v1) * norm(v2)) 

# gather all known words, take only the lowercased versions 
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "nasa"}) 

# sort by similarity to NASA 
allWords.sort(key=lambda w: cosine(w.vector, nasa.vector)) 
allWords.reverse() 
print("Top 10 most similar words to NASA:") 
for word in allWords[:10]: 
    print(word.orth_) 
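The ranking step itself does not depend on spaCy. As a rough sketch, the same cosine ranking can be tried on made-up vectors in plain Python, without numpy. The words and numbers in `toy_vectors` are invented for illustration and are not real embeddings, and `most_similar` is a hypothetical helper name. It uses `heapq.nlargest` in place of a full sort plus reverse, and returns 0.0 for zero-length vectors to avoid a division by zero.

```python
import heapq
import math


def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def most_similar(target, vectors, n=10):
    """Return the n words whose vectors are most similar to vectors[target]."""
    candidates = (w for w in vectors if w != target)
    return heapq.nlargest(
        n, candidates, key=lambda w: cosine(vectors[w], vectors[target])
    )


# Invented toy vectors, for illustration only.
toy_vectors = {
    "nasa": [1.0, 0.0, 0.0],
    "space": [0.9, 0.1, 0.0],
    "rocket": [0.7, 0.7, 0.0],
    "banana": [0.0, 0.0, 1.0],
}

top = most_similar("nasa", toy_vectors, n=2)
print(top)  # ['space', 'rocket']
```

For a full vocabulary, `nlargest` only keeps the top n candidates in memory order, which avoids sorting every word when only ten are printed.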

Hope this helps!


Thanks Eric, it works like a charm. – phani