2013-07-05 75 views

Although I have seen many questions related to this, I never really got an answer, probably because I am new to clustering with NLTK. I really need a beginner-level explanation of clustering, in particular of the vector representation used by NLTK's k-means clustering and how to use it. I have a word list like [cat, dog, kitten, puppy, etc.] and two other word lists like [carnivore, herbivore, pet, etc.] and [mammal, domestic, etc.]. I want to cluster the last two word lists using the first one as the means/centroids. I have tried, and I get an AssertionError like this:

clusterer = cluster.KMeansClusterer(2, euclidean_distance, initial_means=means) 
    File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 64, in __init__ 
    assert not initial_means or len(initial_means) == num_means 

AND 
    print clusterer.cluster(vectors, True) 
    File "C:\Python27\lib\site-packages\nltk\cluster\util.py", line 55, in cluster 
    self.cluster_vectorspace(vectors, trace) 
    File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 82, in cluster_vectorspace 
    self._cluster_vectorspace(vectors, trace) 
    File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 113, in _cluster_vectorspace 
    index = self.classify_vectorspace(vector) 
    File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 137, in classify_vectorspace 
    dist = self._distance(vector, mean) 
    File "C:\Python27\lib\site-packages\nltk\cluster\util.py", line 118, in euclidean_distance 
    diff = u - v 
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray' 

I think what I am missing is the vector representation of my means. A basic example of vector representations along with example code would be highly appreciated. Any solution using NLTK or pure Python would be welcome. Thank you in advance for your kind response.
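As a hedged sketch of what "vector representation" means here: each word must be mapped to a *numeric* NumPy array before it can be clustered (the `TypeError` in the traceback typically arises when the arrays hold strings rather than numbers, so subtraction is unsupported). The words, the two made-up features, and the mean positions below are illustrative assumptions, not from the question; the nearest-mean assignment step is the same one k-means performs internally.

```python
import numpy as np

# Hypothetical 2-feature encoding: [pet-likeness, carnivore-likeness].
# Both the words and the feature values are illustrative assumptions.
word_vectors = {
    'cat':  np.array([1.0, 1.0]),
    'dog':  np.array([1.0, 1.0]),
    'lion': np.array([0.0, 1.0]),
    'cow':  np.array([0.0, 0.0]),
}

# Two candidate cluster centres (means), also numeric arrays.
means = [np.array([1.0, 1.0]),   # "pet-like" centre
         np.array([0.0, 0.5])]   # "wild-like" centre

def nearest_mean(vector, means):
    """Return the index of the mean closest to vector (Euclidean distance)."""
    distances = [np.linalg.norm(vector - m) for m in means]
    return distances.index(min(distances))

assignments = {w: nearest_mean(v, means) for w, v in word_vectors.items()}
# 'cat' and 'dog' land on mean 0; 'lion' and 'cow' land on mean 1
```

With vectors of this numeric form, `nltk.cluster.KMeansClusterer(2, euclidean_distance, initial_means=means)` should accept them without the errors above (note the assertion in the traceback also requires `len(initial_means)` to equal the number of clusters requested).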


Would you mind pointing out some of those related questions? :) – arturomp


If you are comparing strings, shouldn't you use Hamming or Levenshtein distance rather than Euclidean? – Akavall

Answer


If I understand your question correctly, something like this should work. The hard part of k-means is finding the cluster centers; if you have already found them, or already know which centers you want, then for each point you simply compute the distance to every cluster center and assign the point to the nearest one.

(As a side note, sklearn is a great package for clustering and machine learning in general.)

For your example it would look like this:

Levenshtein

# levenshtein function is not my implementation; I copied it from the
# link above
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = xrange(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j+1 instead of j since previous_row and current_row
            # are one character longer than s2
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

def get_closest_lev(cluster_center_words, my_word):
    closest_center = None
    smallest_distance = float('inf')
    for word in cluster_center_words:
        ld = levenshtein(word, my_word)
        if ld < smallest_distance:
            smallest_distance = ld
            closest_center = word
    return closest_center

def get_clusters(cluster_center_words, other_words):
    cluster_dict = {}
    for word in cluster_center_words:
        cluster_dict[word] = []
    for my_word in other_words:
        closest_center = get_closest_lev(cluster_center_words, my_word)
        cluster_dict[closest_center].append(my_word)
    return cluster_dict

Example:

cluster_center_words = ['dog', 'cat'] 
other_words = ['dogg', 'kat', 'frog', 'car'] 

Result:

>>> get_clusters(cluster_center_words, other_words) 
{'dog': ['dogg', 'frog'], 'cat': ['kat', 'car']} 

Agreed, scikit-learn is very useful for statistical NLP. – alvas


@Akavall: This is nice, but how can we make this kind of classification semantic rather than based on character distance? For example, classifying fish-like things as fish, and all birds as birds? –