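Scipy negative distance? What?

I have an input file containing floating-point numbers to 4 decimal places:

i.e. 13359 0.0000 0.0000 0.0001 0.0001 0.0002 0.0003 0.0007 ...

(The first field is the ID.) My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then applies int() to it. On top of that, I also loop through every vector to make sure there are no negative values in it. However, when I run _hclustering, I keep seeing the error "Linkage Z contains negative values".

I seriously think this is a bug, because:

I checked my values,
the values are nowhere near small or large enough to approach the limits of floating-point numbers, and
the formula I used to derive the values in the file uses absolute values (my input is DEFINITELY right).

Can someone enlighten me as to why I am seeing this weird error? What is going on that causes this negative distance error?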

=====

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance, we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])

            if inflate:
                vectors[l[0]] = map(lambda x: int(x*10000), scores)  # int might save space
            else:
                vectors[l[0]] = scores

    if assertAllPositive:
        # Assert that it has no negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map(lambda x: x < 0, l)):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors

def main(self, inputDir, outputDir, limit=0,
         inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vector from a file and start clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate(pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile(limit, pjoin(inputDir, inFname))
    for threshold in map(lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue

def _hclustering(self, threshold, vectors):
    """function which you should call to vary the threshold
    vectors: { featureID: [ tfidf scores, tfidf score, .. ]
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        for i, featureID in enumerate(vectors.keys()):

Source

2010-04-07 disappearedng

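I had this problem in SciPy: unexpected negative values. The problem (for me) was that I didn't know the trig functions in Scipy default to radians. – doug 2010-04-07 07:21:29

I'm pretty sure this is because you are using the cosine metric when you call fclusterdata. Try using euclidean and see whether the error goes away.

The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and normalizing them, I'm pretty sure the dot products are greater than 1 a lot of the time in your data set. If you want to use the cosine metric, you'll need to normalize your data so that the dot product of two vectors is never greater than 1. See the formula on this page to see how the cosine metric is defined in Scipy.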
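One way to do the normalization described above is to scale every vector to unit Euclidean length, so that the dot product of any two vectors equals their cosine similarity and cannot exceed 1 except by rounding error. This is only a sketch in Python 3 with NumPy: the helper name normalize_rows and the sample data are made up for illustration, and leaving all-zero rows untouched is an assumption about how you might want to treat them.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row of X to unit L2 norm (hypothetical helper, not part of SciPy)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: leave all-zero rows as they are instead of dividing by zero.
    safe = np.where(norms == 0, 1.0, norms)
    return X / safe

# Made-up integer vectors in the style of the inflated tf-idf scores.
vectors = np.array([[0, 1, 3, 7], [2, 0, 0, 5], [10, 20, 30, 70]])
unit = normalize_rows(vectors)
dots = unit @ unit.T  # pairwise dot products, i.e. cosine similarities

print(dots.max())
```

Note that a correctly implemented cosine distance already divides by both norms, so rescaling the input should not change the result mathematically; it only matters if the metric in use skips that division.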

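Edit:

Well, from looking at the source code, I think the formula listed on that page isn't actually what Scipy uses (which is good, because the source code looks like it uses the normal and proper cosine distance formula). However, by the time the linkage is created, there are clearly some negative values in it for whatever reason. Try finding the distances between your vectors with scipy.spatial.distance.pdist() using metric='cosine' and check for negative values. If there aren't any, then the problem has to do with how the linkage is formed from the distance values.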
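The check suggested above could look roughly like this. It is a Python 3 sketch, and the array X is made-up data standing in for vectors.values(); only pdist and its metric argument come from SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Made-up stand-in for vectors.values(): one row per feature vector.
X = np.array([
    [0, 0, 1, 1, 2, 3, 7],
    [0, 0, 2, 2, 4, 6, 14],   # parallel to the first row, so the distance is ~0
    [5, 0, 0, 1, 0, 0, 2],
], dtype=float)

d = pdist(X, metric='cosine')   # condensed vector of pairwise distances

print("min distance:", d.min())
print("negative entries:", int((d < 0).sum()))
print("NaN entries:", int(np.isnan(d).sum()))
```

Pairs of (nearly) parallel vectors are the ones to watch, since their true distance is 0 and rounding can push the computed value to either side of it.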

Source

2010-04-07 05:18:54

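Great answer. Regarding "normalizing your data": what options do I have for normalizing my data so that I can still use the cosine distance native to scipy? I tried computing without any form of normalization (using just the raw tfidf values). Needless to say, the problem persists because of the inaccuracy of floats added up over so long a vector. What would you recommend I do? – disappearedng 2010-04-07 08:47:29

First, you should check where the problem lies. Is it after the distance calculation? If the cosine method is done correctly (which I now think it is, despite what the documentation says), then you shouldn't need to normalize. By the way, try using 'old_cosine' as your metric and see if you still get the error. – 2010-04-07 14:05:47

I can't improve on Justin's answer, but another point worth noting is your data handling.

You say you do something like int(float("0.0003") * 10000) to read the data. But if you do that, the product is not 3 but 2.9999999999999996, which int() then truncates to 2. That is because the floating-point inaccuracy gets multiplied as well.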
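A quick way to see the effect for yourself (Python 3 sketch; rounding before the int() conversion is my own suggestion here, not something the question's code does):

```python
raw = "0.0003"
product = float(raw) * 10000

print(repr(product))        # slightly below 3, because 0.0003 has no exact binary form
print(int(product))         # int() truncates toward zero
print(int(round(product)))  # rounding first recovers the intended integer
```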

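A better, or at least more accurate, way is to do the multiplication in the string. That is, use string manipulation to get from 0.0003 to 3.0 and so on.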
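As a rough sketch of that string approach (Python 3): the helper name fixed_point_to_int is made up, and it assumes non-negative plain decimal strings like the ones in the input file, with no sign and no exponent. Digits beyond the requested number of places are cut off, not rounded.

```python
def fixed_point_to_int(text, places=4):
    """Convert a non-negative decimal string such as '0.0003' to an integer
    number of 10**-places units without going through float.
    Hypothetical helper for illustration only."""
    whole, _, frac = text.strip().partition('.')
    # Pad or cut the fractional part to exactly `places` digits.
    frac = (frac + '0' * places)[:places]
    return int(whole or '0') * 10 ** places + int(frac or '0')

line = "13359\t0.0000\t0.0001\t0.0003\t0.0007"
fields = line.split('\t')
scores = [fixed_point_to_int(s) for s in fields[1:]]
print(fields[0], scores)
```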

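Perhaps there is even a Python datatype extension somewhere that can read this kind of data without loss of precision, on which you could perform the multiplication before converting. I'm not at home in SciPy/numerics, so I don't know.

Edit

Justin commented that there is a decimal type built into Python. It can interpret strings, multiply by integers, and convert to float (I tested that). That being the case, I would suggest updating your logic like this: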

factor = 1 
if inflate: 
    factor = 10000 
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])

This will reduce your rounding problems a bit.

Source

2010-04-07 06:16:05 extraneon

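Yes, there is such a module. It's called decimal. http://docs.python.org/library/decimal.html – 2010-04-07 14:06:50

This happens because of floating-point inaccuracy: some distances between vectors, instead of being 0, come out as, for example, -0.000000000000000002. Use the scipy.clip() function (the same function as numpy.clip) to correct the problem. If your distance matrix is dmatr, use numpy.clip(dmatr,0,1,dmatr) and you should be fine.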
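Put together, computing the distances yourself, clipping them, and only then clustering might look like the Python 3 sketch below. The data, the 'single' linkage method, the 'distance' criterion, and the threshold 0.2 are all assumptions chosen for illustration, not values taken from the question. The upper bound of 1 is fine for non-negative data such as tf-idf scores; with negative components, cosine distance can reach 2.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Made-up non-negative vectors, one row per item.
X = np.array([
    [0, 0, 1, 1, 2, 3, 7],
    [0, 0, 2, 2, 4, 6, 14],
    [5, 0, 0, 1, 0, 0, 2],
    [4, 1, 0, 1, 0, 0, 3],
], dtype=float)

dmatr = pdist(X, metric='cosine')   # condensed distance vector
np.clip(dmatr, 0, 1, out=dmatr)     # in place: tiny negatives become exactly 0

# Cluster on the cleaned distances instead of letting fclusterdata recompute them.
Z = hierarchy.linkage(dmatr, method='single')
labels = hierarchy.fcluster(Z, t=0.2, criterion='distance')
print(labels)
```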

Source

2012-06-05 16:58:17 dkar

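"Linkage Z contains negative values". This error also occurs in the scipy hierarchical clustering process when any linkage cluster index in the linkage matrix is assigned -1.

From my observations, a linkage cluster index gets assigned -1 during the combining process when the distance between all pairs of clusters or points to combine comes out as minus infinity. So the linkage function combines the clusters even though the linkage distance between them is infinite, and assigns one of the clusters or points a negative index.

Summary: if you are using cosine distance as the metric and the norm (magnitude) of any data point is zero, this error will occur.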
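A simple guard is to look for zero-norm vectors before clustering. The Python 3 sketch below uses made-up data and IDs; dropping the offending rows is just one possible policy, and you may prefer to report them instead. A row whose entries are all 0.0000 in the input file would be such a vector.

```python
import numpy as np

# Made-up vectors; the middle one is all zeros, so its cosine distance is undefined.
X = np.array([
    [0, 0, 1, 1, 2, 3, 7],
    [0, 0, 0, 0, 0, 0, 0],
    [5, 0, 0, 1, 0, 0, 2],
], dtype=float)
ids = ['13359', '13360', '13361']   # made-up IDs

norms = np.linalg.norm(X, axis=1)
keep = norms > 0

dropped = [i for i, k in zip(ids, keep) if not k]
X_clean = X[keep]
print("dropped zero vectors:", dropped)
```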

Source

2015-06-27 14:05:33

I ran into the same problem. What you can do is rewrite the cosine function. For example:

from sklearn.metrics.pairwise import cosine_similarity 
def mycosine(x1, x2): 
    x1 = x1.reshape(1,-1) 
    x2 = x2.reshape(1,-1) 
    ans = 1 - cosine_similarity(x1, x2) 
    return max(ans[0][0], 0)

...

clusters = hierarchy.fclusterdata(data, threshold, criterion='distance', metric=mycosine, method='average')

Source

2016-02-23 08:58:48
