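Scipy negative distance? What?
I have an input file consisting entirely of floating-point numbers with 4 decimal places,
e.g. 13359 0.0000 0.0000 0.0001 0.0001 0.0002 0.0003 0.0007 ...
(the first field is the id). My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then applies int() to it. On top of that, I also loop through each vector to make sure it contains no negative values. However, when I run _hclustering, I keep seeing the error "Linkage Z contains negative values".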
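I seriously think this is a bug, because:
- I checked my values,
- the values are nowhere near small or large enough to approach the limits of floating-point numbers, and
- the formula I used to derive the values in the file uses absolute values (my input is DEFINITELY right).
Can someone explain why I am seeing this weird error? What exactly is causing this negative distance error?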
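For reference, the cosine distance between two vectors u and v is 1 - (u·v) / (‖u‖ ‖v‖), so it can be unexpected even when every input value is non-negative. The following is a minimal pure-Python sketch of that formula, not SciPy's implementation. The function names cosine_distance and clamped_cosine_distance are made up for illustration. It highlights two cases worth checking: an all-zero vector (possible after int() truncation of a row of 0.0000 values), for which the distance is undefined, and nearly parallel vectors, for which rounding may push the result a hair below zero.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v), or None if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # 0/0: the cosine distance is undefined for a zero vector.
        return None
    # For (nearly) parallel vectors the ratio can round to slightly more
    # than 1.0, which would make the result slightly negative.
    return 1.0 - dot / (norm_u * norm_v)


def clamped_cosine_distance(u, v):
    """Same as cosine_distance, but never returns a value below 0.0."""
    d = cosine_distance(u, v)
    if d is None:
        return None
    return max(0.0, d)
```

Whether either case actually occurs in the data above is an assumption that would need to be checked against the real vectors.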
=====
def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance, we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])
            if inflate:
                vectors[ l[0]] = map(lambda x: int(x*10000), scores) #int might save space
            else:
                vectors[ l[0]] = scores
    if assertAllPositive:
        #Assert that it has no negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map(lambda x: x < 0, l)):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors
def main(self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vector from a file and start clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate(pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)
    vectors = self.loadVectorsFromFile(limit, pjoin(inputDir, inFname))
    for threshold in map(lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue
def _hclustering(self, threshold, vectors):
    """function which you should call to vary the threshold
    vectors: { featureID: [ tfidf scores, tfidf score, .. ]
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False
        for i, featureID in enumerate(vectors.keys()):
I had this problem with SciPy too: unexpected negative values. The problem (for me) was that I didn't know the trig functions in SciPy default to radians. – doug 2010-04-07 07:21:29
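One way to test whether rounding or zero vectors are behind the error is to compute the cosine distances explicitly, clamp them at zero, and only then build the linkage. The sketch below is an illustration under assumptions, not a confirmed fix. The helper name cluster_with_clipped_cosine is hypothetical, and method='single' with criterion='inconsistent' are chosen because I believe they match the fclusterdata defaults.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def cluster_with_clipped_cosine(rows, threshold):
    """Cluster rows by cosine distance after dropping zero rows and clamping at 0.

    Returns (keep, labels): keep is a boolean mask over the input rows,
    labels holds one cluster id per kept row.
    """
    data = np.asarray(rows, dtype=float)
    # The cosine distance is undefined for all-zero rows, so leave them out.
    keep = np.any(data != 0, axis=1)
    data = data[keep]
    if len(data) < 2:
        return keep, np.array([], dtype=int)
    # Condensed pairwise distance vector, as expected by linkage().
    dists = distance.pdist(data, metric='cosine')
    # Clamp tiny negative values produced by floating-point rounding.
    dists = np.clip(dists, 0.0, None)
    Z = hierarchy.linkage(dists, method='single')
    labels = hierarchy.fcluster(Z, threshold, criterion='inconsistent')
    return keep, labels
```

If the error disappears with this version, printing the minimum of the unclipped distances and the number of dropped rows should show which of the two cases applied.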