Pairwise edit distance with Python and NumPy

I have a NumPy array of strings, and I want to compute the pairwise edit distance between every pair of elements using scipy.spatial.distance.pdist, documented at http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html

A sample of my array looks like this:

>>> d[0:10] 
array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT', 
    'GATTT', 'TCTTT', 'ACTTT'], 
    dtype='|S5')

However, since pdist has no "editdistance" option, I wanted to supply a custom distance function. I tried this and got the following error:

>>> import editdist 
>>> import scipy 
>>> import scipy.spatial 
>>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v)) 

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist 
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)]) 
    File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double 
    X = np.double(X) 
ValueError: could not convert string to float: TTTTT

2014-06-06 Vahid Mir

It looks like pdist just isn't suited to strings. You may want to look at https://docs.python.org/2/library/difflib.html – Pavel

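The line that raises the error is the second line of 'pdist'. So you need to convert the strings to some kind of number before passing them to 'pdist'. 'pdist' also wants a 2D array. – hpaulj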

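If you really must use pdist, you first need to convert your strings to a numeric format. If you know all the strings will be the same length, you can do this fairly easily: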

numeric_d = d.view(np.uint8).reshape((len(d),-1))

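This simply views your array of strings as one long array of uint8 bytes, then reshapes it so that each original string is a row of its own. For your example, the result looks like this: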

In [18]: d.view(np.uint8).reshape((len(d),-1)) 
Out[18]: 
array([[84, 84, 84, 84, 84], 
     [65, 84, 84, 84, 84], 
     [67, 84, 84, 84, 84], 
     [71, 84, 84, 84, 84], 
     [84, 65, 84, 84, 84], 
     [65, 65, 84, 84, 84], 
     [67, 65, 84, 84, 84], 
     [71, 65, 84, 84, 84], 
     [84, 67, 84, 84, 84], 
     [65, 67, 84, 84, 84]], dtype=uint8)
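
As a rough check of the view trick, the sketch below rebuilds a few of the strings from the question and confirms the shape of the result. One assumption worth flagging: the question uses Python 2, where the array has dtype '|S5'. On Python 3 a plain str array is stored as '<U5' (4 bytes per character), so the sketch converts it to bytes first with astype('S'). The names d_str and numeric_str are only illustrative.

```python
import numpy as np

# Byte strings (dtype 'S5'), matching the array in the question.
d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'GTTTT'])

# One row per string, one column per character code.
numeric_d = d.view(np.uint8).reshape((len(d), -1))
print(numeric_d.shape)

# Each row converts back to the original bytes.
print(numeric_d[1].tobytes())

# On Python 3, a str array is unicode ('<U5'); convert to bytes before viewing.
d_str = np.array(['TTTTT', 'ATTTT'])
numeric_str = d_str.astype('S').view(np.uint8).reshape((len(d_str), -1))
print(numeric_str.shape)
```

The view only works this cleanly because every string has the same length; with mixed lengths NumPy pads shorter byte strings with zero bytes, which would then count as characters.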

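Then you can use pdist as usual. Just make sure your editdist function expects integer arrays rather than strings. You can quickly convert the incoming rows back by calling .tostring() (spelled .tobytes() in newer NumPy releases):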

def editdist(x, y):
    # x and y are rows of the numeric array built above
    s1 = x.tobytes()  # spelled x.tostring() on older NumPy
    s2 = y.tobytes()
    # ... rest of function as before ...
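
Putting the pieces together, here is a minimal end-to-end sketch. It makes two assumptions: the levenshtein function is a plain dynamic-programming stand-in for editdist.distance (so the example runs without that package), and row_metric is a hypothetical wrapper name. The wrapper casts each row back to uint8 before converting to bytes, because some SciPy versions hand the metric float64 rows rather than the original integers.

```python
import numpy as np
from scipy.spatial.distance import pdist

def levenshtein(s1, s2):
    """Plain dynamic-programming edit distance (stand-in for editdist.distance)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def row_metric(u, v):
    # Cast back to uint8 in case pdist passed the rows as floats.
    s1 = np.asarray(u).astype(np.uint8).tobytes()
    s2 = np.asarray(v).astype(np.uint8).tobytes()
    return levenshtein(s1, s2)

d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'TATTT'])
numeric_d = d.view(np.uint8).reshape((len(d), -1))

# Condensed distances, ordered (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
dists = pdist(numeric_d, row_metric)
print(dists)
```

Since the metric is a Python callable, pdist still calls it once per pair, so this is convenient rather than fast.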

2014-06-07 06:50:47 perimosocordiae

...or just compute the edit distance directly on the uint8 arrays. – eickenberg

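import numpy as np

def my_pdist(data, f):
    # Condensed pairwise distances: one entry per pair (i, j) with i < j.
    N = len(data)
    # Integer division: the array size must be an int (plain / gives a float on Python 3).
    matrix = np.empty(N * (N - 1) // 2)
    ind = 0
    for i in range(N):
        for j in range(i + 1, N):
            matrix[ind] = f(data[i], data[j])
            ind += 1
    return matrix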
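
Because a loop like my_pdist never converts its input to numbers, it can be applied to the strings directly. The sketch below is one possible variant, not part of the answer: pairwise is a hypothetical name that builds the same condensed ordering with itertools.combinations, and mismatches is a stand-in distance that counts differing positions in equal-length strings. Any two-argument function, such as editdist.distance, could take its place. scipy.spatial.distance.squareform then expands the condensed vector into a full symmetric matrix.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform

def pairwise(data, f):
    # Same condensed ordering as my_pdist: pairs (i, j) with i < j.
    return np.array([f(a, b) for a, b in combinations(data, 2)], dtype=float)

def mismatches(a, b):
    # Stand-in distance: number of differing positions (equal-length strings).
    return sum(x != y for x, y in zip(a, b))

words = ['TTTTT', 'ATTTT', 'TATTT']
condensed = pairwise(words, mismatches)

# Expand to a square matrix with zeros on the diagonal.
square = squareform(condensed)
print(square)
```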

2016-10-07 19:26:05

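Providing some context and additional information about how this answers the question benefits not only the original asker but also future visitors looking at this solution. A bare "code only" snippet is not the best form for an answer. – gravity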
