編輯Haskell中的距離算法 - 性能調優

我試圖在Haskell中實現levenshtein距離（或編輯距離），但當字符串長度增加時，其性能迅速下降。編輯Haskell中的距離算法 - 性能調優

我對Haskell還是比較新的，所以如果你能給我一些關於如何改進算法的建議，那將會很好。我已經嘗試過「預先計算」值（inits），但是因爲它沒有改變任何東西，所以我恢復了這個改變。

我知道在Hackage上已經有一個editDistance的實現，但我需要它在任意標記列表上工作，而不一定是字符串。另外，我覺得它有點複雜，至少與我的版本相比。

所以，這裏是代碼：

 
-- standard levenshtein distance between two lists 
editDistance  :: Eq a => [a] -> [a] -> Int 
editDistance s1 s2 = editDistance' 1 1 1 s1 s2 

-- weighted levenshtein distance 
-- ins, sub and del are the costs for the various operations 
editDistance'  :: Eq a => Int -> Int -> Int -> [a] -> [a] -> Int 
editDistance' _ _ ins s1 [] = ins * length s1 
editDistance' _ _ ins [] s2 = ins * length s2 
editDistance' del sub ins s1 s2 
    | last s1 == last s2 = editDistance' del sub ins (init s1) (init s2) 
    | otherwise   = minimum [ editDistance' del sub ins s1 (init s2)  + del -- deletion 
            , editDistance' del sub ins (init s1) (init s2) + sub -- substitution 
            , editDistance' del sub ins (init s1) s2  + ins -- insertion 
            ]

這似乎是一個正確的實現，至少它給了完全相同的結果，因爲這online tool。

在此先感謝您的幫助！如果您需要任何其他信息，請告訴我。

問候， BZN

來源

2011-04-01 bzn

忽略，這是一個糟糕的算法（應memoizing，我到達那個第二）...

使用O（1）原語，而不是爲O（n）

一個問題是你使用了一大堆調用O（n）的列表（haskell列表是單獨鏈接列表）。一個更好的數據結構會給你O（1）操作，我用Vector：

import qualified Data.Vector as V 

-- standard levenshtein distance between two lists 
editDistance  :: Eq a => [a] -> [a] -> Int 
editDistance s1 s2 = editDistance' 1 1 1 (V.fromList s1) (V.fromList s2) 

-- weighted levenshtein distance 
-- ins, sub and del are the costs for the various operations 
editDistance'  :: Eq a => Int -> Int -> Int -> V.Vector a -> V.Vector a -> Int 
editDistance' del sub ins s1 s2 
    | V.null s2 = ins * V.length s1 
    | V.null s1 = ins * V.length s2 
    | V.last s1 == V.last s2 = editDistance' del sub ins (V.init s1) (V.init s2) 
    | otherwise   = minimum [ editDistance' del sub ins s1 (V.init s2)  + del -- deletion 
            , editDistance' del sub ins (V.init s1) (V.init s2) + sub -- substitution 
            , editDistance' del sub ins (V.init s1) s2  + ins -- insertion 
            ]

是O（N）的列表包括初始化，length的操作，和last（雖然INIT能夠偷懶的最小）。所有這些操作都是使用Vector的O（1）。

雖然真正的標杆應該使用Criterion，一個快速和骯髒的基準：

str2 = replicate 15 'a' ++ replicate 25 'b' 
str1 = replicate 20 'a' ++ replicate 20 'b' 
main = print $ editDistance str1 str2

顯示矢量版本需要0.09秒而串採取1.6秒，所以我們節省了大約一個數量級，甚至沒有看你的editDistance算法。

現在怎麼樣記憶結果？

更大的問題顯然是需要記憶。我將此作爲了解monad-memo包裹的機會 - 我的上帝真的太棒了！對於一個額外的約束條件（您需要Ord a），您基本上不費力氣就可以進行記憶。代碼：

import qualified Data.Vector as V 
import Control.Monad.Memo 

-- standard levenshtein distance between two lists 
editDistance  :: (Eq a, Ord a) => [a] -> [a] -> Int 
editDistance s1 s2 = startEvalMemo $ editDistance' (1, 1, 1, (V.fromList s1), (V.fromList s2)) 

-- weighted levenshtein distance 
-- ins, sub and del are the costs for the various operations 
editDistance' :: (MonadMemo (Int, Int, Int, V.Vector a, V.Vector a) Int m, Eq a) => (Int, Int, Int, V.Vector a, V.Vector a) -> m Int 
editDistance' (del, sub, ins, s1, s2) 
    | V.null s2 = return $ ins * V.length s1 
    | V.null s1 = return $ ins * V.length s2 
    | V.last s1 == V.last s2 = memo editDistance' (del, sub, ins, (V.init s1), (V.init s2)) 
    | otherwise = do 
     r1 <- memo editDistance' (del, sub, ins, s1, (V.init s2)) 
     r2 <- memo editDistance' (del, sub, ins, (V.init s1), (V.init s2)) 
     r3 <- memo editDistance' (del, sub, ins, (V.init s1), s2) 
     return $ minimum [ r1 + del -- deletion 
         , r2 + sub -- substitution 
         , r3 + ins -- insertion 
            ]

您會看到memoization是如何需要一個「鍵」（請參閱MonadMemo類）？我將所有參數打包成一個很大的醜陋元組。它也需要一個「價值」，這是你的結果Int。然後，只需使用「備忘錄」功能即可即插即用您想要記憶的值。

對於基準我用較短，但較大的距離，字符串：

$ time ./so # the memoized vector version 
12 

real 0m0.003s 

$ time ./so3 # the non-memoized vector version 
12 

real 1m33.122s

千萬別想運行非memoized字符串版本，我想，這將需要大約15分鐘在最低限度。至於我，我現在喜歡monad-memo - 感謝Eduard的包裝！

編輯：String和Vector之間的差異在memoized版本中沒有那麼多，但當距離達到200左右時仍然增長到2倍，所以仍然值得。

編輯：也許我應該解釋爲什麼更大的問題是「明顯」記憶結果。好吧，如果你看一下原始算法的心臟：

[ editDistance' ... s1   (V.init s2) + del 
, editDistance' ... (V.init s1) (V.init s2) + sub 
, editDistance' ... (V.init s1) s2   + ins]

這是相當清楚的editDistance' s1 s2結果在3調用editDistance'一個電話......每一個來電editDistance'三次......還有三個時間......和AHHH！指數爆炸！幸運的是，大多數電話是相同的！例如（使用-->的「電話」和eD爲editDistance'）：

eD s1 s2 --> eD s1 (init s2)    -- The parent 
      , eD (init s1) s2 
      , eD (init s1) (init s2) 
eD (init s1) s2 --> eD (init s1) (init s2)   -- The first "child" 
        , eD (init (init s1)) s2 
        , eD (init (init s1)) (init s2) 
eD s1 (init s2) --> eD s1 (init (init s2)) 
        , eD (init s1) (init s2) 
        , eD (init s1) (init (init s2))

只需通過考慮父母和兩個孩子立即可以看到通話ed (init s1) (init s2)做三次。另一個孩子與父母共享呼叫，所有的孩子都與另一個孩子（和他們的孩子，提示Monty Python skit）共享許多呼叫。

這將是一個有趣的，也許有啓發性的練習，使runMemo類似的函數返回所使用的緩存結果的數量。

來源

2011-04-01 15:20:32

哇，這是偉大的。我以前聽說過莫諾化，但我從來沒有想到這很容易！當你說「忽略這是一個不好的算法（應該記憶，我到那一秒）......」，你是指算法的整體結構還是僅僅是應該使用記憶的事實？對我來說，算法本身看起來不錯。 :) – bzn 2011-04-01 21:14:57

bzn：我只是認爲這不是記憶的事實。如果您之前沒有看過記憶，那麼請參閱[Haskell wiki]（http://www.haskell.org/haskellwiki/Memoization），CS算法手冊，或兩者。如果沒有記憶，你可以多次計算大部分值，但是記憶只能計算一次，否則就會查找以前計算的結果。例如，要計算列表的第一個元素（'editDist s1（init s2）'），函數最終將計算'editDist（init s1）（init s2）'。這是調用者列表中的第二個元素，並且是被調用者列表中的第三個元素！ – 2011-04-01 22:09:07

@bzn我添加了一個編輯，談論爲什麼這個問題是「顯然」memoization。 – 2011-04-01 22:39:07

您需要記憶editDistance'。有很多方法可以做到這一點，例如遞歸定義的數組。

來源

2011-04-01 15:42:10 augustss

當我投票贊成你時，爲什麼會出現獨角獸和氣球？如果您在ICFP上發佈太多論文，會發生什麼？ – 2011-04-01 16:06:50

我希望我可以對獨角獸要求任何責任。 – augustss 2011-04-01 18:41:39

@TomMD這是一個SO愚人節禮物。 – sclv 2011-04-01 20:26:35

我知道有已經是editDistance實現上Hackage，但我需要它在任意標記列表的操作，不一定字符串

是否有令牌的數量有限？我建議你試着簡單地設計一個從令牌到角色的映射。畢竟有10,646 characters at your disposal。

來源

2011-04-01 16:37:30

謝謝，但現在我要使用我的解決方案，因爲像TomMD提出的那樣調整它，應該使其速度夠快 - 畢竟這是我所需要的。：P – bzn 2011-04-01 21:20:47

如前所述，memoization是你所需要的。此外，您正在查看從右到左的編輯距離，這對字符串來說效率並不高，無論方向如何，編輯距離都是相同的。那就是：editDistance (reverse a) (reverse b) == editDistance a b

爲了解決備忘錄部分有很多庫可以幫助你。在我的例子中，我選擇了MemoTrie，因爲它很容易使用並且在這裏表現很好。

import Data.MemoTrie(memo2) 

editDistance' del sub ins = memf 
    where 
    memf = memo2 f 
    f s1  []  = ins * length s1 
    f []  s2  = ins * length s2 
    f (x:xs) (y:ys) 
    | x == y = memf xs ys 
    | otherwise = minimum [ del + memf xs (y:ys), 
          sub + memf (x:xs) ys, 
          ins + memf xs ys]

正如你所看到的你所需要的是添加記憶。其餘的都是一樣的，只不過我們從最後的名單開始。

來源

2011-04-05 07:36:36 HaskellElephant

感謝您的提示。 – bzn 2011-04-07 19:47:19

備忘錄+1。這個真棒！ – Rotsor 2011-07-16 22:25:33

但爲什麼你在'f（x：xs）（y：ys）'的第一個方程中使用'f'而不是'memf'？ – Rotsor 2011-07-16 22:28:40

這個版本比那些記憶的版本快得多，但我仍然希望它更快。適用於100個字符長的字符串。我是用其他距離編寫的（改變初始化函數和成本），並使用經典的動態編程數組技巧。長長的一行可以轉換成一個單獨的函數，頂部'做'，但我喜歡這種方式。

import Data.Array.IO 
import System.IO.Unsafe 

editDistance = dist ini med 

dist :: (Int -> Int -> Int) -> (a -> a -> Int) -> [a] -> [a] -> Int 
dist i f a b = unsafePerformIO $ distM i f a b 

-- easy to create other distances 
ini i 0 = i 
ini 0 j = j 
ini _ _ = 0 
med a b = if a == b then 0 else 2 


distM :: (Int -> Int -> Int) -> (a -> a -> Int) -> [a] -> [a] -> IO Int 
distM ini f a b = do 
     let la = length a 
     let lb = length b 

     arr <- newListArray ((0,0),(la,lb)) [ini i j | i<- [0..la], j<-[0..lb]] :: IO (IOArray (Int,Int) Int) 

-- all on one line 
     mapM_ (\(i,j) -> readArray arr (i-1,j-1) >>= \ld -> readArray arr (i-1,j) >>= \l -> readArray arr (i,j-1) >>= \d-> writeArray arr (i,j) $ minimum [l+1,d+1, ld + (f (a !! (i-1)) (b !! (j-1))) ]) [(i,j)| i<-[1..la], j<-[1..lb]] 

     readArray arr (la,lb)

來源

2012-03-16 23:43:19 tolysz

All on一行不是很好的佈局......並且unsafePerformIO並不是真的必要，絕對不可取 - 代碼可以用ST monad進行重寫，只需進行很少的更改。 – Oliver 2012-05-29 11:12:08

人們推薦你使用通用的記憶化庫，但定義Levenshtein距離普通動態規劃的簡單的任務是綽綽有餘。一個非常簡單的多態基於列表的實現：

distance s t = 
    d !!(length s)!!(length t) 
    where d = [ [ dist m n | n <- [0..length t] ] | m <- [0..length s] ] 
      dist i 0 = i 
      dist 0 j = j 
      dist i j = minimum [ d!!(i-1)!!j+1 
          , d!!i!!(j-1)+1 
          , d!!(i-1)!!(j-1) + (if s!!(i-1)==t!!(j-1) 
                then 0 else 1) 
          ]

或者，如果你需要長時間序列實際速度，你可以使用一個可變數組：

import Data.Array 
import qualified Data.Array.Unboxed as UA 
import Data.Array.ST 
import Control.Monad.ST 


-- Mutable unboxed and immutable boxed arrays 
distance :: Eq a => [a] -> [a] -> Int 
distance s t = d UA.! (ls , lt) 
    where s' = array (0,ls) [ (i,x) | (i,x) <- zip [0..] s ] 
      t' = array (0,lt) [ (i,x) | (i,x) <- zip [0..] t ] 
      ls = length s 
      lt = length t 
      (l,h) = ((0,0),(length s,length t)) 
      d = runSTUArray $ do 
       m <- newArray (l,h) 0 
       for_ [0..ls] $ \i -> writeArray m (i,0) i 
       for_ [0..lt] $ \j -> writeArray m (0,j) j 
       for_ [1..lt] $ \j -> do 
           for_ [1..ls] $ \i -> do 
            let c = if s'!(i-1)==t'! (j-1) 
              then 0 else 1 
            x <- readArray m (i-1,j) 
            y <- readArray m (i,j-1) 
            z <- readArray m (i-1,j-1) 
            writeArray m (i,j) $ minimum [x+1, y+1, z+c ] 
       return m 

for_ xs f = mapM_ f xs

來源

2012-03-17 14:58:35

編輯Haskell中的距離算法 - 性能調優

回答

相關問題