2014-09-10 71 views
1

我試圖實現真正高效的Clojure函數來計算Damerau-Levenshtein distance。我決定使用this algorithm(附帶的源代碼應該是C++)來計算Levenshtein距離並添加一些行以使其適用於DLD。Damerau-Levenshtein距離的高效實現

以下是我在的Common Lisp(我希望它可以幫助)創建:(?函數式)

(defun damerau-levenshtein (x y) 
    (declare (type string x y) 
      #.*std-opts*) 
    (let* ((x-len (length x)) 
     (y-len (length y)) 
     (v0 (apply #'vector (mapa-b #'identity 0 y-len))) 
     (v1 (make-array (1+ y-len) :element-type 'integer)) 
     (v* (make-array (1+ y-len) :element-type 'integer))) 
    (do ((i 0 (1+ i))) 
     ((= i x-len) (aref v0 y-len)) 
     (setf (aref v1 0) (1+ i)) 
     (do ((j 0 (1+ j))) 
      ((= j y-len)) 
     (let* ((x-i (char x i)) 
       (y-j (char y j)) 
       (cost (if (char-equal x-i y-j) 0 1))) 
      (setf (aref v1 (1+ j)) (min (1+ (aref v1 j)) 
             (1+ (aref v0 (1+ j))) 
             (+ (aref v0 j) cost))) 
      (when (and (plusp i) (plusp j)) 
      (let ((x-i-1 (char x (1- i))) 
        (y-j-1 (char y (1- j))) 
        (val (+ (aref v* (1- j)) cost))) 
       (when (and (char-equal x-i y-j-1) 
         (char-equal x-i-1 y-j) 
         (< val (aref v1 (1+ j)))) 
       (setf (aref v1 (1+ j)) val)))))) 
     (rotatef v* v0 v1)))) 

現在,我怕我不能把它翻譯成真正有效和地道的Clojure代碼。我非常感謝任何建議,我認爲它對未來的許多讀者也可能非常有用。

P.S.我發現this implementation,但如果它是有效的,我懷疑它使用一些過時的contrib功能(deep-merge-withbool-to-binary):

(defn damerau-levenshtein-distance 
    [a b] 
    (let [m (count a) 
     n (count b) 
     init (apply deep-merge-with (fn [a b] b) 
        (concat 
        ;;deletion 
        (for [i (range 0 (+ 1 m))] 
         {i {0 i}}) 
        ;;insertion 
        (for [j (range 0 (+ 1 n))] 
         {0 {j j}}))) 
     table (reduce 
       (fn [d [i j]] 
       (deep-merge-with 
        (fn [a b] b) 
        d 
        (let [cost (bool-to-binary (not (= (nth a (- i 1)) 
              (nth b (- j 1))))) 
         x 
          (min 
          (+ ((d (- i 1)) 
           j) 1) ;;deletion 
          (+ ((d i) 
           (- j 1)) 1) ;;insertion 
          (+ ((d (- i 1)) 
           (- j 1)) cost)) ;;substitution)) 
         val (if (and (> i 1) 
           (> j 1) 
           (= (nth a (- i 1)) 
            (nth b (- j 2))) 
           (= (nth a (- i 2)) 
            (nth b (- j 1)))) 
         (min x (+ ((d (- i 2)) 
            (- j 2)) ;;transposition 
            cost)) 
         x)] 
        {i {j val}}))) 
       init 
       (for [j (range 1 (+ 1 n)) 
        i (range 1 (+ 1 m))] [i j]))] 
    ((table m) n))) 

回答

1

OK,這應該做的伎倆(基於KIMA's answer):

(defn da-lev [str1 str2] 
    (let [l1 (count str1) 
     l2 (count str2) 
     mx (new-matrix :ndarray (inc l1) (inc l2))] 
    (mset! mx 0 0 0) 
    (dotimes [i l1] 
    (mset! mx (inc i) 0 (inc i))) 
    (dotimes [j l2] 
    (mset! mx 0 (inc j) (inc j))) 
    (dotimes [i l1] 
    (dotimes [j l2] 
     (let [i+ (inc i) j+ (inc j) 
      i- (dec i) j- (dec j) 
      cost (if (= (.charAt str1 i) 
         (.charAt str2 j)) 
        0 1)] 
     (mset! mx i+ j+ 
       (min (inc (mget mx i j+)) 
        (inc (mget mx i+ j)) 
        (+ (mget mx i j) cost))) 
     (if (and (pos? i) (pos? j) 
        (= (.charAt str1 i) 
        (.charAt str2 j-)) 
        (= (.charAt str1 i-) 
        (.charAt str2 j))) 
      (mset! mx i+ j+ 
        (min (mget mx i+ j+) 
         (+ (mget mx i- j-) cost))))))) 
    (mget mx l1 l2))) 

請注意,您需要core.matrix庫,它是不是標準(儘管它的名字)。

[net.mikera/core.matrix "0.29.1"] 

庫生活在命名空間clojure.core.matrix:人們可以用Leiningen這種方式安裝。要按原樣使用這個解決方案,你應該從命名空間中'添加'符號到命名空間中。

2

最近,我用Clojure寫一個高效的Levenshtein距離的函數來計算之間的編輯地面實況文本和ocr引擎結果。 遞歸實現的性能不足以快速計算兩個整頁之間的levenshtein距離,所以我的實現使用動態編程。 它使用core.matrix來處理矩陣的東西,而不是下降到Java 2D數組。 爲damerau-levenshtein添加換位材料應該不難。

(defn lev [str1 str2] 
    (let [mat (new-matrix :ndarray (inc (count str1)) (inc (count str2))) 
     len1 (count str1) len2 (count str2)] 
    (mset! mat 0 0 0) 
    (dotimes [i lein1] 
    (mset! mat (inc i) 0 (inc i))) 
    (dotimes [j len2] 
    (mset! mat 0 (inc j) (inc j))) 
    (dotimes [dj len2] 
    (dotimes [di len1] 
     (let [j (inc dj) i (inc di)] 
     (mset! mat i j 
       (cond 
       (= (.charAt ^String str1 di) (.charAt ^String str2 dj)) 
       (mget mat di dj) 
       :else 
       (min (inc (mget mat di j)) (inc (mget mat i dj)) 
        (inc (mget mat di dj)))))))) 
    (mget mat len1 len2)))) 

希望這有助於