2013-04-27 59 views
8

我正在用Clojure編寫一個簡單的桌面搜索引擎,作爲了解更多關於該語言的一種方式。到目前爲止,我的程序在文本處理階段的表現非常糟糕。如何提高Clojure中的文本處理性能?

在文本處理我已經到:

  • 清理不需要的字符;
  • 將字符串轉換爲小寫;
  • 拆分文檔以獲取單詞列表;
  • 構建將每個單詞與其文檔中的出現相關聯的映射。

下面是代碼:

(ns txt-processing.core 
    (:require [clojure.java.io :as cjio]) 
    (:require [clojure.string :as cjstr]) 
    (:gen-class)) 

(defn all-files [path] 
    (let [entries (file-seq (cjio/file path))] 
    (filter (memfn isFile) entries))) 

(def char-val 
    (let [value #(Character/getNumericValue %)] 
    {:a (value \a) :z (value \z) 
    :A (value \A) :Z (value \Z) 
    :0 (value \0) :9 (value \9)})) 

(defn is-ascii-alpha-num [c] 
    (let [n (Character/getNumericValue c)] 
    (or (and (>= n (char-val :a)) (<= n (char-val :z))) 
     (and (>= n (char-val :A)) (<= n (char-val :Z))) 
     (and (>= n (char-val :0)) (<= n (char-val :9)))))) 

(defn is-valid [c] 
    (or (is-ascii-alpha-num c) 
     (Character/isSpaceChar c) 
     (.equals (str \newline) (str c)))) 

(defn lower-and-replace [c] 
    (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c))) 

(defn tokenize [content] 
    (let [filtered (filter is-valid content) 
     lowered (map lower-and-replace filtered)] 
    (cjstr/split (apply str lowered) #"\s+"))) 

(defn process-content [content] 
    (let [words (tokenize content)] 
    (loop [ws words i 0 hmap (hash-map)] 
     (if (empty? ws) 
     hmap 
     (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i))))))) 

(defn -main [& args] 
    (doseq [file (all-files (first args))] 
    (let [content (slurp file) 
      oc-list (process-content content)] 
     (println "File:" (.getPath file) 
       "| Words to be indexed:" (count oc-list))))) 

當我有這個問題的another implementation在Haskell,我比較兩者,你可以在下面的輸出看到。

Clojure的版本:

$ lein uberjar 
Compiling txt-processing.core 
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar 
Including txt-processing-0.1.0-SNAPSHOT.jar 
Including clojure-1.5.1.jar 
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar 
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data 
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033 
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028 
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562 
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754 
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418 
File: ../data/.directory | Words to be indexed: 3 
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191 
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378 
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451 
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049 
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721 
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494 
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642 

real 2m2.164s 
user 2m3.868s 
sys  0m0.978s 

哈斯克爾版本:

$ ghc -rtsopts --make txt-processing.hs 
[1 of 1] Compiling Main    (txt-processing.hs, txt-processing.o) 
Linking txt-processing ... 
$ time ./txt-processing ../data/ +RTS -K12m 
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033 
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028 
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562 
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754 
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418 
File: ../data/.directory | Words to be indexed: 3 
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191 
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378 
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451 
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049 
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721 
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494 
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642 

real 0m9.086s 
user 0m8.591s 
sys  0m0.463s 

我認爲( - >懶惰序列)在Clojure的實現轉換查殺性能。我該如何改進它?

P.S:這些測試中使用的所有代碼和數據可以下載here

+1

什麼jar是JVM的啓動時間嗎? Haskell有類似的開銷嗎? – georgek 2013-04-27 23:40:36

+0

我機器中的JVM啓動時間大約爲1秒。我認爲Haskell由於運行時系統(RTS)有一些開銷,但我應該比JVM低得多。 – luisgabriel 2013-04-28 00:41:41

+0

s /我應該/它應該/ – luisgabriel 2013-04-28 01:18:43

回答

4

有些事情可以做,可能會加速這一代碼了:

1)代替字符之間的映射你的charschar-val,只是做直接價值的比較。這在Java中速度更快的原因會更快。

2)重複使用str可將單字符值轉換爲完整字符串。再次考慮直接使用字符值。同樣,對象創建速度也很慢,與Java中的相同。

3)您應該用clojure.core/frequencies替換process-content。也許檢查frequencies來源,看看它是如何更快。

4)如果您必須在循環中更新(hash-map),請使用transient。請參閱:http://clojuredocs.org/clojure_core/clojure.core/transient

還要注意的是(hash-map)返回PersistentArrayMap,所以你每次調用update-in創建新實例 - 因此慢,爲什麼你應該使用瞬變。

5)這是你的朋友:(set! *warn-on-reflection* true) - 你有相當多的反思,可以從type hints

Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved. 
Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved. 
Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved. 
Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved. 
Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved. 
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved. 
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved. 
+0

完美!只是把類型提示總時間減少到7秒! ;) – luisgabriel 2013-04-28 04:53:34

+0

不錯! 8-)很高興幫助! – noahlz 2013-04-28 04:57:34

+0

我不能使用'clojure.core /頻率',因爲我需要單詞位置來進一步階段,如索引和查詢。 – luisgabriel 2013-04-28 04:59:09

0

受益只是爲了比較的緣故,這裏是一個基於正則表達式Clojure的版本

(defn re-index 
    "Returns lazy sequence of vectors of regexp matches and their start index" 
    [^java.util.regex.Pattern re s] 
    (let [m (re-matcher re s)] 
     ((fn step [] 
      (when (. m (find)) 
      (cons (vector (re-groups m)(.start m)) (lazy-seq (step)))))))) 

(defn group-by-keep 
    "Returns a map of the elements of coll keyed by the result of 
    f on each element. The value at each key will be a vector of the 
    results of r on the corresponding elements." 
    [f r coll] 
    (persistent! 
    (reduce 
     (fn [ret x] 
     (let [k (f x)] 
      (assoc! ret k (conj (get ret k []) (r x))))) 
     (transient {}) coll))) 

(defn word-indexed 
    [s] 
    (group-by-keep 
    (comp clojure.string/lower-case first) 
    second 
    (re-index #"\w+" s)))