Frequency dictionary in Clojure

I want to write my own naive Bayes classifier. I have a file like this:

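(This is a database of spam and ham messages. The first word on each line is the label, spam or ham, and the text up to the end of the line is the message. Size: 0.5 MB, from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)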

ham  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 
ham  Ok lar... Joking wif u oni... 
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 
ham  U dun say so early hor... U c already then say... 
ham  Nah I don't think he goes to usf, he lives around here though 
spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

I want to build a hash map like this: {"spam" {"go" 1, "until" 100, ...}, "ham" {...}}, that is, a hash map in which each value is a frequency map of words (kept separately for ham and spam).

I know how to do this in Python or C++, and I have done it in Clojure, but my solution fails with a StackOverflowError on large data.

My solution:

(defn read_data_from_file [fname]
  (map #(split % #"\s")
       (map lower-case
            (with-open [rdr (reader fname)]
              (doall (line-seq rdr))))))

(defn do-to-map [amap keyseq f]
  (reduce #(assoc %1 %2 (f (%1 %2))) amap keyseq))

(defn dicts_from_data [raw_data]
  (let [data (group-by #(first %) raw_data)]
    (do-to-map
      data (keys data)
      (fn [x] (frequencies (reduce concat (map #(rest %) x)))))))

I tried to find where it fails and wrote this:

(def raw_data (read_data_from_file (first args))) 
(def d (group-by #(first %) raw_data)) 
(def f (map frequencies raw_data)) 
(def d1 (reduce concat (d "spam"))) 
(println (reduce concat (d "ham")))

Error:

Exception in thread "main" java.lang.RuntimeException: java.lang.StackOverflowError 
    at clojure.lang.Util.runtimeException(Util.java:165) 
    at clojure.lang.Compiler.eval(Compiler.java:6476) 
    at clojure.lang.Compiler.eval(Compiler.java:6455) 
    at clojure.lang.Compiler.eval(Compiler.java:6431) 
    at clojure.core$eval.invoke(core.clj:2795) 
    at clojure.main$eval_opt.invoke(main.clj:296) 
    at clojure.main$initialize.invoke(main.clj:315) 
.....

Can anyone help me do this better or more efficiently? P.S. Sorry for my mistakes in writing. English is not my native language.


2013-06-26 Dark_Daiver

Using apply instead of reduce in the anonymous function avoids the StackOverflowError. Instead of (fn [x] (frequencies (reduce concat (map #(rest %) x)))) use (fn [x] (frequencies (apply concat (map #(rest %) x)))).
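The reason the swap matters is that `(reduce concat ...)` wraps each new lazy sequence around the previous one, so realizing the result has to descend through one nested `concat` per line. `(apply concat ...)` produces a single flat lazy sequence instead. A minimal sketch with made-up data follows; the names `n`, `rows` and `flat-freqs` are illustrative only, and whether the `reduce` version actually overflows depends on the JVM stack size.

```clojure
;; Illustrative data: 5000 short rows shaped like the split lines above.
(def n 5000)
(def rows (repeat n ["ham" "ok" "lar"]))

;; Nested form: each step wraps the previous lazy seq in another concat,
;; so realizing it needs roughly one level of nesting per row.
;; Left commented out because it may throw StackOverflowError:
;; (frequencies (reduce concat (map rest rows)))

;; Flat form: a single concat over all rows, with no deep nesting.
(def flat-freqs (frequencies (apply concat (map rest rows))))

(println flat-freqs)
```

If laziness is not needed at all, `(into [] cat (map rest rows))` or `(mapcat rest rows)` are other ways to get the same flat sequence of words.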

Below is the same code, slightly refactored but with exactly the same logic. read-data-from-file has been changed to avoid mapping twice over the sequence of lines.

(use 'clojure.string) 
(use 'clojure.java.io) 

(defn read-data-from-file [fname]
  (let [lines (with-open [rdr (reader fname)]
                (doall (line-seq rdr)))]
    (map #(-> % lower-case (split #"\s")) lines)))

(defn do-to-map [m keyseq f]
  (reduce #(assoc %1 %2 (f (%1 %2))) m keyseq))

(defn process-words [x]
  (->> x
       (map #(rest %))
       (apply concat) ; This is the only real change from the
                      ; original code, it used to be (reduce concat).
       frequencies))

(defn dicts-from-data [raw_data]
  (let [data (group-by first raw_data)]
    (do-to-map data
               (keys data)
               process-words)))

(-> "SMSSpamCollection.txt" read-data-from-file dicts-from-data keys)


2013-06-26 19:49:37

Is (-> data f1 f2) equivalent to (f1 (f2 data))?

Actually it is equivalent to `(f2 (f1 data))`; the forms are applied from left to right. For more information, see [this](http://blog.fogus.me/2009/09/04/understanding-the-clojure-macro/) post by Fogus. You can also find some examples of the threading macros `->` and `->>` [here](http://clojuredocs.org/clojure_core/clojure.core/-%3E) and [here](http://clojuredocs.org/clojure_core/clojure.core/-%3E%3E).

I made a mistake in my first comment. Thanks!
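For readers following the comment thread, here is a small sketch of how the two threading macros place their argument. The inputs and the names `thread-first` and `thread-last` are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; -> inserts the value as the FIRST argument of each form, left to right.
;; This is the same as (str/lower-case (str/trim " Ham ")).
(def thread-first (-> " Ham " str/trim str/lower-case))

;; ->> inserts the value as the LAST argument of each form.
;; This is the same as (frequencies (map str/lower-case ["Go" "go" "Until"])).
(def thread-last (->> ["Go" "go" "Until"]
                      (map str/lower-case)
                      frequencies))

(println thread-first)
(println thread-last)
```

`->>` suits sequence functions such as `map` and `filter`, which take the collection last, which is why process-words in the answer uses it.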

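Another thing to consider is the use of (doall (line-seq ...)), which reads the entire list of lines into memory. If the list is very large, this can cause problems. A handy trick for accumulating data like this is to use reduce. In your case we need reduce twice: once over the lines, and then over the words of each line. Something like this: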

(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn parse-line
  [line]
  (str/split (str/lower-case line) #"\s+"))

(defn build-word-freq
  [file]
  (with-open [rdr (io/reader file)]
    ;; Outer reduce walks the lines; inner reduce walks the words of one line.
    (reduce (fn [accum line]
              (let [[spam-or-ham & words] (parse-line line)]
                (reduce #(update-in %1 [spam-or-ham %2] (fnil inc 0)) accum words)))
            {}
            (line-seq rdr))))
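To try the nested-reduce idea without a file on disk, the same accumulation can be run over any sequence of lines. The sketch below is a variant under that assumption: `word-freqs-from-lines`, `sample-lines` and `freqs` are hypothetical names that are not part of the answer, and the sample messages are shortened from the question.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: the same accumulation as build-word-freq,
;; but over an in-memory seq of lines instead of a file reader.
(defn word-freqs-from-lines
  [lines]
  (reduce (fn [accum line]
            (let [[label & words] (str/split (str/lower-case line) #"\s+")]
              ;; Inner reduce: bump the counter for each word under its label.
              (reduce #(update-in %1 [label %2] (fnil inc 0)) accum words)))
          {}
          lines))

;; Sample lines shortened from the question's data.
(def sample-lines
  ["ham Ok lar... Joking wif u oni..."
   "spam Free entry in 2 a wkly comp"
   "ham U dun say so early hor... U c already then say..."])

(def freqs (word-freqs-from-lines sample-lines))

(println (get-in freqs ["ham" "u"]))
(println (keys freqs))
```

Because the split is on whitespace only, punctuation stays attached to the words ("say" and "say..." are counted separately); stripping it would be a separate preprocessing step.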


2013-06-29 14:09:06 ray1729
