2011-07-19

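I would appreciate suggestions/insights on how to use Clojure to parse and compare two files efficiently. There are two (log) files containing employee attendance; from these files I need to determine all the days on which two employees worked at the same time, in the same department. Below are examples of the log files. How do I parse and compare the files?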

Note: each file has a different number of entries.

First file:

Employee Id  Name   Time In   Time Out   Dept. 
mce0518   Jon  2011-01-01 06:00 2011-01-01 14:00  ER 
mce0518   Jon  2011-01-02 06:00 2011-01-01 14:00  ER 
mce0518   Jon  2011-01-04 06:00 2011-01-01 13:00  ICU 
mce0518   Jon  2011-01-05 06:00 2011-01-01 13:00  ICU 
mce0518   Jon  2011-01-05 17:00 2011-01-01 23:00  ER 

Second file:

Employee Id  Name   Time In   Time Out   Dept. 
pdm1705   Jane  2011-01-01 06:00 2011-01-01 14:00  ER 
pdm1705   Jane  2011-01-02 06:00 2011-01-01 14:00  ER 
pdm1705   Jane  2011-01-05 06:00 2011-01-01 13:00  ER 
pdm1705   Jane  2011-01-05 17:00 2011-01-01 23:00  ER 

Answers


If you're not going to do this on a regular basis,

 

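(defn data-seq
  "Reads file f, skips the header line, and returns each row as a seq of fields."
  [f]
  (with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
    (let [lines (rest (line-seq rdr))]
      ;; doall forces the lazy seq before with-open closes the reader
      (doall (map #(seq (.split % "\\s+")) lines)))))

(defn same-time?
  "True when two rows agree on every field after the id and name columns,
  that is, on time in, time out and department."
  [a b]
  (= (drop 2 a) (drop 2 b)))

(let [f1 (data-seq "f1.txt")
      f2 (data-seq "f2.txt")]
  ;; for each row of f1, collect the ids of the f2 rows with the same shift and department
  (reduce (fn [h v]
            (let [matches (filter #(same-time? v %) f2)]
              (if (empty? matches)
                h
                (conj h [(first v) (map first matches)]))))
          []
          f1))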
 

will get you:

[["mce0518" ("pdm1705")] ["mce0518" ("pdm1705")] ["mce0518" ("pdm1705")]] 

The function name same-time? is a bit deceptive, since it checks both the time and the department.


Yes, I was going to handle the comparison in two calls, but then figured out that drop 2 followed by = would do.
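Picking up the naming point from the comments, here is a minimal sketch of how the check could be split into two explicitly named predicates. The names same-shift?, same-dept? and same-shift-and-dept? are hypothetical and not part of the answer, and rows are assumed to have the seven-field shape that splitting a log line on whitespace produces.

```clojure
;; Hypothetical helper names. Each row is assumed to be a whitespace-split seq:
;; [id name in-date in-time out-date out-time dept]
(defn same-shift?
  "True when both rows have identical time-in and time-out fields."
  [a b]
  (= (take 4 (drop 2 a)) (take 4 (drop 2 b))))

(defn same-dept?
  "True when both rows end with the same department."
  [a b]
  (= (last a) (last b)))

(defn same-shift-and-dept?
  "True when both the shift and the department match."
  [a b]
  (and (same-shift? a b) (same-dept? a b)))

;; Two rows taken from the sample logs in the question
(def jon  ["mce0518" "Jon"  "2011-01-05" "06:00" "2011-01-01" "13:00" "ICU"])
(def jane ["pdm1705" "Jane" "2011-01-05" "06:00" "2011-01-01" "13:00" "ER"])

(same-shift? jon jane)          ; => true
(same-dept? jon jane)           ; => false
(same-shift-and-dept? jon jane) ; => false
```

Keeping the two predicates separate makes it clear at the call site that a match requires both conditions, at the cost of walking each row twice.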


I've got a somewhat shorter and (IMHO) more readable version:

(use ; moar toolz - moar fun 
    '[clojure.contrib.duck-streams :only (reader)] 
    '[clojure.string :only (split)] 
    '[clojure.contrib.str-utils :only (str-join)] 
    '[clojure.set :only (intersection)]) 

(defn read-presence [filename] 
    (with-open [rdr (reader filename)] ; the file is always closed after use
    (apply hash-set ; collect the employee's rows into a hash-set
     (map #(str-join "--" (drop 2 (split % #" [ ]+"))) ; inside-out: split the row on runs of two or more spaces, drop the first two columns, join the rest with "--"
     (drop 1 ; omit the header line
      (line-seq rdr)))))) ; read the file line by line

(intersection (read-presence "a.in") (read-presence "b.in")) ; now it's simple! 
;result: #{"2011-01-01 06:00--2011-01-01 14:00--ER" "2011-01-02 06:00--2011-01-01 14:00--ER" "2011-01-05 17:00--2011-01-01 23:00--ER"} 

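This assumes a.in and b.in are your files. I've also assumed you will have one hash-set per employee; a (naive) generalization to N employees would need six more lines: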

(def employees ["greg.txt" "allison.txt" "robert.txt" "eric.txt" "james.txt" "lisa.txt"]) 
(for [a employees b employees :when (and 
             (= a (first (sort [a b]))) ; thou shall compare greg with james ONCE 
             (not (= a b)))] ; thou shall not compare greg with greg 
    (str-join " -- " ; well, it's not pretty... nor pink at least 
    [a b (intersection (read-presence a) (read-presence b))])) 
;result (apparently from a run with employees set to ["a.in" "b.in"]): ("a.in -- b.in -- #{\"2011-01-01 06:00--2011-01-01 14:00--ER\" \"2011-01-02 06:00--2011-01-01 14:00--ER\" \"2011-01-05 17:00--2011-01-01 23:00--ER\"}") 

Actually this loop is SOOO ugly, and it doesn't memoize intermediate results... to be improved.

- Edit -

I knew there had to be something elegant in core or contrib!

(use '[clojure.contrib.combinatorics :only (combinations)]) 

(def employees ["greg.txt" "allison.txt" "robert.txt" "eric.txt" "james.txt" "lisa.txt"]) 
(def employee-map (apply conj (for [e employees] {e (read-presence e)}))) 
(map (fn [[a b]] [a b (intersection (employee-map a) (employee-map b))]) 
    (combinations employees 2)) 
;result (apparently from a run with employees set to ["a.in" "b.in"]): (["a.in" "b.in" #{"2011-01-01 06:00--2011-01-01 14:00--ER" "2011-01-02 06:00--2011-01-01 14:00--ER" "2011-01-05 17:00--2011-01-01 23:00--ER"}]) 

Now it's memoized (the parsed data lives in employee-map), general and... lazy :D
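The clojure.contrib namespaces used in this answer belong to the old monolithic contrib library. For readers on a current Clojure, here is a rough sketch of the same idea using only namespaces that ship with the language (clojure.java.io, clojure.string, clojure.set). Several things in it are assumptions rather than part of the answer: the in-memory sample logs (written with two spaces between columns so that the answer's split regex applies), the added str/trim call, and the hypothetical pairs helper, which stands in for combinations so that no extra dependency is needed.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.set :as set]
         '[clojure.string :as str])

(defn read-presence
  "Returns the set of \"time-in--time-out--dept\" keys found in source,
  skipping the header line. source is anything io/reader accepts."
  [source]
  (with-open [rdr (io/reader source)]
    (->> (line-seq rdr)
         (drop 1)                                  ; skip the header line
         (map #(str/join "--" (drop 2 (str/split (str/trim %) #" [ ]+"))))
         set)))                                    ; set is eager, so the reader is still open

(defn pairs
  "Hypothetical helper: each unordered pair of distinct items, exactly once."
  [coll]
  (let [v (vec coll)]
    (for [i (range (count v))
          j (range (inc i) (count v))]
      [(v i) (v j)])))

;; Assumed sample data: columns separated by two spaces
(def jon-log
  (str "Employee Id  Name  Time In  Time Out  Dept.\n"
       "mce0518  Jon  2011-01-01 06:00  2011-01-01 14:00  ER\n"
       "mce0518  Jon  2011-01-05 06:00  2011-01-01 13:00  ICU\n"))

(def jane-log
  (str "Employee Id  Name  Time In  Time Out  Dept.\n"
       "pdm1705  Jane  2011-01-01 06:00  2011-01-01 14:00  ER\n"
       "pdm1705  Jane  2011-01-05 06:00  2011-01-01 13:00  ER\n"))

;; Parse each log once, then reuse the parsed sets for every pair
(def presence
  {"jon"  (read-presence (java.io.StringReader. jon-log))
   "jane" (read-presence (java.io.StringReader. jane-log))})

(def overlaps
  (for [[a b] (pairs (sort (keys presence)))]
    [a b (set/intersection (presence a) (presence b))]))
;; overlaps => (["jane" "jon" #{"2011-01-01 06:00--2011-01-01 14:00--ER"}])
```

With real files, passing a file name string to read-presence should work the same way, since io/reader also accepts file names.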