Parsing HTML and building a map from the parsed values with Clojure

I am using Enlive (Clojure) to parse HTML. My parser looks like this:

(def each-rows
  (for [tr crawl-page
        :let [row (html/select tr [:td (attr= :class "bl_12")])]
        :when (seq row)]
    row))

The extracted result looks like this:

{:tag :a,
 :attrs {:class "bl_12", :href "url1"},
 :content ("Chapter 1")}
{:tag :a,
 :attrs {:class "bl_12", :href "url2"},
 :content ("Chapter 2")}
{:tag :a,
 :attrs {:class "bl_12", :href "url3"},
 :content ("Chapter 3")}

Now my goal is to get a map like this:

{:Chapter_1 "url1"
 :Chapter_2 "url2"
 :Chapter_3 "url3"}

I managed to write functions that extract only the href or only the content, but I cannot combine them into a map.

(defn read-specific-other [x]
  (map (comp second :attrs) x))

Output: [:href "url1"]

(defn read-specific-content [x]
  (map (comp first :content) x))

(map read-specific-content each-rows)

Output:

(("Chapter 1" 
"Chapter 2" 
"Chapter 3" 
))

How can I get the desired result?

Source

2015-12-27 Abhishek Choudhary

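Hi, I am considering using Clojure for parsing XML. Did you choose Clojure because (a) it is more efficient, or (b) it is simply the language you happen to be using?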

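I am working on crawling and data processing, and Clojure is very powerful and natural for this kind of work. Pipelines built with the threading macros are very efficient, and I chose Clojure for its natural strength in data processing.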

Take a look at zipmap:

(zipmap (read-specific-other each-rows) (read-specific-content each-rows))

If you really want the keys to be keywords, use the keyword function; but I would suggest keeping strings as keys.
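As a rough illustration of the zipmap idea with keyword keys, here is a minimal sketch. It assumes the extracted nodes are available as a flat sequence of anchor maps; the names `anchors`, `title->keyword`, and `chapter-map` are hypothetical and not from the question. If the rows are nested (a sequence of rows, each holding several anchors), they would need to be flattened first, for example with `(apply concat each-rows)`.

```clojure
(require '[clojure.string :as str])

;; Sample data mirroring the extracted output in the question.
;; Assumption: a flat sequence of anchor nodes (`anchors` is a made-up name).
(def anchors
  [{:tag :a, :attrs {:class "bl_12", :href "url1"}, :content ["Chapter 1"]}
   {:tag :a, :attrs {:class "bl_12", :href "url2"}, :content ["Chapter 2"]}
   {:tag :a, :attrs {:class "bl_12", :href "url3"}, :content ["Chapter 3"]}])

(defn title->keyword
  "Turns \"Chapter 1\" into :Chapter_1 by replacing spaces with underscores."
  [title]
  (keyword (str/replace title " " "_")))

;; Keys come from the first :content item, values from the :href attribute.
(def chapter-map
  (zipmap (map (comp title->keyword first :content) anchors)
          (map (comp :href :attrs) anchors)))
```

Looking up `:href` by name, rather than taking the `second` entry of `:attrs`, avoids depending on the order of entries in the attribute map.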

You could also consider using the into/for pattern instead:

(into {}
      (for [[{:keys [attrs]} {:keys [content]}] rows]
        [content attrs]))
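For a concrete picture of the into/for pattern, the sketch below runs it on sample data shaped like the question's output. It assumes `rows` is a sequence of rows, each row being a sequence of anchor nodes (the nested output of the content extractor in the question suggests this shape, but that is an inference); `rows` and `chapters` are hypothetical names. String keys are kept, as suggested above.

```clojure
;; Sample data: one row containing three anchor nodes (assumed shape).
(def rows
  [[{:tag :a, :attrs {:class "bl_12", :href "url1"}, :content ["Chapter 1"]}
    {:tag :a, :attrs {:class "bl_12", :href "url2"}, :content ["Chapter 2"]}
    {:tag :a, :attrs {:class "bl_12", :href "url3"}, :content ["Chapter 3"]}]])

;; The outer binding walks the rows, the inner binding destructures each node.
;; Each iteration yields a [key value] pair that `into` adds to the map.
(def chapters
  (into {}
        (for [row rows
              {:keys [attrs content]} row]
          [(first content) (:href attrs)])))
```

With this data, `(get chapters "Chapter 2")` looks up the URL for the second chapter.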

Source

2015-12-28 03:20:11

zipmap did the magic, thanks.
