Parsing tags with TagSoup in Haskell

I've been trying to learn how to extract data from HTML files in Haskell, and I've hit a wall. I have no real Haskell experience at all; my previous knowledge comes from Python (with BeautifulSoup for HTML parsing).

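I'm using TagSoup to look through my HTML (it seems to be the recommended choice) and have a basic idea of how it works. Here is the basic segment of my code that is giving me trouble (self-contained, and it prints its output for testing):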

import System.IO 
import Network.HTTP 
import Text.HTML.TagSoup 
import Data.List 

main :: IO() 
main = do 
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody 
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http) 
    done tags where 
     done xs = case xs of 
      [] -> putStrLn $ "\n" 
      _ -> do 
       putStrLn $ show $ head xs 
       done (tail xs)

However, I'm not trying to stop at just any "div" tag. I want to drop everything before tags of the following form:

TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")] 
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]

I tried writing:

let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox(spanCol[0-9]?)+(lastCol)?")]) (parseTags http)

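But that then tries to find the literal text [0-9]+. I haven't figured out a way to do this with the Text.Regex.Posix module, and escaping the characters doesn't work. What is the solution here?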


2013-03-16 simonsays

~== doesn't do regular expressions, so you have to write a matcher yourself, something along the lines of:

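import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (mkRegex, matchRegex) -- from the regex-compat package

-- True for an opening <div> whose id attribute contains "scores-<digits>".
-- && short-circuits, so fromAttrib is only applied to opening div tags.
goodTag :: Tag String -> Bool
goodTag tag = tag ~== (TagOpen "div" [] :: Tag String)
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Just a wrapper around Text.Regex.matchRegex
matches :: String -> String -> Bool
matches string regex = isJust $ mkRegex regex `matchRegex` string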
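
To connect this back to the question's dropWhile, here is a minimal sketch of how such a predicate might be used. It runs on a small made-up HTML string instead of the live page; sampleHtml, isScoreBox and fromFirstScoreBox are hypothetical names introduced only for illustration, and the anchored pattern "^scores-[0-9]+$" is an assumption about what the ids look like, based on the two tags shown in the question.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (mkRegex, matchRegex) -- regex-compat package

-- Hypothetical sample input standing in for the downloaded page.
sampleHtml :: String
sampleHtml = concat
    [ "<div id=\"header\">nav</div>"
    , "<div id=\"scores-1997830\" class=\"scoreBox spanCol2\">game 1</div>"
    , "<div id=\"scores-1997831\" class=\"scoreBox spanCol2 lastCol\">game 2</div>"
    ]

-- True for an opening <div> whose id is exactly "scores-" followed by digits.
-- && short-circuits, so fromAttrib is only applied to opening div tags.
isScoreBox :: Tag String -> Bool
isScoreBox tag = isTagOpenName "div" tag
    && isJust (matchRegex (mkRegex "^scores-[0-9]+$") (fromAttrib "id" tag))

-- Drop every tag that precedes the first score box.
fromFirstScoreBox :: [Tag String] -> [Tag String]
fromFirstScoreBox = dropWhile (not . isScoreBox)

main :: IO ()
main = mapM_ print (fromFirstScoreBox (parseTags sampleHtml))
```

In the question's program, the same idea would presumably amount to replacing the (~/= TagOpen "div" []) argument of dropWhile with (not . isScoreBox).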


2013-03-17 00:26:12 Koterpillar

How about the line 'fromAttrib "id" tag =~ "scores-[0-9]+"'? – 2013-03-17 15:28:15
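
The =~ operator in this suggestion comes from Text.Regex.Posix (the regex-posix package). A rough sketch of that variant follows; the name isScoreBox, the inline HTML, and both patterns are assumptions made for illustration. Because the result of =~ is polymorphic, it has to be used somewhere that fixes its type to Bool, which the surrounding && does here.

```haskell
import Text.HTML.TagSoup
import Text.Regex.Posix ((=~)) -- regex-posix package

-- True for an opening <div> whose id is "scores-<digits>" and whose class
-- attribute starts with "scoreBox". Both patterns are illustrative guesses
-- based on the two example tags in the question.
isScoreBox :: Tag String -> Bool
isScoreBox tag = isTagOpenName "div" tag
    && (fromAttrib "id" tag =~ "^scores-[0-9]+$")
    && (fromAttrib "class" tag =~ "^scoreBox")

main :: IO ()
main = do
    -- Made-up input: one unrelated div followed by one score box.
    let tags = parseTags "<div id=\"ad\"></div><div id=\"scores-1\" class=\"scoreBox spanCol2\">x</div>"
    mapM_ print (dropWhile (not . isScoreBox) tags)
```

Compared with the wrapper around matchRegex, this saves the helper function at the cost of relying on type inference to pick the Bool result of =~.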

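Thanks, guys! Both of these work. I'm not sure which one is "better", but since I want to write out as much of the code as possible (for learning purposes, don't worry), I'll go with Koterpillar's approach for now. Thanks a bunch! – simonsays 2013-03-17 18:36:23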
