Text.Regex.PCRE drops characters when parsing a web page title

I recently built a website that needs to retrieve talk titles from the TED website.

So far, the problem only shows up with this talk: Francis Collins: We need better drugs -- now

In the page source, I see:

<title>Francis Collins: We need better drugs -- now | Video on TED.com</title> 
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>

In ghci, I tried this:

λ> :m +Network.HTTP Text.Regex.PCRE 
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html" 
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody 
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]] 
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]] 
λ> body =~ "<title>(.+)</title>" :: [[String]] 
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]

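Either way, the captured title is missing some characters on the left and picks up unexpected characters on the right. This seems to be related to the -- in the talk title. However, the same pattern works on a string literal: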

λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>" 
λ> body' =~ "<title>(.+)</title>" :: [[String]] 
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]

Fortunately, Text.Regex.Posix does not have this problem:

λ> import qualified Text.Regex.Posix as P 
λ> body P.=~ "<title>(.+)</title>" :: [[String]] 
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
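The thread does not state the cause. One hypothesis worth checking, and it is only an assumption, is that the matcher reports byte offsets while the String is sliced by character offsets. That would be consistent with the match working on the ASCII-only literal and being shifted to the right on the downloaded page. The sketch below uses only base; `utf8Len` and `byteDrift` are hypothetical helper names that estimate how far a UTF-8 byte offset would run ahead of the character offset for a given prefix.

```haskell
import Data.Char (ord)

-- Hypothetical helper: number of bytes a character occupies in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Hypothetical helper: how many positions a UTF-8 byte offset would be
-- ahead of the character offset after the given prefix.
byteDrift :: String -> Int
byteDrift prefix = sum (map utf8Len prefix) - length prefix

main :: IO ()
main = do
  -- Made-up prefixes for illustration; neither is taken from the TED page.
  let asciiPrefix = "<head>\n"
      mixedPrefix = "<meta content=\"caf\233\" />\n"  -- contains U+00E9
  print (byteDrift asciiPrefix)
  print (byteDrift mixedPrefix)
```

In ghci, applying `byteDrift` to the part of `body` that precedes the expected match and comparing the result with the observed shift would support or rule out this explanation.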


2013-03-27 rnons

Change .+ to .+? – 2013-03-27 11:56:25

@BenHanson Same result. – rnons 2013-03-27 12:02:20

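My advice: don't use regular expressions to parse HTML. Use a proper HTML parser instead. Here is an example using the html-conduit parser and the xml-conduit cursor library (plus http-conduit for the download).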

{-# LANGUAGE OverloadedStrings #-}
import           Data.Monoid          (mconcat)
import           Network.HTTP.Conduit (simpleHttp)
import           Text.HTML.DOM        (parseLBS)
import           Text.XML.Cursor      (attributeIs, content, element,
                                       fromDocument, ($//), (&//), (>=>))

main :: IO ()
main = do
    -- Download the page as a lazy ByteString
    lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    -- Parse the HTML and place a cursor at the document root
    let doc    = parseLBS lbs
        cursor = fromDocument doc
    -- Text content of the <title> element
    print $ mconcat $ cursor $// element "title" &// content
    -- Text content of <span id="altHeadline">
    print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content

This code is also available as active code on the School of Haskell.
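To try the cursor queries without a network connection, the same selectors can be run against an inline snippet. This is a minimal sketch, assuming the html-conduit, xml-conduit, text, and bytestring packages are installed. `pageTitle`, `altHeadline`, and `sample` are hypothetical names, and `sample` is a hand-written stand-in built from the two lines quoted in the question, not the real page.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as L
import           Data.Text            (Text)
import qualified Data.Text            as T
import           Text.HTML.DOM        (parseLBS)
import           Text.XML.Cursor      (Cursor, attributeIs, content, element,
                                       fromDocument, ($//), (&//), (>=>))

-- Concatenated text of every <title> element below the cursor.
pageTitle :: Cursor -> Text
pageTitle cursor = T.concat $ cursor $// element "title" &// content

-- Concatenated text of every <span id="altHeadline"> below the cursor.
altHeadline :: Cursor -> Text
altHeadline cursor =
    T.concat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content

-- Hand-written sample document (an assumption, not the actual TED markup).
sample :: L.ByteString
sample =
    "<html><head><title>Francis Collins: We need better drugs -- now | Video on TED.com</title></head>\
    \<body><span id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span></body></html>"

main :: IO ()
main = do
    let cursor = fromDocument (parseLBS sample)
    print (pageTitle cursor)
    print (altHeadline cursor)
```

Keeping the selectors as pure functions over a `Cursor` separates the download step from the extraction step, so the extraction can be checked on small fixtures like `sample`.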


2013-03-27 12:25:11

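Thanks for the suggestion. I know that in Haskell there are always several ways to solve the same problem, but as a beginner I am happy just to have working code. I will definitely take your advice when I refactor. – rnons 2013-03-27 13:10:27

This advice isn't specific to Haskell. Using regular expressions to parse HTML/XML is generally not a good idea. See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – 2013-03-27 14:09:22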
