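Using Text.Regex.PCRE drops characters when parsing a web page title
I recently built a website that needs to retrieve talk titles from the TED website.
So far, the problem is specific to this one talk: Francis Collins: We need better drugs -- now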
From the page source, I get:
<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>
Now, in ghci, I tried this:
λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]
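Either way, the parsed title is missing some characters on the left and picks up some unexpected characters on the right: the whole match appears shifted to the right, by 3 characters for <title> and by 6 for the altHeadline span. I first suspected this was related to the -- in the talk title. However, the same pattern works on a plain string literal: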
λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Fortunately, Text.Regex.Posix does not have this problem:
λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
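Since the literal string matches correctly and only the downloaded body does not, one thing I could check is whether something earlier in the page is throwing the offsets off. The sketch below is only a diagnostic under an unconfirmed assumption: that the shift comes from characters outside the ASCII range appearing before the match (for example if offsets were computed on an encoded byte form and then applied to the String). The helper names indexOf and nonAsciiBefore are hypothetical, and the code uses only base.

```haskell
import Data.List (isPrefixOf, tails)

-- | Index (counted in Chars) of the first occurrence of needle in haystack.
indexOf :: String -> String -> Maybe Int
indexOf needle haystack =
  case [i | (i, t) <- zip [0 ..] (tails haystack), needle `isPrefixOf` t] of
    (i : _) -> Just i
    []      -> Nothing

-- | Number of Chars above the ASCII range that occur before needle.
-- Hypothetical diagnostic: it tests an assumption, it does not prove the cause.
nonAsciiBefore :: String -> String -> Maybe Int
nonAsciiBefore needle haystack = do
  i <- indexOf needle haystack
  pure (length (filter (> '\x7f') (take i haystack)))
```

With these definitions loaded into the same ghci session (say from a hypothetical file Shift.hs), the check would be:

λ> nonAsciiBefore "<title>" body
λ> nonAsciiBefore "<span id=\"altHeadline\"" body

If the counts it returns match the size of the shift seen in the PCRE output above, that would point at non-ASCII content in the page rather than at the -- in the title.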
Change .+ to .+? – 2013-03-27 11:56:25
@BenHanson Same result. – rnons 2013-03-27 12:02:20