Unescaping HTML entities (including named ones)

This question is similar to the earlier Stack Overflow question "Remove html character entities in a string". However, the accepted answer there does not handle named HTML entities, e.g. &auml; for the character ä, so it cannot unescape all HTML.

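I have some legacy HTML that uses named HTML entities for non-ASCII characters, i.e. &ouml; instead of ö, &auml; instead of ä, and so on. A full list of all named HTML entities is available on Wikipedia.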

I would like to unescape these HTML entities into their character equivalents quickly and efficiently.

My Python 3 code for doing this uses a regular expression:

import re 
import html.entities 

s = re.sub(r'&(\w+?);', lambda m: chr(html.entities.name2codepoint[m.group(1)]), s)
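As written, the one-liner above raises KeyError on any name missing from html.entities.name2codepoint, and it skips numeric references such as &#228; because \w does not match '#'. A slightly more defensive variant might look like the sketch below. The function name unescape_entities and the choice to leave unknown references untouched are assumptions made for illustration, not part of the original code. On Python 3.4 and later, the standard library's html.unescape covers both named and numeric references and is probably the simpler choice.

```python
import re
import html.entities

# Matches numeric references (&#228; or &#xE4;) and named ones (&auml;).
_ENTITY = re.compile(r'&(#[xX]?[0-9a-fA-F]+|\w+);')

def unescape_entities(s: str) -> str:
    """Replace HTML entities in s; leave unknown or malformed ones as-is."""
    def replace(m):
        body = m.group(1)
        try:
            if body[0] == '#':
                if body[1] in 'xX':
                    return chr(int(body[2:], 16))   # hexadecimal reference
                return chr(int(body[1:]))           # decimal reference
            return chr(html.entities.name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: keep the original text.
            return m.group(0)
    return _ENTITY.sub(replace, s)
```

Called as unescape_entities(s), it can stand in for the re.sub call above when the input may contain unknown or numeric references.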

Regular expressions, however, do not seem to be very popular, fast, or easy to use in Haskell.

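Text.HTML.TagSoup.Entity (tagsoup) has a useful table and functions that map named entities to code points. Using this and the regex-tdfa package, I have put together a very slow Haskell equivalent: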

{-# LANGUAGE OverloadedStrings #-}
-- Qualified imports avoid clashes with Prelude names such as reverse.
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.ByteString.Lazy.UTF8 as UTF8
import Text.HTML.TagSoup.Entity (lookupEntity)
import Text.Regex.TDFA ((=~~))

unescapeEntites :: L.ByteString -> L.ByteString
unescapeEntites = regexReplaceBy "&#?[[:alnum:]]+;" $ lookupMatch
  where
    -- Strip the leading '&' and trailing ';' before looking the name up.
    lookupMatch m =
      case lookupEntity (L.unpack . L.tail . L.init $ m) of
        Nothing -> m
        Just x -> UTF8.fromString [x]

-- regex replace taken from http://mutelight.org/articles/generating-a-permalink-slug-in-haskell
regexReplaceBy :: L.ByteString -> (L.ByteString -> L.ByteString) -> L.ByteString -> L.ByteString
regexReplaceBy regex f text = go text []
  where
    go str res =
      if L.null str
        then L.concat . reverse $ res
        else
          case (str =~~ regex) :: Maybe (L.ByteString, L.ByteString, L.ByteString) of
            Nothing -> L.concat . reverse $ (str : res)
            Just (bef, match, aft) -> go aft (f match : bef : res)

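The unescapeEntites function runs several orders of magnitude slower than the Python version above. The Python code converts about 130 MB in 7 seconds, while my Haskell version runs for several minutes.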

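I am looking for a better solution, primarily in terms of speed. If possible, I would also like to avoid regular expressions (in Haskell, speed and avoiding regular expressions seem to go hand in hand anyway).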

2011-07-27 vicvicvic

It's not clear what your actual question is here. Are you looking for a better solution? Do you want help improving the current one? –

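Sorry if the question was unclear. Yes, I want a better solution, because mine 1. is too slow and 2. uses regular expressions, which do not seem idiomatic in Haskell (given how little information there is about them). My solution is mainly meant as a "this is what I have so far" starting point. I am open to radical ideas. – vicvicvic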

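How are you reading the file? If I make 'main = Data.ByteString.interact unescapeEntites' and do 'time cat big.txt | ./regex >>/dev/null', I get 30 seconds for a 143M big.txt (all the entities listed in TagSoup, with lots of 'a's interspersed). Still clunky from all the indirection, but not minutes. – applicative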

Here is my version. It uses String (rather than ByteString).

import Text.HTML.TagSoup.Entity (lookupEntity) 

unescapeEntities :: String -> String 
unescapeEntities [] = [] 
unescapeEntities ('&':xs) = 
    let (b, a) = break (== ';') xs in 
    case (lookupEntity b, a) of 
    (Just c, ';':as) -> c : unescapeEntities as  
    _    -> '&' : unescapeEntities xs 
unescapeEntities (x:xs) = x : unescapeEntities xs

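I imagine this is faster because it avoids expensive regular-expression operations. I have not tested it. If you need more speed, you can adapt it to ByteString or Data.Text.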
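The scan-and-look-up strategy does not depend on Haskell. For comparison with the Python baseline in the question, the same idea might be sketched in Python roughly as follows. The name unescape_scan is hypothetical, and unlike tagsoup's lookupEntity this sketch only resolves named entities through html.entities.name2codepoint.

```python
import html.entities

def unescape_scan(s: str) -> str:
    """Regex-free unescaping: find '&', read up to the next ';', look it up."""
    out = []
    pos = 0
    while True:
        amp = s.find('&', pos)
        if amp == -1:
            out.append(s[pos:])          # no more candidates: copy the rest
            break
        out.append(s[pos:amp])           # copy the text before the '&'
        semi = s.find(';', amp + 1)
        name = s[amp + 1:semi] if semi != -1 else None
        cp = html.entities.name2codepoint.get(name) if name is not None else None
        if cp is None:
            out.append('&')              # not an entity: keep '&', rescan after it
            pos = amp + 1
        else:
            out.append(chr(cp))          # known entity: emit its character
            pos = semi + 1
    return ''.join(out)
```

As in the Haskell version, a failed lookup emits the '&' unchanged and resumes scanning at the next character, so stray ampersands pass through untouched.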

2011-08-29 17:36:21

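You can install the web-encodings package, take the source of its decodeHtml function, and add the characters you need (this worked for me). This is all you need: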

import Data.Maybe 
import qualified Web.Encodings.StringLike as SL 
import Web.Encodings.StringLike (StringLike) 
import Data.Char (ord) 

-- | Decode HTML-encoded content into plain content. 
-- 
-- Note: this does not support all HTML entities available. It also swallows 
-- all failures. 
decodeHtml :: StringLike s => s -> s 
decodeHtml s = case SL.uncons s of 
    Nothing -> SL.empty 
    Just ('&', xs) -> fromMaybe ('&' `SL.cons` decodeHtml xs) $ do 
     (before, after) <- SL.breakCharMaybe ';' xs 
     c <- case SL.unpack before of -- this are small enough that unpack is ok 
      "lt" -> return '<' 
      "gt" -> return '>' 
      "amp" -> return '&' 
      "quot" -> return '"' 
      '#' : 'x' : hex -> readHexChar hex 
      '#' : 'X' : hex -> readHexChar hex 
      '#' : dec -> readDecChar dec 
      _ -> Nothing -- just to shut up a warning 
     return $ c `SL.cons` decodeHtml after 
    Just (x, xs) -> x `SL.cons` decodeHtml xs 

readHexChar :: String -> Maybe Char 
readHexChar s = helper 0 s where 
    helper i "" = return $ toEnum i 
    helper i (c:cs) = do 
     c' <- hexVal c 
     helper (i * 16 + c') cs 

hexVal :: Char -> Maybe Int 
hexVal c 
    | '0' <= c && c <= '9' = Just $ ord c - ord '0' 
    | 'A' <= c && c <= 'F' = Just $ ord c - ord 'A' + 10 
    | 'a' <= c && c <= 'f' = Just $ ord c - ord 'a' + 10 
    | otherwise = Nothing 

readDecChar :: String -> Maybe Char 
readDecChar s = do 
    case reads s of 
     (i, _):_ -> Just $ toEnum (i :: Int) 
     _ -> Nothing

I have not tested the performance, though. Still, it may serve as a good example if you also want to do without regular expressions.

2011-07-28 06:52:36 firefrorefiddle
