2016-05-22 97 views
0

我正在從函數將javascript編碼的unicode字符串從服務器api轉換爲utf8。Haskell快速文本處理

我有2種方法。第一個不夠一般,第二個太慢。 我該如何做到這一點快速和所有的Unicode符號?

首先是剛剛更換使用地圖

ununicode :: BL.ByteString -> BL.ByteString    
ununicode s = LE.encodeUtf8 $ replace $ LE.decodeUtf8 s where 
    replace :: L.Text -> L.Text 
    replace "" = "" 
    replace string = case Map.lookup (L.take 6 string) table of 
      (Just x) -> L.append x (replace $ L.drop 6 string) 
      Nothing -> L.cons (L.head string) (replace $ L.tail string) 

    table = Map.fromList $ zip letters rus 

    rus = ["Ё", "ё", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М", 
     "Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы", 
     "Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к", 
     "л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ", 
     "ъ", "ы", "ь", "э", "ю", "я", "—"] :: [L.Text] 

    letters = ["\\u0401", "\\u0451", "\\u0410", "\\u0411", "\\u0412", "\\u0413", 
      "\\u0414", "\\u0415", "\\u0416", "\\u0417", "\\u0418", "\\u0419", 
      "\\u041a", "\\u041b", "\\u041c", "\\u041d", "\\u041e", "\\u041f", 
      "\\u0420", "\\u0421", "\\u0422", "\\u0423", "\\u0424", "\\u0425", 
      "\\u0426", "\\u0427", "\\u0428", "\\u0429", "\\u042a", "\\u042b", 
      "\\u042c", "\\u042d", "\\u042e", "\\u042f", "\\u0430", "\\u0431", 
      "\\u0432", "\\u0433", "\\u0434", "\\u0435", "\\u0436", "\\u0437", 
      "\\u0438", "\\u0439", "\\u043a", "\\u043b", "\\u043c", "\\u043d", 
      "\\u043e", "\\u043f", "\\u0440", "\\u0441", "\\u0442", "\\u0443", 
      "\\u0444", "\\u0445", "\\u0446", "\\u0447", "\\u0448", "\\u0449", 
      "\\u044a", "\\u044b", "\\u044c", "\\u044d", "\\u044e", "\\u044f", 
      "\\u2014"] :: [L.Text] 

一些子而在第二個我用與foldl功能有限自動機。 (我想用正則表達式,但沒有發現任何的lib支持功能,而不是在子串像蟒蛇https://docs.python.org/3/library/re.html#re.sub

ununicode :: BL.ByteString -> BL.ByteString    
ununicode s = LE.encodeUtf8 $ parts $ LE.decodeUtf8 s where 

    parts :: L.Text -> L.Text 
    parts = fst . parts' where 
     lst (_, _, x) = x 
     snd (_, x, _) = x 
     fst (x, _, _) = x 
     parts' :: L.Text -> (L.Text, Integer, L.Text) 
     parts' = L.foldl f ("", 0, "") where 
      f :: (L.Text, Integer, L.Text) -> Char -> (L.Text, Integer, L.Text) 
      f p n | snd p == 0 = case n of 
        ('\\') -> (fst p, 2, lst p) 
        (x) -> (L.singleton n, 1, lst p) 
       | snd p == 1 = case n of 
        ('\\') -> (fst p, 2, lst p) 
        (x) -> ((fst p) `L.snoc` n, 1, lst p) 
       | snd p == 2 = case n of 
        ('u') -> (fst p, 3, lst p) 
        x  -> ((L.snoc (L.snoc (fst p) 
               '\\') 
             n), 
           1, 
           lst p) 
       | snd p == 3 = proc p n 
      proc :: (L.Text, Integer, L.Text) -> Char -> (L.Text, Integer, L.Text) 
      proc (text, 3, buff) n | isHexDigit n   = (text, 3, buff `L.snoc` n) 
           | (len > 3) && (len < 6) = (L.append text 
                     (replacedChoice buff n), 
                  if n == '\\' then 2 else 1, 
                  L.empty) 
           | otherwise    = (L.append text 
                     (L.append "\\u" 
                        (choice buff n)), 
                  if n == '\\' then 2 else 1, 
                  L.empty) where 
            len = L.length buff 
            choice b n = if n == '\\' then b else L.snoc b n 
            replacedChoice b n = if n == '\\' 
                 then repl b 
                 else L.snoc (repl b) n 

    repl :: L.Text -> L.Text 
    repl "" = "" 
    repl s = (\v -> case v of 
     (Right x) -> L.singleton $ (\t -> toEnum t :: Char) $ fst x 
     (Left x) -> error $ "impossible" ++ (show x)) (hexadecimal s) 

我只是跑我的測試(全部是非常快的,除了這一點),看時間。 喜歡的東西

it "converts url encoded string" $ do 
     (ununicode $ BL.pack $ concat $ replicate 1000 ("error:\\u041d\\u0435\\u0432\\u0435\\u0440\\u043d\\u044b\\u0439\\u100cc" :: String)) 
     `shouldBe` 
     ("error:\208\157\208\181\208\178\208\181\209\128\208\189\209\139\208\185") 

它需要0.5秒用於第一執行和從5至15秒。這太多了。

如何使我的第二個算法與第一個一樣快(或者如果可能,甚至更快)?

回答

2

Haskell中的快速字符串處理有些黑色藝術。

您是否試圖與aeson軟件包中的jstring函數做同樣的事情?

https://hackage.haskell.org/package/aeson-0.11.2.0/docs/Data-Aeson-Parser.html#v:jstring

即使它不完全是一回事,你可以通過查看其執行得到一些想法。它使用attoparsec下面有嚴格和懶惰的版本。

+0

謝謝。我得到這個編碼的字符串使用aeson作爲懶惰ByteString,但我錯過了這個功能。我現在看了看源代碼,看起來像做我需要的東西。 – Aprectu