Haskell: convert Unicode escape sequences to UTF-8

Posted 2019-09-10 08:07

Question:

I am working on an HTTP client in Haskell (my first non-exercise project).

There is an API that returns JSON with all text written as Unicode escapes, something like

\u041e\u043d\u0430 \u043f\u0440\u0438\u0432\u0435\u0434\u0435\u0442 \u0432\u0430\u0441 \u0432 \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0441\u043f\u0438\u0441\u043e\u043a

I want to decode this JSON to UTF-8 so that I can print some data from the JSON message.

I searched for existing libraries but found nothing for this purpose.

So I wrote a function to convert the data (I am using lazy ByteStrings because that is the type I get from the wreq library):

ununicode :: BL.ByteString -> BL.ByteString 
ununicode s = replace s where

      replace :: BL.ByteString -> BL.ByteString
      replace str = case (Map.lookup (BL.take 6 str) table) of
                (Just x) -> BL.append x (replace $ BL.drop 6 str)
                (Nothing) -> BL.cons (BL.head str)  (replace $ BL.tail str)

      table = Map.fromList $ zip letters rus

      rus = ["Ё", "ё", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М",
             "Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы",
             "Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к",
             "л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ",
             "ъ", "ы", "ь", "э", "ю", "я"] 

      letters = ["\\u0401", "\\u0451", "\\u0410", "\\u0411", "\\u0412", "\\u0413", 
                 "\\u0414", "\\u0415", "\\u0416", "\\u0417", "\\u0418", "\\u0419",
                 "\\u041a", "\\u041b", "\\u041c", "\\u041d", "\\u041e", "\\u041f",
                 "\\u0420", "\\u0421", "\\u0422", "\\u0423", "\\u0424", "\\u0425",
                 "\\u0426", "\\u0427", "\\u0428", "\\u0429", "\\u042a", "\\u042b",
                 "\\u042c", "\\u042d", "\\u042e", "\\u042f", "\\u0430", "\\u0431",
                 "\\u0432", "\\u0433", "\\u0434", "\\u0435", "\\u0436", "\\u0437",
                 "\\u0438", "\\u0439", "\\u043a", "\\u043b", "\\u043c", "\\u043d",
                 "\\u043e", "\\u043f", "\\u0440", "\\u0441", "\\u0442", "\\u0443",
                 "\\u0444", "\\u0445", "\\u0446", "\\u0447", "\\u0448", "\\u0449",
                 "\\u044a", "\\u044b", "\\u044c", "\\u044d", "\\u044e", "\\u044f"]

But it doesn't work as I expected. It replaces the text, but instead of Cyrillic letters I get something like 345 ?C1;8:C5< 8=B5@2LN A @4=52=8:>2F0<8 8=B5@5A=KE ?@>D5AA89 8 E>118

The second problem is that I can't debug my function. When I call it directly with a custom string, I get the error Data.ByteString.Lazy.head: empty ByteString. I have no idea why it is empty.

It works fine during normal program execution:
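One plausible reading of that output: a ByteString built from a String literal keeps only the low 8 bits of each Char, so a Cyrillic letter such as г (U+0433) becomes the byte 0x33, which is the ASCII digit 3. The sketch below shows this with Data.ByteString.Char8.pack and contrasts it with encoding through Text. The names truncated and encoded are hypothetical, and the claim that the OverloadedStrings literals in the table behave the same way as pack is an assumption.

```haskell
import qualified Data.ByteString       as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text             as T
import qualified Data.Text.Encoding    as TE

-- Char8.pack keeps only the low byte of each Char:
-- U+0433, U+0434, U+0435 become 0x33, 0x34, 0x35.
truncated :: B.ByteString
truncated = BC.pack "где"

-- Going through Text and encodeUtf8 keeps the real code points,
-- producing two UTF-8 bytes per Cyrillic letter.
encoded :: B.ByteString
encoded = TE.encodeUtf8 (T.pack "где")
```

Here truncated holds the three ASCII bytes of "345", which matches the start of the garbled output, while encoded holds six bytes of valid UTF-8.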

umailGet env params = do
    r <- apiGet env (("method", "umail.get"):params)
    x <- return $ case r of
          (Right a) -> a
          (Left a)  -> ""
    return $ ununicode $ x

and then in Main

  r2 <- umailGet client []
  print $  r2

The last problem is that the API can return any Unicode symbol, so this solution is bad by design.

Of course, the function implementation seems bad too, so after solving the main problem I am going to rewrite it using foldr.

UPDATED: It seems I did not describe the problem clearly enough.

So I am sending a request via the wreq library and getting a JSON answer. For example
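On the head error: the recursion in replace has no case for an empty input, so once the whole string has been consumed it still calls BL.head on what is left. A minimal sketch of the same table-driven replacement with an explicit end case follows. The name replaceAll and the idea of passing the table as an argument are hypothetical, and the sketch does not address the problem that a fixed table cannot cover every symbol.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Map as Map

-- Replace every 6-byte key found in the table, copying other bytes unchanged.
replaceAll :: Map.Map BL.ByteString BL.ByteString -> BL.ByteString -> BL.ByteString
replaceAll table = go
  where
    go str = case BL.uncons str of
      -- End of input: stop instead of calling BL.head on an empty string.
      Nothing -> BL.empty
      Just (w, rest) -> case Map.lookup (BL.take 6 str) table of
        Just x  -> BL.append x (go (BL.drop 6 str))
        Nothing -> BL.cons w (go rest)
```

Using BL.uncons makes the empty case impossible to forget, because the pattern match has to handle Nothing.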

{"result":"12","error":"\u041d\u0435\u0432\u0435\u0440\u043d\u044b\u0439 \u0438\u0434\u0435\u043d\u0442\u0438\u0444\u0438\u043a\u0430\u0442\u043e\u0440 \u0441\u0435\u0441\u0441\u0438\u0438"}

That is not Haskell's representation of the result; these are real ASCII symbols. I get the same text using curl or Firefox: 190 bytes, 190 ASCII symbols.

Using this site, for example, http://unicode.online-toolz.com/tools/text-unicode-entities-convertor.php I can convert it to Cyrillic text: {"result":"12","error":"Неверный идентификатор сессии"}

I need to implement something like this service in Haskell (or find a package where it is already implemented), where a response like this has the type lazy ByteString.

I also tried changing the types to use Text instead of ByteString (both lazy and strict), and changed the first line to ununicode s = encodeUtf8 $ replace $ L.toStrict $ LE.decodeUtf8 s

With that new implementation I get an error when running my program: Data.Text.Internal.Fusion.Common.head: Empty stream. So it looks like I have an error in my replacing function; maybe fixing it will also fix the main problem.
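A rough sketch of such a conversion that does not depend on a fixed table: it parses the four hex digits after each \u and emits the corresponding character, then encodes the result as UTF-8. The helper names unescape and hexValue are hypothetical. The sketch assumes the input is valid UTF-8, does not combine surrogate pairs (characters outside the Basic Multilingual Plane), and does not treat an escaped backslash specially. A JSON parser such as aeson should handle these escapes as part of decoding string values, which is likely the more robust route.

```haskell
import           Data.Char (chr, digitToInt, isHexDigit)
import qualified Data.ByteString.Lazy    as BL
import qualified Data.Text.Lazy          as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Replace every \uXXXX escape with the character it names.
unescape :: String -> String
unescape ('\\' : 'u' : a : b : c : d : rest)
  | all isHexDigit [a, b, c, d] = chr (hexValue [a, b, c, d]) : unescape rest
unescape (x : rest) = x : unescape rest
unescape []         = []

-- Interpret a string of hex digits as a number.
hexValue :: String -> Int
hexValue = foldl (\acc h -> acc * 16 + digitToInt h) 0

-- Decode the bytes as UTF-8, resolve the escapes, and encode back to UTF-8.
ununicode :: BL.ByteString -> BL.ByteString
ununicode = TLE.encodeUtf8 . TL.pack . unescape . TL.unpack . TLE.decodeUtf8
```

Going through String is simple but not fast; for large responses a Text-based or parser-based version would be a better fit.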

Answer 1:

I am not sure whether you are falling into the "print unicode" trap (see here). For encoding and decoding, Hackage already has Data.Text.Encoding: decodeUtf8 :: ByteString -> Text and encodeUtf8 :: Text -> ByteString should do the job.
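As a small illustration of both points, the sketch below round-trips a Text value through encodeUtf8 and decodeUtf8, and compares show (which is what print uses) with Data.Text.IO.putStrLn. The names sample, utf8Bytes, roundTrip and demo are hypothetical.

```haskell
import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as E
import qualified Data.Text.IO       as TIO

sample :: T.Text
sample = T.pack "Неверный"

-- Text to UTF-8 bytes: two bytes per Cyrillic letter.
utf8Bytes :: B.ByteString
utf8Bytes = E.encodeUtf8 sample

-- UTF-8 bytes back to Text; throws an exception on invalid UTF-8.
roundTrip :: T.Text
roundTrip = E.decodeUtf8 utf8Bytes

demo :: IO ()
demo = do
  TIO.putStrLn roundTrip  -- writes the characters using the handle's encoding
  print roundTrip         -- uses show, which escapes non-ASCII as \1053 and so on
```

The escaped form produced by print is only a display format; the underlying Text value is the same in both cases.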

Edit:

I have played around with text/bytestring for some time to reproduce your "\u1234" characters, but I couldn't:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import           Data.Text (Text)
import qualified Data.Text.Encoding as E
import qualified Data.Text.IO as T
import           Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B


inputB :: ByteString
inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"

inputT :: Text
inputT = "ДЕЖЗИЙКЛМНОПРСТУФ"


main :: IO ()
main = do putStr "T.putStrLn inputT: "                ; T.putStrLn inputT
          putStr "B.putStrLn inputB: "                ; B.putStrLn inputB
          putStr "print inputB: "                     ; print inputB
          putStr "print inputT: "                     ; print inputT
          putStr "B.putStrLn $ E.encodeUtf8 inputT: " ; B.putStrLn $ E.encodeUtf8 inputT
          putStr "T.putStrLn $ E.decodeUtf8 inputB: " ; T.putStrLn $ E.decodeUtf8 inputB
          putStr "print $ E.decodeUtf8 inputB: "      ; print $ E.decodeUtf8 inputB
          putStr "print $ E.encodeUtf8 inputT: "      ; print $ E.encodeUtf8 inputT

Here is the result:

T.putStrLn inputT: ДЕЖЗИЙКЛМНОПРСТУФ
B.putStrLn inputB:
rint inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
print inputT: "\1044\1045\1046\1047\1048\1049\1050\1051\1052\1053\1054\1055\1056\1057\1058\1059\1060"
B.putStrLn $ E.encodeUtf8 inputT: ДЕЖЗИЙКЛМНОПРСТУФ
T.putStrLn $ E.decodeUtf8 inputB:
rint $ E.decodeUtf8 inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
print $ E.encodeUtf8 inputT: "\208\148\208\149\208\150\208\151\208\152\208\153\208\154\208\155\208\156\208\157\208\158\208\159\208\160\208\161\208\162\208\163\208\164"

Honestly, I don't know why I get the "rint" lines after the ByteString print lines that yield no result.
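A possible explanation, offered as an assumption rather than something confirmed here: the ByteString literal inputB keeps only the low byte of each Cyrillic character, so it consists mostly of control bytes, including ESC (27). Many terminals treat ESC as the start of a control sequence, which could swallow the characters that follow it, such as the newline and the leading "p" of the next line. The sketch below only lists the bytes; the names literalBytes and hasEscape are hypothetical, and it uses Data.ByteString.Char8.pack on the assumption that the OverloadedStrings literal truncates the same way.

```haskell
import qualified Data.ByteString       as B
import qualified Data.ByteString.Char8 as BC
import           Data.Word (Word8)

-- Low bytes of U+0414 .. U+0424, i.e. 0x14 .. 0x24.
literalBytes :: [Word8]
literalBytes = B.unpack (BC.pack "ДЕЖЗИЙКЛМНОПРСТУФ")

-- True when the truncated literal contains the ESC control byte (from U+041B).
hasEscape :: Bool
hasEscape = 27 `elem` literalBytes
```

This agrees with the print inputB line in the output, which shows \DC4 through \US followed by a space and a few punctuation characters.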