Script to convert ASCII chars to "<U00xx>" Unicode notation

Posted 2019-05-22 22:38

Question:

I'm doing some changes in Linux locale files /usr/share/i18n/locales (like pt_BR), and it's required that format strings (like %d-%m-%Y %H:%M) must be specified in Unicode, where each (in this case, ASCII) character is represented as <U00xx>.

So a text like this:

LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt   "%d-%m-%Y"
t_fmt   "%T"

Must be:

LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt   "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt   "<U0025><U0054>"

Thus I need a command-line script (be it bash, Python, Perl, or something else) that would take an input like %d-%m-%Y and convert it to <U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>.

All characters in the input string would be ASCII chars (from 0x20 to 0x7F), so this is actually a fancier "char-to-hex-string" conversion.

Could anyone please help me? My skills in bash scripting are very limited, and even worse in Python.

Bonus for elegant, explained solutions.

Thanks!

(by the way, this would be the "reverse" script for my previous question)

Answer 1:

Every char with file input

If you want to convert every character of a file to its <Uxxxx> representation, this one-liner will do it. Note that read returns an empty string at each newline, so line breaks come out as <U0000>:

while IFS= read -r -n1 c;do printf "<U%04X>" "'$c"; done < ./infile

Every char on STDIN

If you want to make a unix-like tool which converts input on STDIN to unicode-like output, then use this:

uni(){ c=$(cat); for((i=0;i<${#c};i++)); do printf "<U%04X>" "'${c:i:1}"; done; }

Proof of Concept

$ echo "abc" | uni
<U0061><U0062><U0063>

Only chars between double-quotes

#!/bin/bash

flag=0
while IFS= read -r -n1 c; do
    if [[ "$c" == '"' ]]; then
        ((flag^=1))
        printf "%c" "$c"
    elif [[ "$c" == $'\0' ]]; then
        echo
    elif ((flag)); then
        printf "<U%04X>" "'$c"
    else
        printf "%c" "$c"
    fi
done < /path/to/infile

Proof of Concept

$ cat ./unime
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt   "%d-%m-%Y"
t_fmt   "%T"
abday "Dom";"Seg";/
here is a string with "multiline
quotes";/

$ ./uni.sh
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt   "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt   "<U0025><U0054>"
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
here is a string with "<U006D><U0075><U006C><U0074><U0069><U006C><U0069><U006E><U0065>
<U0071><U0075><U006F><U0074><U0065><U0073>";/

Explanation

Pretty simple, really:

  1. while IFS= read -r -n1 c;: Iterate over the input one character at a time (via -n1) and store the char in the variable c. The IFS= and -r flags are there so that the read builtin doesn't try to do word splitting or interpret escape sequences, respectively.
  2. if [[ "$c" == '"' ]];: If the current char is a double-quote
  3. ((flag^=1)): Invert the value of flag from 0->1 or 1->0
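  4. elif [[ "$c" == $'\0' ]];: If c is empty, echo a newline. In bash, $'\0' expands to an empty string, and read -n1 leaves c empty when it reaches the newline delimiter, so this branch is what restores the line breaks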
  5. elif ((flag)): If flag is 1, then perform unicode transliteration
  6. printf "<U%04X>" "'$c": The magic that does the unicode transliteration. The single quote before $c is mandatory: it tells printf to use the numeric value (code point) of the character that follows, instead of trying to parse the argument itself as a number.
  7. else printf "%c" "$c": Print out the character with no unicode transliteration performed
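
The same "only between double quotes" idea can be sketched in Python with a regular expression in place of the flag. This is a rough equivalent under assumptions, not a port of the script above: the names to_u_notation and convert_quoted are made up for illustration, and an unterminated quote is left untouched here, whereas the shell loop would convert everything after it.

```python
import re

def to_u_notation(text):
    """Return text with every character written as <Uxxxx>."""
    # Newlines pass through unchanged, as in the shell script above
    return "".join(
        ch if ch == "\n" else "<U{:04X}>".format(ord(ch)) for ch in text
    )

def convert_quoted(source):
    """Convert only the characters between pairs of double quotes."""
    # [^"]* also matches newlines, so a quoted string may span lines
    return re.sub(
        r'"([^"]*)"',
        lambda m: '"' + to_u_notation(m.group(1)) + '"',
        source,
    )

sample = 'd_fmt   "%d-%m-%Y"\nt_fmt   "%T"\n'
print(convert_quoted(sample), end="")
```

For a whole locale file, convert_quoted could be applied to the file's contents read as one string, so that multi-line quoted strings are handled.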


Answer 2:

Using Python

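#!/usr/bin/env python3
import sys

# The string to convert is the first command-line argument
text = sys.argv[1]
# Format each character's code point as <Uxxxx> and concatenate the pieces
encoded = "".join("<U{0:04X}>".format(ord(char)) for char in text)
print(encoded)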

Usage:

$ python3 file.py "enter_input"
<U0065><U006E><U0074><U0065><U0072><U005F><U0069><U006E><U0070><U0075><U0074>

(The same script should work for both python 3.x and 2.x. Just change the version in shebang to the one you have.)

Explanation:

  1. We need to import the sys module to read the command-line arguments.

  2. The sys.argv list is the list of all command-line arguments. The entry [0] is the program name, entry [1] is the first argument, etc.

  3. f(char) for char in text is a generator expression. It will loop for each character in the text variable, then apply the function f on it, and finally collect the result as a lazy list (iterable).

  4. ord(char) finds the Unicode code-point of the character.

  5. "<U{0:04X}>".format(x) is the string formatting method, as the name suggests. The format string takes one input, x, and formats it with the 04X specification, meaning zero-padded, width 4, uppercase hexadecimal.

  6. "".join(it) concatenates all elements in the lazy list (iterable) it. The "" means the separator is an empty string.

  7. print(encoded) writes the string encoded to stdout.
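
To make steps 3 to 6 concrete, here is the same conversion unrolled into intermediate values, using the d_fmt string from the question. The variable names code_points and pieces are only for illustration; the script above does all of this in a single expression.

```python
text = "%d-%m-%Y"

# Step 4: ord() gives the code point of each character
code_points = [ord(ch) for ch in text]
print(code_points)  # [37, 100, 45, 37, 109, 45, 37, 89]

# Step 5: 04X formats each number as zero-padded, width-4, uppercase hex
pieces = ["<U{0:04X}>".format(cp) for cp in code_points]

# Step 6: join the pieces with an empty separator
encoded = "".join(pieces)
print(encoded)
```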



Answer 3:

echo -n "aä" | ruby -KU -e '$<.chars{|c| print "<U"+"%04X"%c.unpack("U*")[0]+">"}; puts'

Outputs <U0061><U00E4>

The -KU switch is equivalent to setting $KCODE = "U", i.e. the input is treated as UTF-8.
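
For comparison, a Python 3 sketch of the same non-ASCII case. It assumes the input bytes are UTF-8, and the function name encode_utf8_bytes is hypothetical.

```python
def encode_utf8_bytes(raw):
    """Decode UTF-8 bytes and return each code point as <Uxxxx>."""
    text = raw.decode("utf-8")
    # {:04X} pads to at least four hex digits; code points above U+FFFF
    # would simply come out longer
    return "".join("<U{:04X}>".format(ord(ch)) for ch in text)

# "aä" as it would arrive on a UTF-8 pipe: 0x61, then 0xC3 0xA4
print(encode_utf8_bytes(b"a\xc3\xa4"))
```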



Answer 4:

Shell script solution:

#!/bin/bash
# read -n1 is a bash extension, so this needs bash rather than a plain POSIX sh

while IFS= read -r -n1 c; do
    printf "<U%04X>" "'$c"
done

This reads standard input and prints to standard output (assuming you've put the script into the executable file toUnicode.sh):

> echo "hello" | toUnicode.sh
<U0068><U0065><U006C><U006C><U006F><U0000>

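The trailing <U0000> is not an EOF character: echo appends a newline, read -n1 returns an empty string when it reaches that newline, and printf prints 0 for the empty argument. Feeding the script with printf '%s' "hello" instead of echo avoids it, or you can alter the script to suit your needs, for example by reading the input one line at a time or trimming it.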
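
One way to avoid the stray <U0000> is to process the input line by line and drop each line terminator before converting. A minimal sketch in Python, where encode_lines is a hypothetical helper and io.StringIO stands in for standard input:

```python
import io

def encode_lines(stream):
    """Yield each input line in <Uxxxx> notation, without its newline."""
    for line in stream:
        body = line.rstrip("\n")
        yield "".join("<U{:04X}>".format(ord(ch)) for ch in body)

# io.StringIO mimics the piped input of: echo "hello" | script
for encoded in encode_lines(io.StringIO("hello\n")):
    print(encoded)
```

In a real filter, sys.stdin could be passed to encode_lines instead of the StringIO object.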