I'm making some changes to Linux locale files in /usr/share/i18n/locales (like pt_BR), and format strings (like %d-%m-%Y %H:%M) must be specified in Unicode notation, where each character (in this case, ASCII) is represented as <U00xx>.
So a text like this:
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt "%d-%m-%Y"
t_fmt "%T"
Must be:
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt "<U0025><U0054>"
Thus I need a command-line script (be it bash, Python, Perl, or something else) that takes an input like %d-%m-%Y and converts it to <U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>.
All characters in the input string will be ASCII chars (from 0x20 to 0x7F), so this is really just a fancier "char-to-hex-string" conversion.
Could anyone please help me? My skills in bash scripting are very limited, and even worse in Python.
Bonus for elegant, explained solutions.
Thanks!
(by the way, this would be the "reverse" script for my previous question)
Every char with file input
If you want to convert every character of a file to its Unicode representation, this one-liner is enough:
while IFS= read -r -n1 c;do printf "<U%04X>" "'$c"; done < ./infile
Every char on STDIN
If you want a Unix-style tool that converts input on STDIN to this Unicode notation, use this:
uni(){ c=$(cat); for((i=0;i<${#c};i++)); do printf "<U%04X>" "'${c:i:1}"; done; }
Proof of Concept
$ echo "abc" | uni
<U0061><U0062><U0063>
Only chars between double-quotes
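#!/bin/bash
flag=0
while IFS= read -r -n1 c; do
    if [[ "$c" == '"' ]]; then
        # Toggle between "inside quotes" and "outside quotes"
        ((flag^=1))
        printf "%c" "$c"
    elif [[ "$c" == $'\0' ]]; then
        # $'\0' is an empty string in bash; read leaves c empty at a newline
        echo
    elif ((flag)); then
        printf "<U%04X>" "'$c"
    else
        printf "%c" "$c"
    fi
done < /path/to/infile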
Proof of Concept
$ cat ./unime
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt "%d-%m-%Y"
t_fmt "%T"
abday "Dom";"Seg";/
here is a string with "multiline
quotes";/
$ ./uni.sh
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt "<U0025><U0054>"
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
here is a string with "<U006D><U0075><U006C><U0074><U0069><U006C><U0069><U006E><U0065>
<U0071><U0075><U006F><U0074><U0065><U0073>";/
Explanation
Pretty simple, really:
while IFS= read -r -n1 c;
: Iterate over the input one character at a time (via -n1) and store the char in the variable c. The IFS= and -r flags are there so that the read builtin doesn't strip whitespace or interpret backslash escapes, respectively.
if [[ "$c" == '"' ]];
: If the current char is a double-quote
((flag^=1))
: Invert the value of flag from 0->1 or 1->0
elif [[ "$c" == $'\0' ]];
: If the current char is empty, then echo a newline. Bash strings cannot hold a NUL, so $'\0' is just the empty string; read leaves c empty when it reaches the newline that ends a line.
elif ((flag))
: If flag is 1, then perform unicode transliteration
printf "<U%04X>" "'$c"
: The magic that does the unicode transliteration. Note that the single-quote before the $c is mandatory: when an argument starts with a quote character, printf uses the numeric value of the character that follows, and %04X then prints that number as four uppercase hex digits.
else printf "%c" "$c"
: Print out the character with no unicode transliteration performed
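For comparison, the same quote-toggling logic can be sketched in Python. This is only an illustration of the steps explained above, not a replacement for the bash script; the function name escape_quoted is made up for this example.

```python
def escape_quoted(text):
    """Replace each character inside double quotes with <UXXXX>.

    escape_quoted is a hypothetical helper name used for illustration.
    """
    out = []
    in_quotes = False  # plays the role of `flag` in the bash script
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on every double quote
            out.append(ch)
        elif in_quotes and ch != "\n":
            out.append("<U{0:04X}>".format(ord(ch)))
        else:
            # Outside quotes, or a newline inside quotes: copy unchanged
            out.append(ch)
    return "".join(out)


sample = 'd_fmt "%d-%m-%Y"\nt_fmt "%T"\n'
print(escape_quoted(sample), end="")
```

Like the bash version, it passes newlines through untouched, so multiline quoted strings keep their line breaks.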
Using Python
#!/usr/bin/env python3.2
import sys
text = sys.argv[1]
encoded = "".join("<U{0:04X}>".format(ord(char)) for char in text)
print(encoded)
Usage:
$ python3 file.py "enter_input"
<U0065><U006E><U0074><U0065><U0072><U005F><U0069><U006E><U0070><U0075><U0074>
(The same script should work on both Python 3.x and 2.x. Just change the version in the shebang to the one you have.)
Explanation:
We need to import the sys module to read the command-line arguments.
sys.argv
is the list of all command-line arguments. Entry [0] is the program name, entry [1] is the first argument, etc.
f(char) for char in text
is a generator expression. It loops over each character in the text variable, applies the function f to it, and yields the results lazily as an iterable.
ord(char)
finds the Unicode code point of the character.
"<U{0:04X}>".format(x)
is the string formatting method. The format string takes one input, x, and formats it with the 04X spec, meaning zero-padded, width 4, uppercase hexadecimal.
"".join(it)
concatenates all elements of the iterable it. The "" means the separator is an empty string.
print(encoded)
writes the string encoded to stdout.
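To make the steps above concrete, here is the same pipeline unrolled into separate stages for the sample input %d-%m-%Y from the question. The intermediate variable names (code_points, tokens) are only for illustration and do not appear in the script.

```python
text = "%d-%m-%Y"

# Step 1: ord() maps each character to its code point.
code_points = [ord(char) for char in text]

# Step 2: format each code point as 4-digit uppercase hex inside <U...>.
tokens = ["<U{0:04X}>".format(cp) for cp in code_points]

# Step 3: join the pieces with an empty separator.
encoded = "".join(tokens)

print(code_points)  # [37, 100, 45, 37, 109, 45, 37, 89]
print(tokens[0])    # <U0025>
print(encoded)      # <U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>
```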
echo -n "aä" | ruby -KU -e '$<.chars{|c| print "<U"+"%04X"%c.unpack("U*")[0]+">"}; puts'
Outputs <U0061><U00E4>
The -KU switch is equivalent to setting $KCODE = "U".
Shell script solution:
#!/bin/bash
# read -n1 is a bash extension, so this needs bash rather than a plain POSIX sh
while IFS= read -r -n1 c; do
    printf "<U%04X>" "'$c"
done
This reads standard input and prints to standard output (assuming you've put the script into the executable file toUnicode.sh):
> echo "hello" | toUnicode.sh
<U0068><U0065><U006C><U006C><U006F><U0000>
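Note the trailing <U0000>. It is not really an EOF character: echo appends a newline, read treats that newline as the line delimiter and leaves c empty, and printf then prints 0 for the empty value. You can alter this script to suit your needs, whether you want to read the input one line at a time, trim it, or change it some other way.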
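One way to do the "one line at a time, trimmed" variant is sketched below in Python rather than shell. It assumes you want one output line per input line with the newline dropped instead of encoded; encode_lines is a hypothetical name, and io.StringIO stands in for standard input so the example runs on its own.

```python
import io


def encode_lines(stream):
    """Yield one <UXXXX> string per input line, with the newline trimmed.

    encode_lines is a hypothetical helper name used for illustration.
    """
    for line in stream:
        trimmed = line.rstrip("\n")  # drop the newline instead of encoding it
        yield "".join("<U{0:04X}>".format(ord(ch)) for ch in trimmed)


# io.StringIO stands in for sys.stdin so the example is self-contained.
for encoded in encode_lines(io.StringIO("hello\n%T\n")):
    print(encoded)
```

Passing sys.stdin instead of the StringIO object would turn this into a filter that can sit at the end of a pipe.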