I'm trying to output a Unicode string in RTF format (using C# and WinForms).
If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.
I don't know how to convert a Unicode character into its Unicode code point number (as in "\u1576"). Conversion to UTF-8, UTF-16 and similar encodings is easy, but I don't know how to get the code point.
Scenario in which I use this:
- I read existing RTF file into string (I'm reading template)
- string.Replace #TOKEN# with MyUnicodeString (the template is populated with data)
- write result into another RTF file.
The problem arises when Unicode characters show up in the data.
Fixed code from accepted answer - added special character escaping, as described in this link
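The fixed listing itself is not reproduced in this copy of the thread. A minimal sketch of what such a fix could look like, assuming only BMP characters and assuming that a single ? is an acceptable fallback for readers without Unicode support; the names RtfTemplate, EscapeForRtf and FillToken are hypothetical, not taken from the original post:

```csharp
using System;
using System.Text;

public static class RtfTemplate
{
    // Escapes text so it can be inserted into RTF as literal content.
    // Assumption: one '?' fallback character per \uN (the default \uc1).
    public static string EscapeForRtf(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);   // RTF special characters get a backslash
            else if (c <= 0x7f)
                sb.Append(c);                // plain 7-bit ASCII passes through
            else
                sb.Append("\\u").Append(Convert.ToUInt32(c)).Append('?');
        }
        return sb.ToString();
    }

    // Hypothetical helper: replaces a token in the template with escaped data.
    public static string FillToken(string template, string token, string value)
    {
        return template.Replace(token, EscapeForRtf(value));
    }
}
```

In the scenario above, the template could be read with File.ReadAllText, passed through FillToken with "#TOKEN#", and written back with File.WriteAllText. Line breaks inside the data are not handled by this sketch.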
Provided that all the characters that you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), then a simple UTF-16 encoding should suffice.
Wikipedia:
The following sample program illustrates doing something along the lines of what you want:
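The original listing is missing from this copy of the thread, so what follows is only a minimal sketch along those lines, not the answer's exact code. It assumes BMP-only input and a ? fallback character; the names RtfUnicodeSample and GetRtfUnicodeEscapedString are illustrative:

```csharp
using System;
using System.Text;

public static class RtfUnicodeSample
{
    // Leaves 7-bit ASCII untouched and writes every other UTF-16 code unit
    // as \uN? where N is its decimal value.
    public static string GetRtfUnicodeEscapedString(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c <= 0x7f)
                sb.Append(c);
            else
                sb.Append("\\u").Append(Convert.ToUInt32(c)).Append('?');
        }
        return sb.ToString();
    }
}
```

For example, GetRtfUnicodeEscapedString("a\u0628b") returns a\u1576?b, since U+0628 (Arabic letter beh) is 1576 in decimal.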
The important bit is the Convert.ToUInt32(c) call, which essentially returns the code point value for the character in question. The RTF escape for Unicode requires a decimal Unicode value. The System.Text.Encoding.Unicode encoding corresponds to UTF-16 as per the MSDN documentation.
Based on the specification, here is some code in Java which is tested and works:
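The important thing is that you need to append 2 characters (something close to the Unicode character, or just ?) after the escaped Unicode value, because the Unicode character occupies 2 bytes. Note that, per the spec quoted below, the number of characters a reader skips after \uN is whatever the last \ucN value says, so the number of fallback characters has to agree with it.
Also, the spec says you should use a negative value if the code point is greater than 32767, but in my test it worked fine without negative values.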
Here is the spec:
\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number. This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.
RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.
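The signed 16-bit rule can be sketched in C# by reinterpreting each UTF-16 code unit as a short. This is a hedged illustration, not a complete writer: the name RtfSignedEscape is hypothetical, a single ? fallback (the default \uc1) is assumed, and the RTF special characters \, { and } are not escaped here.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RtfSignedEscape
{
    public static string Escape(string text)
    {
        var sb = new StringBuilder();
        // foreach over a string yields UTF-16 code units, so a character
        // outside the BMP is written as two \uN escapes (one per surrogate).
        foreach (char c in text)
        {
            if (c <= 0x7f)
            {
                sb.Append(c);
            }
            else
            {
                // Values above 32767 wrap around to negative numbers.
                short value = unchecked((short)c);
                sb.Append("\\u")
                  .Append(value.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, U+0628 stays positive (\u1576?), while U+FFFD (65533) is written as \u-3?. Whether a given reader recombines the two escapes produced for a surrogate pair is an assumption worth checking against the target application.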
You will have to convert the string to a byte[] array (using Encoding.Unicode.GetBytes(string)), then loop through that array and prepend a \ and u character to all Unicode characters you find. When you then convert the array back to a string, you'd have to leave the Unicode characters as numbers.
For example, if your array looks like this:
it would become:
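The example arrays are missing from this copy of the thread. As a rough sketch of the byte-array approach described in this answer, assuming Encoding.Unicode produces little-endian UTF-16 with two bytes per code unit and assuming a ? fallback character; the names RtfByteLoop and EscapeViaBytes are hypothetical:

```csharp
using System;
using System.Text;

public static class RtfByteLoop
{
    public static string EscapeViaBytes(string text)
    {
        // UTF-16 little-endian: low byte first, then high byte.
        byte[] bytes = Encoding.Unicode.GetBytes(text);
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < bytes.Length; i += 2)
        {
            int codeUnit = bytes[i] | (bytes[i + 1] << 8);
            if (codeUnit <= 0x7f)
                sb.Append((char)codeUnit);   // ASCII stays as a character
            else
                sb.Append("\\u").Append(codeUnit).Append('?');   // kept as a decimal number
        }
        return sb.ToString();
    }
}
```

This produces the same output as iterating over the string's characters directly, so the detour through byte[] is mostly a matter of taste.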