I was trying to convert a file from utf-8 to Arabic-1265 encoding using the Encoding APIs in C#, but I faced a strange problem that some characters are not converted correctly such as "لا" in the following statement "ﻣﺣﻣد ﺻﻼ ح عادل" it appears as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters are from the Arabic Presentation Forms B. I create the file using notepad++ and save it as utf-8.
here is the code I use
StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
string str = sr.ReadLine();
StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
sw.Write(str);
sw.Flush();
sw.Close();
But, I don't know how to convert the file correctly using this presentation forms in C#.
Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:
str = str.Normalize(NormalizationForm.FormKD);
st.Write(str);
To give a more general answer:
The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).
the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single codepoint (U+FEFB) despite being two characters.
When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.
When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.
As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?
). Furthermore, those Presentation-Form characters which do have a normal-text equivalent, will change their ligation behaviour, because that is what normal Arabic text does.
First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are \x0644
and \x0627
, which are from the standard Arabic block. However, just to be sure I tried the character \xFEFB
, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.
Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.
So I tried the following:
var input = "لا";
var encoding = Encoding.GetEncoding("windows-1256");
var result = encoding.GetBytes(input);
Console.WriteLine(string.Join(", ", result));
The output I get is 225, 199
. So let’s try to turn it back:
var bytes = new byte[] { 225, 199 };
var result2 = encoding.GetString(bytes);
Console.WriteLine(result2);
Fair enough, the Console does not display the result correctly — but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.
Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.
My recommendation:
Write a short piece of code that exhibits the problem.
Post a new question with that piece of code.
In that question, describe exactly what result you get, and what result you expected instead.