In some right-to-left languages (Arabic, Persian, Urdu, etc.) each letter can take different shapes: an isolated form, an initial form, a medial form, and a final form (you can see them in the Windows Character Map for any Unicode font).
Suppose you need the exact characters that the user has entered in a text box. By default, when you convert the String to a char array, each character comes back in its isolated form.
(My guess is that characters typed on the keyboard are stored in the isolated form and are only converted to the proper form when they are displayed on screen. If you build the string from the exact character codes instead, you do get the proper array.)
My question is: how can we get the form of the string that is actually displayed in the text box?
If there is no way to do this in .NET, then I will need to write my own class to do the conversion T_T
Windows uses Uniscribe to perform contextual shaping for complex scripts (which can apply to left-to-right as well as right-to-left languages). The text displayed in a text box is based on the glyph information produced after the characters have been fed into Uniscribe. Although the Unicode standard defines code points for the isolated, initial, medial, and final forms of a character, not all fonts necessarily support them; a font may instead have pre-shaped glyphs or use a combination of glyphs. Uniscribe uses a shaping engine from the Windows language pack to determine which glyph(s) to use, based on the font's cmap. Here are some relevant links:
- More Uniscribe Mysteries (explains difference between glyphs and characters)
- Microsoft Bhasha, Glyph Processing: Uniscribe
- MSDN: Complex Scripts Awareness
- Buried in the bowels of Mozilla code is code that handles complex script rendering using Uniscribe. There's also additional code that scans the list of fonts in the system and reads the cmap tables of each font. (From the comments at http://www.siao2.com/2005/12/06/500485.aspx).
- Sorting it all Out: Did he say shaping? It's not in the script!
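If what you need is the shaped code points themselves (rather than the glyph indices Uniscribe produces), one option is to apply the joining rules by hand. The following is only a minimal sketch under heavy assumptions: the class name `ArabicShapingSketch` and its tiny four-letter table are made up for illustration, and it ignores diacritics, lam-alef ligatures, and every letter not in the table. It maps each base letter to its Arabic Presentation Forms-B code point depending on whether its neighbours connect to it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of .NET: a deliberately tiny shaping table.
public static class ArabicShapingSketch
{
    // Presentation forms per base letter, in the order:
    // isolated, final, initial, medial.
    // '\0' in the initial/medial slots marks a right-joining letter,
    // i.e. one that never connects to the letter that follows it.
    private static readonly Dictionary<char, char[]> Forms =
        new Dictionary<char, char[]>
        {
            { '\u0627', new[] { '\uFE8D', '\uFE8E', '\0', '\0' } },         // ALEF
            { '\u0628', new[] { '\uFE8F', '\uFE90', '\uFE91', '\uFE92' } }, // BEH
            { '\u062C', new[] { '\uFE9D', '\uFE9E', '\uFE9F', '\uFEA0' } }, // JEEM
            { '\u0647', new[] { '\uFEE9', '\uFEEA', '\uFEEB', '\uFEEC' } }, // HEH
        };

    // True if the letter can connect to the letter that follows it.
    private static bool IsDualJoining(char c)
    {
        char[] forms;
        return Forms.TryGetValue(c, out forms) && forms[2] != '\0';
    }

    // Converts base letters (logical order) to positional presentation forms.
    public static string Shape(string logical)
    {
        var sb = new StringBuilder(logical.Length);
        for (int i = 0; i < logical.Length; i++)
        {
            char c = logical[i];
            char[] forms;
            if (!Forms.TryGetValue(c, out forms))
            {
                sb.Append(c); // Unknown character: pass through unchanged.
                continue;
            }

            // Connected to the previous letter only if that letter joins forward.
            bool joinsPrev = i > 0 && IsDualJoining(logical[i - 1]);
            // Connected to the next letter only if this letter joins forward
            // and the next character is a joining letter we know about.
            bool joinsNext = forms[2] != '\0'
                && i + 1 < logical.Length
                && Forms.ContainsKey(logical[i + 1]);

            int index = joinsPrev ? (joinsNext ? 3 : 1) : (joinsNext ? 2 : 0);
            sb.Append(forms[index]);
        }
        return sb.ToString();
    }
}
```

For example, `ArabicShapingSketch.Shape("\u0628\u0627\u0628")` yields the initial BEH, the final ALEF, and then an isolated BEH, because ALEF does not connect to the letter after it. A real implementation would need the full joining data from the Unicode Character Database rather than a hand-written table.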
The TextRenderer.DrawText() method uses Uniscribe via the Win32 DrawTextExW() function, using the following P/Invoke:

    using System.Runtime.InteropServices;

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern int DrawTextExW(
        HandleRef hDC,
        string lpszString,
        int nCount,
        ref RECT lpRect,
        int nFormat,
        [In, Out] DRAWTEXTPARAMS lpDTParams);

    [StructLayout(LayoutKind.Sequential)]
    public struct RECT
    {
        public int left;
        public int top;
        public int right;
        public int bottom;
    }

    // Declared as a class so that it is marshalled by reference and may be null.
    [StructLayout(LayoutKind.Sequential)]
    public class DRAWTEXTPARAMS
    {
        // The Win32 structure starts with its own size in bytes.
        public int cbSize = Marshal.SizeOf(typeof(DRAWTEXTPARAMS));
        public int iTabLength;
        public int iLeftMargin;
        public int iRightMargin;
        public int uiLengthDrawn;
    }
So how are you creating the "wrong" string? If you're just putting it in a string literal, then it's quite possible it's just the input method that's wrong. If you copy the "right" string after displaying it, and then paste that into a string literal, what happens? You might also want to check which encoding Visual Studio is using for your source files. If you're not putting the string into your source code as a literal, how are you creating it?
Given the possibility for confusion, I think I'd want to either keep these strings in a resource or hard-code them using Unicode escapes:

    string text = "\ufb64\ufea0\ufe91\ufeea";
(Then possibly put a comment afterwards showing the non-escaped value; at least then if it looks about right, it won't be too misleading. Admittedly it's then easy for the two to get out of sync...)
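To produce such an escaped literal from a string you already have at run time, a small helper along these lines might do. This is just a sketch: `UnicodeEscapeSketch` and `ToEscapes` are names invented here, and it escapes every UTF-16 code unit, including plain ASCII.

```csharp
using System;
using System.Text;

// Hypothetical helper for dumping a string as C#-style \uXXXX escapes.
public static class UnicodeEscapeSketch
{
    public static string ToEscapes(string text)
    {
        var sb = new StringBuilder(text.Length * 6);
        foreach (char c in text)
        {
            // Four lowercase hex digits per UTF-16 code unit.
            sb.Append("\\u").Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }
}
```

Printing `UnicodeEscapeSketch.ToEscapes(someText)` for the text in question also shows exactly which code points the string holds, which helps tell a shaped string from an unshaped one.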
This is a bit of a wild guess, but does String.Normalize() help here? It is unclear to me whether that just covers character composition or if it includes positional forms as well.
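For what it's worth, the compatibility normalization forms do touch positional forms, but in the opposite direction: the presentation-form characters carry compatibility decompositions to their base letters, so FormKC and FormKD fold shaped code points back into plain letters rather than producing shaped ones. A small check, assuming the runtime has full Unicode normalization data available (the class and method names are only for illustration):

```csharp
using System;
using System.Text;

// Hypothetical wrapper showing what compatibility normalization does
// to Arabic presentation forms.
public static class NormalizeSketch
{
    // Folds presentation-form code points back to their base letters.
    public static string ToBaseLetters(string shaped)
    {
        return shaped.Normalize(NormalizationForm.FormKC);
    }
}
```

With this, `NormalizeSketch.ToBaseLetters("\uFE91\uFE92\uFE90")` (initial, medial, and final BEH) comes back as three copies of U+0628, so Normalize() is useful for comparing shaped text against what the user typed, but not for obtaining the displayed forms.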