First of all, this is not a duplicate of: Quickest way to enumerate the alphabet
I need to get all the characters of the alphabet of an ARBITRARY (variable) language, and in the correct sort order.
How can I do that without knowing the alphabet of every possible culture/language? System.Globalization.CultureInfo, for example, has information on date formats, a sorting method, and code page info, but no information on the alphabet itself. Furthermore, iterating from 'A' to 'Z' won't do, because German, for example, has characters such as Ä, Ö, and Ü, which come after 'Z' in the code page numbering but sort right after A, O, and U.
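To make the ordering problem concrete, here is a minimal Python sketch (the thread itself is about .NET, so the language is only illustrative). It compares plain code point order with an approximate "base letter" order. The helper `approximate_sort_key` is hypothetical and is not how any culture-aware comparer works internally; real German collation has more rules than stripping diacritics.

```python
import unicodedata

def approximate_sort_key(ch):
    # Hypothetical helper: decompose the character (NFD) and drop the
    # combining marks, so that U+00C4 is keyed by its base letter "a".
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The original character is the tie-breaker, so "A" precedes U+00C4.
    return (base.casefold(), ch)

# U+00C4, U+00D6, U+00DC are the German umlauts, written as escapes so the
# code points are unambiguous.
letters = ["Z", "\u00c4", "A", "O", "\u00d6", "U", "\u00dc"]

by_code_point = sorted(letters)                            # umlauts land after Z
by_base_letter = sorted(letters, key=approximate_sort_key)  # umlauts follow A, O, U

print(by_code_point)
print(by_base_letter)
```

In production code a locale-aware comparer (such as the culture's own sorting method mentioned above) would be the better choice; the key function here only approximates it for Latin letters with diacritics.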
Can I somehow use the code pages to get all the characters, and then sort them? By 'all the characters' I mean all letters and digits, but not punctuation marks, and possibly only uppercase XOR lowercase.
If your reason for wanting to enumerate an alphabet is to produce an index, then you can use Windows.Globalization.Collation.CharacterGroupings.
I don't think that the .NET Framework provides what you want. First of all, not all languages have alphabets in the Western sense of the word. Second, even if you limit your coverage to those languages that have alphabets, iterating through the contents of a code page won't work, because many code pages cover several languages (e.g., CP 1252 covers the main Western European languages). Third, some of the more recently supported languages on Windows don't have code pages. I don't think there is a solution short of having a priori knowledge of the alphabets of the languages you're interested in.
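The second point can be checked with a short Python sketch (again only illustrative, since the question is about .NET). It walks every byte of a single-byte code page, keeps the letters and decimal digits, and shows that CP 1252 mixes letters from several languages. The function name `letters_and_digits_in_code_page` and the sample dictionary are assumptions made for this example.

```python
import unicodedata

def letters_and_digits_in_code_page(encoding):
    """Return the letters and decimal digits a single-byte code page can encode."""
    found = []
    for byte in range(256):
        # Bytes that the code page leaves undefined decode to "" here.
        ch = bytes([byte]).decode(encoding, errors="ignore")
        if not ch:
            continue
        category = unicodedata.category(ch)
        # "L*" categories are letters, "Nd" is a decimal digit; punctuation,
        # symbols, and control characters are skipped.
        if category.startswith("L") or category == "Nd":
            found.append(ch)
    return found

chars = letters_and_digits_in_code_page("cp1252")

# One characteristic letter per language, all found in the same code page.
sample = {
    "German": "\u00df",   # sharp s
    "Spanish": "\u00f1",  # n with tilde
    "Danish": "\u00f8",   # o with stroke
    "French": "\u00e9",   # e with acute
}
for language, ch in sample.items():
    print(language, ch, ch in chars)
```

Nothing in the code page says which of these letters belong to which language, which is why the result cannot be treated as "the alphabet" of any one culture.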
Perhaps if you explained what you are trying to achieve, a better solution could be suggested.
First off, let me say that I agree with what everyone else is saying. Would you consider the character é to be a valid US English character? It gets used pretty often, but it's not in the normal "a-z".
That said, here's some code (VB2010). This code calls into the unmanaged function GetLocaleInfoW and asks for a LOCALESIGNATURE structure, which contains Unicode code point ranges. This information is used to determine what ranges are needed for a given font.
The Char structure doesn't support all of the Unicode code points, so the function returns Strings instead. Look for "Surrogate pair" at the bottom of that link for more info.
This code doesn't do everything that you want, unfortunately. For example, the oft-cited Finnish language doesn't have the letter W, but in Windows the character exists in the valid code point range. I don't know a way of getting down to the nitty-gritty on that.
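As a side note on why a single Char is not enough for every code point: a .NET Char holds one 16-bit UTF-16 code unit, and code points above U+FFFF need two of them (a surrogate pair). The Python sketch below is only an illustration of that encoding rule, not the VB code referred to above; the helper name `utf16_code_units` is made up for the example.

```python
def utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units used to encode a character."""
    data = ch.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

bmp_char = "\u00c4"                 # U+00C4 fits in one 16-bit unit
supplementary_char = "\U0001d49c"   # U+1D49C lies above U+FFFF

for ch in (bmp_char, supplementary_char):
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X} ->", [f"0x{u:04X}" for u in units])
```

The second character comes out as two code units, a high surrogate followed by a low surrogate, which is why an API that enumerates arbitrary code points tends to return strings rather than single 16-bit characters.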