I am trying to find a resource that connects languages (or, more likely, scripts) to blocks of Unicode characters. Such a resource would be used to look up answers to questions such as "What Unicode blocks are used in French?" or "What languages use the block from 0A80-0AFF (http://unicodinator.com/#Block-Gujarati)?" Do you know of such a resource?
I would have expected to be able to find this information easily at unicode.org. I was quickly able to find a great table that relates country codes to languages (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html). But I've spent quite a bit of time poking around with no luck finding something that relates Unicode blocks to languages. It's possible I've got a terminology issue blocking me from connecting the dots here...
I am not picky about exactly what is meant by "language" (Java Locale code or ISO 639 code or whatever) in this case. I also understand that there may not be exact answers because, for instance, an Arabic document can contain Latin and other text in addition to characters from the Arabic blocks (http://unicodinator.com/#Block-Arabic, http://unicodinator.com/#Block-Arabic_Supplement). But surely there must be some table that says "these languages go with these blocks"... I'm also not picky about the format (XML, CSV, whatever), I can easily transform this into data I can use for my application. And again, I do realize the reference would probably connect Scripts to Blocks, not Languages (though Scripts can be mapped to Languages).
I do realize this will be a many-to-many table (since many languages use characters from multiple blocks, and many blocks are used by multiple languages). I also realize this cannot be answered precisely, since Unicode code points are not language-specific -- but neither can the question "what languages are spoken in this country?" (the answer is probably "most of them" for most countries), yet a table like this (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html) is still possible to create, meaningful, and useful.
As to why I'd want such a thing: I would like to enhance http://unicodinator.com with global heat-maps for the code blocks, and lists of languages; I also have a game concept I am tinkering with. Beyond that, there are probably many other uses other people could have for this (font creation? heuristic, quick, best-guess language detection now that the Google Translate API is going away? research projects?).
There is no such resource, and for a simple reason: Unicode code point assignments are language-independent. Thus each code point can be used by multiple languages.
Of course there are certain characters that map directly to one language, but in general each code point is meant to be shared. Therefore it does not make much sense to create code-point-to-language tables.
If you are looking for ways to detect language, this is definitely not the way to go.
I got an answer from Unicode.org themselves! In the CLDR subproject, there are documents such as:
for each language id, which you can search for "exemplarCharacters":
Or, there is this page: http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html with what looks like all of them. I will work on reshuffling this data into a langid -> blockid map of some kind, at which point I will probably award @borrible the "Answer" (rather than making mine the answer).
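To illustrate the reshuffling step: once you have a language's exemplar characters, each character can be bucketed into its Unicode block by code point range. Here's a minimal sketch in Python -- the block ranges are a tiny hand-copied sample, not the full list, which lives in the UCD's Blocks.txt file.

```python
# Map a language's exemplar characters to the Unicode blocks they fall in.
# BLOCKS below is a small illustrative sample copied from Blocks.txt;
# a real run would parse the whole file.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0A80, 0x0AFF, "Gujarati"),
]

def block_of(ch):
    """Return the name of the block containing ch, or None if it is
    outside this sample's ranges."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None

def blocks_for_exemplars(lang, exemplars):
    """Map a language id to the set of blocks its exemplars use."""
    return lang, {block_of(ch) for ch in exemplars} - {None}

# Abridged French exemplars: plain letters plus a few accented ones,
# which land in Basic Latin and Latin-1 Supplement respectively.
print(blocks_for_exemplars("fr", "abcéèêàç"))
```

Iterating this over every language id in CLDR would yield exactly the langid -> blockid map described above.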
I don't think that CLDR's exemplarCharacters will give accurate results. You can find each character's Script property in the UCD's Scripts.txt and ScriptExtensions.txt files. For more, read this (Unicode Script Property).
After you have the script, you can relate it to languages in CLDR using the languageData section of supplementalData.xml.
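The two-step pipeline described here can be sketched as follows. The Scripts.txt and supplementalData.xml excerpts are inlined samples in the real files' formats; a full run would read the actual files. One wrinkle to note: Scripts.txt uses long script names ("Gujarati") while CLDR's languageData uses ISO 15924 codes ("Gujr"), so an alias table (from the UCD's PropertyValueAliases.txt) is needed to join them -- only a hand-entered fragment of it appears below.

```python
# Step 1: Scripts.txt gives codepoint-range -> script.
# Step 2: CLDR languageData gives script -> languages.
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

SCRIPTS_TXT = """
0A81..0A83    ; Gujarati # Mn   [3] GUJARATI SIGN CANDRABINDU..
0A85..0A8D    ; Gujarati # Lo   [9] GUJARATI LETTER A..
0600..0604    ; Arabic   # Cf   [5] ARABIC NUMBER SIGN..
"""

SUPPLEMENTAL_XML = """<supplementalData><languageData>
  <language type="gu" scripts="Gujr"/>
  <language type="ar" scripts="Arab"/>
  <language type="fa" scripts="Arab"/>
</languageData></supplementalData>"""

# Fragment of the ISO 15924 alias table, sufficient for this sample.
ALIASES = {"Gujr": "Gujarati", "Arab": "Arabic"}

def parse_scripts(text):
    """Return {script_name: [(start, end), ...]} from Scripts.txt lines."""
    ranges = defaultdict(list)
    pattern = r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)"
    for m in re.finditer(pattern, text, re.MULTILINE):
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        ranges[m.group(3)].append((start, end))
    return ranges

def languages_by_script(xml_text):
    """Return {script_name: {language ids}} from CLDR languageData."""
    langs = defaultdict(set)
    for lang in ET.fromstring(xml_text).iter("language"):
        for code in lang.get("scripts", "").split():
            langs[ALIASES.get(code, code)].add(lang.get("type"))
    return langs

script_ranges = parse_scripts(SCRIPTS_TXT)
script_langs = languages_by_script(SUPPLEMENTAL_XML)
print(sorted(script_langs["Arabic"]))  # languages written in Arabic script
```

Joining `script_ranges` with `script_langs` on the script name gives the block-to-language relation the question asks for.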
How about generating (approximate) data yourself? One approach would be to use the different-language Wikipedias: download enough data in each language, generate a list of the characters used in the documents with counts, and apply a threshold to get rid of small instances of text borrowed from other languages. It would be approximate, but possibly a good starting point.