Generate URI-friendly Unicode code points from int

Posted 2019-04-02 12:23

I need to map an integer counter to URI-friendly Unicode code points (I'm writing a URL shortener that isn't restricted to the typical ASCII base-62 alphabet, 0-9a-zA-Z). I already have a prototype working; the web server receives GET requests for the %-encoded UTF-8 value of the Unicode code point (from Firefox, anyway), so that part is easy to handle.

Now I've reached the difficult part: converting the primary key of the URL being shortened - an integer - into usable Unicode code points (plural, for when I exceed the number of single code points I can use and have to use multiple). Right now my counter sometimes produces bad code points that aren't usable. I've read up a bit on Unicode, and I understand there are a lot of things to take into account:

  • Non-displayable characters
    • Noncharacters
    • Control codes
    • High/Low surrogates
    • Private-Use code points
    • Formatting, Bidi characters
  • Combining characters / diacritical marks
  • Whitespace
  • Duplicate/repeated characters
  • URI-scheme reserved characters, like /, +, ., ? (not a Unicode thing)
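
Independent of which code points end up in the usable set, the counter-to-string step itself is plain base-N conversion over whatever alphabet is chosen. A minimal sketch - the `ALPHABET` here is a hypothetical placeholder, not a real filtered set:

```python
# Hypothetical placeholder alphabet; in practice this would be the
# filtered list of 'safe' code points, loaded from the database.
ALPHABET = [chr(cp) for cp in range(0x4E00, 0x4E00 + 1000)]

def encode(n, alphabet=ALPHABET):
    """Convert a non-negative counter to a base-len(alphabet) string."""
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

def decode(s, alphabet=ALPHABET):
    """Inverse of encode: string of alphabet symbols back to an integer."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in s:
        n = n * len(alphabet) + index[ch]
    return n
```

With a 1,000-symbol alphabet, one character covers a thousand URLs and two characters cover a million; the multi-code-point case mentioned above falls out of the base conversion for free.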

My simple solution is to create a set of code points to map to that covers as many usable ones as possible by avoiding the 'bad character' ranges above. I'd also include only code points that are themselves grapheme cluster boundaries, i.e. not mutable by combining characters/diacritics (although if I blacklist the diacritic code points anyway, this may not matter). Is that a fair assumption? Is there a relatively easy way to generate such a set of code points?
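
As a rough first approximation of that set, a script can filter by Unicode General_Category: dropping the C* (control/format/surrogate/private-use/unassigned), M* (combining marks), and Z* (separator/whitespace) categories covers most of the bullet list above. A sketch using Python's `unicodedata` module, with the RFC 3986 reserved characters removed by hand - not exhaustive, just the "keep the bad ones out" baseline:

```python
import sys
import unicodedata

# RFC 3986 reserved characters (gen-delims + sub-delims), plus '%'.
URI_RESERVED = set(":/?#[]@!$&'()*+,;=%")

def usable_code_points():
    """Yield code points whose General_Category is not C* (other),
    M* (mark), or Z* (separator), and which are not URI-reserved."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch)[0] in "CMZ":
            continue
        if ch in URI_RESERVED:
            continue
        yield cp

safe = set(usable_code_points())
```

Note this only uses whatever UCD version the interpreter ships with, and it does nothing about visually-confusable "duplicate" characters - that needs separate data (see the second answer below's point about similar-looking glyphs).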

I've seen links to tools like unichars and uniprops, but I don't think I understand Unicode properties enough to realize if they will help me in this situation or not. I'm not interested in a fully-exhaustive list of usable code points, but >70% coverage would be awesome. I'm much more keen to keep the 'bad' code points out.

Another issue I'm wondering about is whether reserved code points and/or allocated code points without displayable representations (that look like a rectangular box with the hex value inside) should also be filtered out. Anecdotally, they appear to work, so I plan on leaving them in. Any good reason not to?

Apologies in advance if my Unicode terminology is incorrect.

TL;DR

How can I generate the set of all Unicode code points that are displayable (no control/formatting code points), excluding whitespace, duplicate/repeated characters, and combining characters/diacritical marks?

2 Answers
闹够了就滚
Answer #2 · 2019-04-02 12:30

The easiest solution I found was one I stumbled upon by chance: the official Unicode Properties JSP web app. I believe this is the query I used:

[:Diacritic=No:]&[:Noncharacter_Code_Point=No:]&[:Deprecated=No:]&[:White_Space=No:]&[:General_Category=Math_Symbol:]|[:General_Category=Symbol:]|[:General_Category=Letter:]|[:General_Category=Punctuation:]|[:General_Category=Currency_Symbol:]|[:General_Category=Number:]&[:General_Category!=Modifier_Letter:]&[:General_Category!=Modifier_Symbol:]

This yields 107,401 code points. I then filtered out the URI-reserved characters and a couple of others just to be safe before storing them in my database. Here is my working prototype, in unadvertised beta.

Some other things I tried, unsuccessfully:

I tried the Perl unichars utility, which I believe can do what I need, but my version of Perl (5.10.1) is linked against a Unicode 5.x database, and I couldn't quickly find instructions for upgrading it to the Unicode 6.0.0 standard. I had considered writing a Ruby app similar to unichars, but my Ruby install is also on Unicode 5.2 (Ruby 1.9.2, ActiveSupport 3.0.8). I found a way to apparently load a different Unicode table, but there is no documentation for it, and the unicode_tables.dat file on my system is a binary file, so no easy answer there.
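
For comparison, Python makes the equivalent version check trivial - the `unicodedata` module reports which UCD revision it was compiled against, so you can tell up front whether the runtime is stuck on an older standard the same way Perl 5.10 and Ruby 1.9 were:

```python
import unicodedata

# The Unicode Character Database version this interpreter ships with;
# for example, Python 2.7 reports "5.2.0" and Python 3.2 reports "6.0.0".
print(unicodedata.unidata_version)
```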

I had also considered parsing the Unicode 6.0.0 standard's UnicodeData.txt file myself, but apparently there are ranges of code points missing, such as Han, which would require me to parse yet another file in its own format.
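
Those "missing" ranges can actually be recovered from UnicodeData.txt itself: the collapsed ranges (Han ideographs among them) are marked by paired entries whose names end in ", First>" and ", Last>". A hedged sketch of a parser that expands them, assuming a local copy of the file and the documented `code;name;general_category;...` field layout:

```python
# Sketch: parse UnicodeData.txt directly, expanding the collapsed
# ranges whose names end in ", First>" / ", Last>" (e.g. CJK ideographs).
def parse_unicodedata(path):
    categories = {}
    range_start = None
    with open(path, encoding="ascii") as f:
        for line in f:
            if not line.strip():
                continue
            fields = line.split(";")
            cp = int(fields[0], 16)
            name, category = fields[1], fields[2]
            if name.endswith("First>"):
                range_start = cp
            elif name.endswith("Last>"):
                for c in range(range_start, cp + 1):
                    categories[c] = category
                range_start = None
            else:
                categories[cp] = category
    return categories
```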

你好瞎i
Answer #3 · 2019-04-02 12:39

Part of what you're asking may be impossible. No single font contains glyphs for all the Unicode characters, and most systems don't have enough fonts to cover all of Unicode. So if by "displayable" you mean the user can actually see a glyph, that's a problem.

There's also no guarantee that the glyphs for two different Unicode characters actually look different, but this file gives information about characters that are similar (for example, the number sign and the music sharp sign). That's probably as close as you can get to filtering out duplicate/repeated characters.
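
If that file is the UCD's confusables.txt, its data lines follow a `source ; target ; type` layout with hex code points and `#` comments - an assumption worth verifying against the copy you download. A parsing sketch that builds a source-to-skeleton mapping, so confusable source characters can be dropped from the alphabet:

```python
def load_confusables(path):
    """Map each source code point to the string it is confusable with.
    Assumes the 'source ; target ; type # comment' line layout."""
    mapping = {}
    # utf-8-sig tolerates an optional leading BOM in the data file
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.split("#")[0].strip()  # drop comments
            if not line:
                continue
            fields = line.split(";")
            source = int(fields[0].strip(), 16)
            target = "".join(chr(int(h, 16)) for h in fields[1].split())
            mapping[source] = target
    return mapping
```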

Otherwise, the Unicode character database should give you enough information about each character to let you filter out the ones you don't want (control characters, combining characters, whitespace).
