I need to map an integer counter to URI-friendly Unicode code points (I'm writing a URL shortener that isn't restricted to the typical ASCII base-62 alphabet, `0-9a-zA-Z`). I already have a prototype working: the web server receives GET requests for the %-encoded UTF-8 value of the Unicode code point (from Firefox, anyway), so it is very easy to handle.
The difficult part is converting the primary key of the URL being shortened (an integer) into usable Unicode code points: a single code point at first, then several once I run out of single code points. Right now my counter sometimes produces code points that aren't usable. I read up a bit on Unicode, and I understand that there is a lot to take into account:
- Non-displayable characters
- Noncharacters
- Control codes
- High/Low surrogates
- Private-Use code points
- Formatting, Bidi characters
- Combining characters / diacritical marks
- Whitespace
- Duplicate/repeated characters
- URI-scheme reserved characters, like `/`, `+`, `.`, `?` (not a Unicode thing)
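To make the list above concrete, here is a minimal Python sketch of a filter built on the standard `unicodedata` module. The names `EXCLUDED_CATEGORIES`, `URI_UNSAFE`, `is_usable`, and `build_alphabet` are made up for illustration, and the choice of which general categories count as "bad" is an assumption, not a definitive rule. Treating any character that changes under NFKC normalization as a "duplicate" is also only one possible interpretation.

```python
import sys
import unicodedata
from typing import List

# Assumed mapping from the list above to Unicode general categories:
# Cc control, Cf format (includes bidi controls), Cs surrogates,
# Co private use, Cn unassigned (includes noncharacters),
# Mn/Mc/Me combining marks, Zs/Zl/Zp space and line/paragraph separators.
EXCLUDED_CATEGORIES = {
    "Cc", "Cf", "Cs", "Co", "Cn",
    "Mn", "Mc", "Me",
    "Zs", "Zl", "Zp",
}

# RFC 3986 reserved characters, plus "." and "%" (an extra assumption).
URI_UNSAFE = set(":/?#[]@!$&'()*+,;=") | set(".%")


def is_usable(cp: int) -> bool:
    """Return True if the code point passes all of the assumed filters."""
    ch = chr(cp)
    if ch in URI_UNSAFE:
        return False
    if unicodedata.category(ch) in EXCLUDED_CATEGORIES:
        return False
    # Treat anything that NFKC-normalizes to a different string as a duplicate.
    if unicodedata.normalize("NFKC", ch) != ch:
        return False
    return True


def build_alphabet(limit: int = sys.maxunicode + 1) -> List[int]:
    """Collect every usable code point below `limit`."""
    return [cp for cp in range(limit) if is_usable(cp)]
```

The result depends on the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`), so code points assigned in later Unicode versions are reported as `Cn` and dropped.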
My simple solution is to build a set of code points to map to that covers as many usable ones as I can. I would avoid the 'bad character' ranges above and only include code points that are, in themselves, also grapheme cluster boundaries, i.e. not mutable by combining characters / diacritics (although I suppose if I blacklist diacritic code points this won't matter). Is that a fair assumption? Is there a relatively easy way to generate such a set of code points?
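For the counter-to-code-point step itself, here is a minimal sketch, assuming the usable set has already been collected into an ordered string `alphabet` with no repeated characters. The function names `encode` and `decode` are hypothetical. It uses bijective base-N numbering, so every single-symbol string is used up before any two-symbol string appears, and no two integers share an output.

```python
def encode(n: int, alphabet: str) -> str:
    """Map a non-negative integer to a string over `alphabet` (bijective base-N)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(alphabet)
    n += 1  # shift so that 0 maps to the first single symbol
    out = []
    while n > 0:
        n, rem = divmod(n - 1, base)
        out.append(alphabet[rem])
    return "".join(reversed(out))


def decode(s: str, alphabet: str) -> int:
    """Inverse of encode: map a string over `alphabet` back to its integer."""
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in s:
        n = n * base + index[ch] + 1
    return n - 1
```

With a two-symbol alphabet `"ab"`, the integers 0 through 6 map to `a`, `b`, `aa`, `ab`, `ba`, `bb`, `aaa`. The alphabet's order has to stay fixed once links have been issued, otherwise existing short URLs would decode to different keys.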
I've seen links to tools like unichars and uniprops, but I don't think I understand Unicode properties well enough to tell whether they will help me in this situation. I'm not interested in a fully exhaustive list of usable code points, but >70% coverage would be awesome. I'm much more keen to keep the 'bad' code points out.
Another issue I'm wondering about is whether reserved code points and/or allocated code points without displayable representations (that look like a rectangular box with the hex value inside) should also be filtered out. Anecdotally, they appear to work, so I plan on leaving them in. Any good reason not to?
Apologies in advance if my Unicode terminology is incorrect.
TL;DR
How can I generate the set of all Unicode code points that are displayable (no control/formatting code points), excluding whitespace, duplicate/repeated characters, and combining characters/diacritical marks?