I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.
Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?
import unicodedata

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112)  # 0x10ffff + 1; unichr() above 0xffff needs a wide (UCS-4) Python 2 build
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
)
You could download a website written in Greek or German that uses Unicode and feed that to your code.
There is a UTF-8 stress test from Markus Kuhn you could use.
See also Really Good, Bad UTF-8 example test data.
It depends on how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for private use (custom characters), and the surrogate range U+D800 .. U+DFFF contains no characters at all. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?
The basic approach I'd adopt is to randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string), and encode valid code points in UTF-8.
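A minimal sketch of that generate, validate, encode loop, assuming Python 3. The function names and the particular validation rules (skip unassigned, surrogate, private-use and control code points; no combining mark at the start of the string) are illustrative assumptions, not a complete definition of "valid in context".

```python
import random
import unicodedata

# Categories rejected outright in this sketch (an assumption, adjust as needed):
# Cn = unassigned, Cs = surrogate, Co = private use, Cc = control.
REJECTED_CATEGORIES = ("Cn", "Cs", "Co", "Cc")


def random_code_point(rng):
    """Draw code points from U+0000..U+10FFFF until one passes the filter."""
    while True:
        cp = rng.randrange(0x110000)
        if unicodedata.category(chr(cp)) not in REJECTED_CATEGORIES:
            return cp


def random_utf8(length, seed=None):
    """Return UTF-8 bytes encoding `length` randomly chosen code points."""
    rng = random.Random(seed)
    chars = []
    while len(chars) < length:
        ch = chr(random_code_point(rng))
        # Context check: a combining mark (category M*) may not start the string.
        if not chars and unicodedata.category(ch).startswith("M"):
            continue
        chars.append(ch)
    return "".join(chars).encode("utf-8")
```

Passing a fixed `seed` makes a failing test case reproducible, which matters because the expected result has to be known for each generated input.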
If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes, such as random bytes or a fixed list of known-bad sequences.
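For the malformed case, one simple scheme (a sketch, assuming Python 3; the helper names and the sample list are illustrative) is to combine a few known-bad byte sequences with purely random bytes, using the strict built-in decoder as the reference for what counts as valid.

```python
import random

# A few byte sequences that a strict UTF-8 decoder must reject.
MALFORMED_SAMPLES = [
    b"\xff",          # byte that can never appear in UTF-8
    b"\xc0\x80",      # overlong encoding of U+0000
    b"\xed\xa0\x80",  # encoded surrogate U+D800
    b"\xe2\x82",      # truncated three-byte sequence
]


def random_bytes(length, seed=None):
    """Return `length` random bytes; most such strings are not valid UTF-8."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(length))


def is_valid_utf8(data):
    """Use Python's strict decoder as the reference for validity."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The code under test should then agree with `is_valid_utf8` on every input, whether it came from `MALFORMED_SAMPLES` or from `random_bytes`.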
Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.
People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, just extend the list of code point ranges with the ranges that you want.
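A minimal sketch of this range-based approach, assuming Python 3. The ranges listed and the names `INCLUDE_RANGES`, `ALPHABET` and `random_unicode_string` are illustrative choices, not necessarily those of the original example.

```python
import random

# Inclusive code point ranges to draw from; extend or trim this list as needed.
INCLUDE_RANGES = [
    (0x0021, 0x007E),  # printable ASCII without the space
    (0x00A1, 0x00AC),  # Latin-1 punctuation and symbols
    (0x00AE, 0x00FF),  # Latin-1 symbols and letters
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0391, 0x03A1),  # Greek capital letters Alpha to Rho
    (0x0410, 0x044F),  # basic Cyrillic letters
    (0x4E00, 0x4FFF),  # part of the CJK Unified Ideographs block
]

# Flatten the ranges into one list so random.choice() can pick from it.
ALPHABET = [
    chr(code_point)
    for first, last in INCLUDE_RANGES
    for code_point in range(first, last + 1)
]


def random_unicode_string(length, rng=random):
    """Return a string of `length` characters drawn uniformly from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))
```

For example, `random_unicode_string(20)` returns a 20-character string mixing the scripts above. Note that the draw is uniform over characters, so large ranges such as CJK dominate the output.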
Answering revised question:
Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
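A quick way to check which characters the category filter keeps, sketched for Python 3 (`is_selected` is a hypothetical helper that mirrors the condition in the question):

```python
import unicodedata

# Major category letters accepted by the filter in the question.
ALLOWED_MAJOR_CLASSES = "LMNPSZ"


def is_selected(ch):
    """Return True if the question's filter would keep this character."""
    return unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES


# CR, LF and TAB are all category Cc (control), so the filter drops them,
# while the ordinary space (Zs) and letters such as "A" (Lu) are kept.
for ch in ("\r", "\n", "\t", " ", "A"):
    print(repr(ch), unicodedata.category(ch), is_selected(ch))
```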
Please consider responding to my earlier invitation to tell us what you are really trying to do.
The following code prints every printable Unicode character:
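A minimal sketch of such a listing, assuming Python 3 and taking "printable" to mean whatever `str.isprintable()` accepts. The function name is illustrative, and the loop variable is called `l` only to match the condition quoted in this answer.

```python
import sys


def printable_characters():
    """Return one string holding every code point Python considers printable."""
    return "".join(
        chr(l)
        for l in range(sys.maxunicode + 1)
        if chr(l).isprintable()
    )
```

Calling `print(printable_characters())` then writes them all out, assuming the output stream can encode them (for example a UTF-8 terminal).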
All characters are present, even those that the font in use cannot render.
and not chr(l).isspace()
can be added to the condition to filter out all whitespace characters (including tab).