Generate random UTF-8 string in Python

Posted 2019-01-23 02:59

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question: is the following code sufficient to generate all valid non-control characters in Unicode?

import unicodedata

# Python 2 (unichr, xrange)
unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
    )

8 answers
姐就是有狂的资本
Reply #2 · 2019-01-23 03:12

You could download a website written in Greek or German that uses Unicode and feed that to your code.

可以哭但决不认输i
Reply #4 · 2019-01-23 03:22

It depends on how thoroughly you want to test and how accurately you want to generate the input. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for private-use (custom) characters. Do you want to worry about generating combining characters at the start of a string, where they are out of place because they should only appear after another character?

The basic approach I'd adopt is to randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string), and encode valid code points in UTF-8.
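A minimal sketch of that approach in Python 3. It assumes that "legitimate" means the code point's general category is a letter, mark, number, punctuation, symbol, or space separator, and that the only context rule is "no combining mark at the start of the string". The names is_legitimate and random_utf8 and both rules are illustrative assumptions, not a complete validity check.

```python
import random
import unicodedata

# Major general-category classes treated as "legitimate" here (an assumption):
# Letter, Mark, Number, Punctuation, Symbol. Space separators (Zs) are
# allowed separately below.
ALLOWED_MAJOR = set("LMNPS")


def is_legitimate(ch):
    """Reject surrogates, unassigned, private-use, control and format code points."""
    cat = unicodedata.category(ch)
    return cat[0] in ALLOWED_MAJOR or cat == "Zs"


def random_utf8(length, rng=random):
    """Return `length` random code points encoded as UTF-8 bytes."""
    chars = []
    while len(chars) < length:
        ch = chr(rng.randrange(0x110000))  # U+0000 .. U+10FFFF
        if not is_legitimate(ch):
            continue
        # Context check: a combining mark may not start the string.
        if not chars and unicodedata.category(ch).startswith("M"):
            continue
        chars.append(ch)
    return "".join(chars).encode("utf-8")
```

Passing a seeded random.Random instance as rng makes the generated input reproducible, which helps when a test failure has to be replayed.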

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.
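One such simpler scheme, sketched here in Python 3 under the assumption that a handful of known-bad byte sequences is enough: start from valid UTF-8 and break it in well-understood ways. The helper names are hypothetical, and the list is not an exhaustive fuzzer.

```python
def truncate_last_byte(text):
    """Encode `text` as UTF-8 and drop the final byte.

    The result is malformed only if the last character is multi-byte.
    """
    return text.encode("utf-8")[:-1]


MALFORMED_SAMPLES = [
    b"\x80",                       # lone continuation byte
    b"\xc3",                       # lead byte with no continuation
    b"\xc0\xaf",                   # overlong encoding of "/"
    b"\xed\xa0\x80",               # UTF-8-style encoding of surrogate U+D800
    b"\xf4\x90\x80\x80",           # code point above U+10FFFF
    truncate_last_byte("\u20ac"),  # euro sign cut to two of its three bytes
]


def is_valid_utf8(data):
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Each sample should be rejected by a strict decoder, so the code under test can be checked for how it reacts: raising, replacing, or skipping.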

Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.

劳资没心,怎么记你
Reply #5 · 2019-01-23 03:26

People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, extend the include_ranges list in the example with the code point ranges that you want.

import random

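def get_random_unicode(length):
    """Return a random string of `length` characters from include_ranges."""
    # unichr exists only in Python 2; fall back to chr on Python 3
    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        ( 0x0021, 0x0021 ),  # !
        ( 0x0023, 0x0026 ),  # # $ % &
        ( 0x0028, 0x007E ),  # ( through ~
        ( 0x00A1, 0x00AC ),  # Latin-1 Supplement, before the soft hyphen
        ( 0x00AE, 0x00FF ),  # Latin-1 Supplement, after the soft hyphen
        ( 0x0100, 0x017F ),  # Latin Extended-A
        ( 0x0180, 0x024F ),  # Latin Extended-B
        ( 0x2C60, 0x2C7F ),  # Latin Extended-C
        ( 0x16A0, 0x16F0 ),  # Runic
        ( 0x0370, 0x0377 ),  # Greek and Coptic
        ( 0x037A, 0x037E ),  # Greek and Coptic
        ( 0x0384, 0x038A ),  # Greek and Coptic
        ( 0x038C, 0x038C ),  # Greek and Coptic (single code point)
    ]

    # Expand the inclusive ranges into a flat list of characters
    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for _ in range(length))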

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))
聊天终结者
Reply #6 · 2019-01-23 03:33

Answering revised question:

Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
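A quick check of that point, sketched in Python 3 rather than the Python 2 of the question. CR, LF, and TAB all have general category Cc, so a filter on the major classes L, M, N, P, S, Z rejects them, while the ordinary space (Zs) passes. The helper name passes_filter is hypothetical.

```python
import unicodedata


def passes_filter(ch):
    """Apply the question's test: major category is one of L, M, N, P, S, Z."""
    return unicodedata.category(ch)[0] in "LMNPSZ"


for ch in "\r\n\t ":
    status = "kept" if passes_filter(ch) else "dropped"
    print(repr(ch), unicodedata.category(ch), status)
```

If line breaks and tabs should be part of the test data, they have to be added back explicitly alongside the filtered characters.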

Please consider responding to my earlier invitation to tell us what you are really trying to do.

对你真心纯属浪费
Reply #7 · 2019-01-23 03:34

The following Python 3 code prints every printable Unicode character:

print(''.join(chr(cp) for cp in range(0x110000)  # 0x10ffff + 1
              if chr(cp).isprintable()))

All printable characters are included, even those that the font in use cannot render. Adding "and not chr(cp).isspace()" to the condition also filters out the ASCII space, which is the only whitespace character that isprintable() accepts; tabs and the other space characters are already excluded.
