Generate random UTF-8 string in Python

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?

import unicodedata

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(0x110000)  # 0x10FFFF + 1
    # keep letters, marks, numbers, punctuation, symbols and separators
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
)

Tags: python unicode utf-8 random

8 Answers

姐就是有狂的资本

Reply #2 · 2019-01-23 03:12

You could download a website written in Greek or German that uses Unicode and feed that to your code.


看我几分像从前

Reply #3 · 2019-01-23 03:21

There is a UTF-8 stress test from Markus Kuhn you could use.


可以哭但决不认输i

Reply #4 · 2019-01-23 03:22

It depends how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for private-use (custom) characters, and many code points are not assigned at all. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

The basic approach I'd adopt is to randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode the valid code points in UTF-8.
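The generate, validate, encode approach could be sketched roughly like this in Python 3. This is only an illustration: the names random_code_point and random_utf8 are made up here, "legitimate" is assumed to mean a general category starting with L, M, N, P, S or Z (the same test as in the question), and the only context rule applied is "no combining mark as the first character".

```python
import random
import unicodedata

# Assumption: these major categories count as "legitimate" characters
# (letters, marks, numbers, punctuation, symbols, separators).
ALLOWED_MAJOR = set("LMNPSZ")

def random_code_point(rng, first):
    """Draw code points until one passes validation (rejection sampling)."""
    while True:
        cp = rng.randrange(0x110000)  # U+0000 .. U+10FFFF
        category = unicodedata.category(chr(cp))
        if category[0] not in ALLOWED_MAJOR:
            continue  # control, format, surrogate, private-use or unassigned
        if first and category[0] == "M":
            continue  # assumption: no combining mark at the start of the string
        return cp

def random_utf8(length, seed=None):
    """Return `length` validated random characters, encoded as UTF-8 bytes."""
    rng = random.Random(seed)
    chars = [chr(random_code_point(rng, first=(i == 0))) for i in range(length)]
    return "".join(chars).encode("utf-8")

if __name__ == "__main__":
    print(random_utf8(10).decode("utf-8"))
```

Passing a seed makes a failing test case reproducible, which helps with knowing what to expect for a given input.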

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.
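For the malformed case, a handful of hand-picked byte sequences may already be enough. The list below is an illustrative selection, not an exhaustive one, and is_rejected is a hypothetical helper that simply asks Python 3's strict UTF-8 decoder whether it refuses the input; replace it with a call into the code under test.

```python
# A few classic kinds of invalid UTF-8 (illustrative, not exhaustive)
MALFORMED_SAMPLES = [
    b"\x80",              # lone continuation byte
    b"\xc3",              # two-byte sequence cut off after the lead byte
    b"\xc0\xaf",          # overlong encoding of "/"
    b"\xed\xa0\x80",      # UTF-8-style encoding of the surrogate U+D800
    b"\xf4\x90\x80\x80",  # would decode to a value above U+10FFFF
    b"\xff",              # byte that never occurs in UTF-8
]

def is_rejected(data):
    """Return True if strict UTF-8 decoding of `data` fails."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

if __name__ == "__main__":
    for sample in MALFORMED_SAMPLES:
        print(sample, is_rejected(sample))
```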

Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.


劳资没心，怎么记你

Reply #5 · 2019-01-23 03:26

People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, just edit the include_ranges list in the example so that it holds the code point ranges you want.

import random

def get_random_unicode(length):
    """Return a random string of `length` characters drawn from include_ranges."""
    # unichr exists only on Python 2; on Python 3, chr covers all code points
    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    # (each pair is inclusive: first code point, last code point)
    include_ranges = [
        (0x0021, 0x0021),
        (0x0023, 0x0026),
        (0x0028, 0x007E),
        (0x00A1, 0x00AC),
        (0x00AE, 0x00FF),
        (0x0100, 0x017F),
        (0x0180, 0x024F),
        (0x2C60, 0x2C7F),
        (0x16A0, 0x16F0),
        (0x0370, 0x0377),
        (0x037A, 0x037E),
        (0x0384, 0x038A),
        (0x038C, 0x038C),
    ]

    # Flatten the ranges into a list of individual characters
    alphabet = [
        get_char(code_point)
        for first, last in include_ranges
        for code_point in range(first, last + 1)
    ]
    return ''.join(random.choice(alphabet) for _ in range(length))

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))


聊天终结者

Reply #6 · 2019-01-23 03:33

Answering revised question:

Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
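A quick way to see why CR, LF and TAB drop out is to look up their general categories with unicodedata. The snippet below is just a check written in Python 3; it applies the same first-letter test as the code in the question.

```python
import unicodedata

# General category of the characters mentioned above, plus the ordinary space
categories = {
    name: unicodedata.category(ch)
    for name, ch in [("TAB", "\t"), ("LF", "\n"), ("CR", "\r"), ("SPACE", " ")]
}

# Same test as in the question: keep only major categories L, M, N, P, S, Z
kept = {name: cat[0] in "LMNPSZ" for name, cat in categories.items()}

print(categories)  # TAB, LF and CR are all 'Cc'; SPACE is 'Zs'
print(kept)        # only SPACE passes the filter
```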

Please consider responding to my earlier invitation to tell us what you are really trying to do.


对你真心纯属浪费

Reply #7 · 2019-01-23 03:34

The following code prints every printable Unicode character (Python 3):

print(''.join(chr(l) for l in range(1, 0x110000)
              if chr(l).isprintable()))

All such characters are included, even those that the current font cannot display. Adding and not chr(l).isspace() to the condition also filters out whitespace; since isprintable() already rejects tab and the other separator characters, in practice this removes only the ASCII space.
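To get back to the random string the title asks for, the same isprintable() filter can feed random.choice(). This is a minimal Python 3 sketch; the names printable and random_printable are invented for the example, and building the pool scans the whole code space once, so it is worth doing only once per run.

```python
import random

# Pool of every code point that str.isprintable() accepts (built once)
printable = [chr(cp) for cp in range(0x110000) if chr(cp).isprintable()]

def random_printable(length, rng=random):
    """Return `length` characters picked uniformly from the printable pool."""
    return "".join(rng.choice(printable) for _ in range(length))

if __name__ == "__main__":
    print(random_printable(10))
```

Which characters count as printable depends on the Unicode database shipped with the Python version in use, so the pool can differ slightly between versions.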

