Generate random UTF-8 string in Python

2019-01-23 02:59发布

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in ('LMNPSZ')
    )

8条回答
Deceive 欺骗
2楼-- · 2019-01-23 03:37

Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed (but each byte in the sequence is).

查看更多
闹够了就滚
3楼-- · 2019-01-23 03:37

Since Unicode is just a range of - well - codes, what about using unichr() to get the unicode string corresponding to a random number between 0 and 0xFFFF?
(Of course that would give just one codepoint, so iterate as required)

查看更多
登录 后发表回答