I am writing a doctest for a function that outputs a list of tokenized words.
r'''
>>> s = "This is a tokenized sentence s\u00f3"
>>> tokenizer.tokenize(s)
['This', 'is', 'a', 'tokenized', 'sentence', 'só']
'''
Using Python 3.4, my test passes with no problems.
Using Python 2.7, I get:
Expected:
['This', 'is', 'a', 'tokenized', 'sentence', 'só']
Got:
[u'This', u'is', u'a', u'tokenized', u'sentence', u's\xf3']
My code has to work on both Python 3.4 and Python 2.7. How can I solve this problem?
Python 3 uses different string literals for Unicode objects: there is no u prefix (in the canonical representation) and some non-ASCII characters are shown literally, e.g., 'só' is a Unicode string in Python 3 (it would be a bytestring in Python 2 if you saw it in the output). If all you are interested in is how the function splits the input text into tokens, you could print each token on a separate line, which makes the output identical on Python 2 and 3:
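A minimal sketch of such a doctest follows; it assumes the module containing the docstring uses from __future__ import unicode_literals (or a u''' docstring) so that the expected text is also compared as Unicode on Python 2:

>>> s = u"This is a tokenized sentence s\u00f3"
>>> for token in tokenizer.tokenize(s):
...     print(token)
This
is
a
tokenized
sentence
só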
As an alternative, you could customize doctest.OutputChecker, for example:
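Here is a minimal sketch of such a checker (the class name Py23OutputChecker is mine); the idea is to strip Python 2's u'' prefixes from the actual output before comparing it with the Python 3 style expected output:

import doctest
import re
import sys

class Py23OutputChecker(doctest.OutputChecker):
    def check_output(self, want, got, optionflags):
        if sys.version_info[0] < 3:
            # Naive rewrite: drop u'...' / u"..." prefixes from Python 2 reprs
            # so they match the Python 3 style expected output.
            got = re.sub(r"\bu'(.*?)'", r"'\1'", got)
            got = re.sub(r'\bu"(.*?)"', r'"\1"', got)
        return doctest.OutputChecker.check_output(self, want, got, optionflags)

doctest.testmod() does not accept a custom checker, but doctest.DocTestSuite does, so one way to wire it in (mymodule is a hypothetical name for the module whose docstrings contain the doctests):

if __name__ == "__main__":
    import unittest
    import mymodule  # hypothetical: the module with the doctests
    suite = doctest.DocTestSuite(mymodule, checker=Py23OutputChecker())
    unittest.TextTestRunner(verbosity=2).run(suite)

With this approach the expected output in the docstring stays in the Python 3 form (no u prefixes), and the checker only normalizes the output that Python 2 actually produced.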