EDIT : This question doesn't really make sense once you have picked up what the "r" flag means. More details here.
For people looking for a quick answer, I added one below.
If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings :
- p1 = "pattern"
- p2 = u"pattern"
- p3 = r"pattern"
- p4 = ru"pattern"
I have a bunch of unicode strings coming from a Web form input and want to use them as regexp patterns.
I want to know what processing I should apply to the strings so I can expect the same results as with the manual forms above. Something like :
import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)
What would be someProcess1 to someProcessN and why ?
I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.
Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
Note the following in your first example:
>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True
While these constructs look different, they all do the same thing: they create a string object (p1 and p3 a str, and p2 and p4 a unicode object in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the quoted string that follows, namely as unicode text (u) and/or as raw text (r), in which backslashes are kept as-is instead of being treated as escape sequences. In the end it doesn't matter how a string was created, whether from a raw literal or not: internally it is stored the same way.
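As a small illustration of the point above, here is a sketch (written for Python 3, where every str is Unicode; the variable names are made up for the example) showing that a raw literal and an escaped normal literal end up as the same string:

```python
# Raw and non-raw literals both produce ordinary string objects;
# only the way the parser reads the source text differs.
plain = "a\\d+"   # normal literal, backslash escaped by hand
raw = r"a\d+"     # raw literal, backslash kept as written

same_value = plain == raw                 # both hold the 4 characters a \ d +
same_type = type(plain) is type(raw)      # there is no separate "raw string" type

# The difference only shows up while parsing the literal:
newline = "\n"        # one character (a line feed)
raw_newline = r"\n"   # two characters (backslash, then n)
lengths = (len(newline), len(raw_newline))
```

Once the objects exist, nothing records which kind of literal they came from.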
When you get text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with Unicode content, you should work only with unicode objects internally and convert all str objects to unicode (either with str.decode() or with the u'text' syntax for hard-coded text). If you instead encode it to your local encoding, you will run into problems with characters that encoding cannot represent.
A different approach would be to use Python 3, whose str object supports Unicode directly and stores all text as Unicode, so you only need to think about encodings when converting to or from bytes.
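A minimal sketch of the decoding step, assuming the form data arrives as UTF-8 bytes (the sample bytes and variable names are hypothetical, not taken from any real framework):

```python
# Hypothetical bytes as they might arrive from a form submission.
# Assumption: the client sent UTF-8; the real encoding depends on your setup.
raw_bytes = b"caf\xc3\xa9\\d+"

# Decode once at the boundary, then work with text only.
pattern_text = raw_bytes.decode("utf-8")

# The backslash typed by the user is still a plain backslash character;
# decoding does not interpret "\d" in any way.
has_backslash = "\\" in pattern_text
```

Many web frameworks do this decoding for you, in which case the value you receive is already text and this step can be skipped.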
"r" flags just prevent Python from interpreting "\" in a string. Since the Web doesn't care about what kind of data it carries, your web input will be a bunch of bytes you are free to interpret the way you want.
So to address this problem :
- be sure you use Unicode (e.g. with UTF-8 as the encoding) all along the way
- when you get the string, it will be Unicode and "\n", "\t" and "\a" will be literal backslash-plus-letter sequences, so you don't need to care about whether to escape them or not.
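To tie this back to the question, here is a sketch (Python 3; web_input is a stand-in for a value received from a form, not a real API) showing that text received at runtime behaves like the equivalent raw literal when used as a pattern:

```python
import re

# Stand-in for what a user typed into a form: the characters \ d + $
# It is written with an escaped backslash here only because this is source code.
web_input = "\\d+$"

# The same pattern written by hand as a raw literal.
literal_pattern = r"\d+$"

some_text = "12345"
m_web = re.match(web_input, some_text)
m_lit = re.match(literal_pattern, some_text)

# Match objects do not compare equal with ==, so compare what they matched.
same_match = m_web.group(0) == m_lit.group(0)
```

Note that the asserts in the question compare match objects directly; comparing group(0) (or checking both results for None) is a more reliable way to check that two patterns behave alike.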