Convert command line arguments to regular expression

Posted 2019-03-25 01:06

Question:

Say, for example, I want to know whether the pattern "\section" is in the text "abcd\sectiondefghi". Of course, I can do this:

import re

motif = r"\\section"
txt = r"abcd\sectiondefghi"
pattern = re.compile(motif)
print(pattern.findall(txt))

That gives me what I want. However, each time I want to find a new pattern in a new text, I have to change the code, which is painful. Therefore, I want to write something more flexible, like this (test.py):

import re
import sys

motif = sys.argv[1]
txt = sys.argv[2]
pattern = re.compile(motif)
print(pattern.findall(txt))

Then I want to run it in a terminal like this:

python test.py \\section abcd\sectiondefghi

However, that will not work (I hate to use \\\\section).

So, is there any way of converting my user input (either from the terminal or from a file) to a Python raw string? Or is there a better way of compiling a regular expression pattern from user input?

Thank you very much.

Answer 1:

Use re.escape() to make sure input text is treated as literal text in a regular expression:

pattern = re.compile(re.escape(motif))

Demo:

>>> import re
>>> motif = r"\section"
>>> txt = r"abcd\sectiondefghi"
>>> pattern = re.compile(re.escape(motif))
>>> print(pattern.findall(txt))
['\\section']

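re.escape() escapes non-alphanumeric characters by adding a backslash in front of each one (the output below is from Python 2; since Python 3.7 only characters that can have special meaning in a regular expression are escaped, so the '!' below would be left alone):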

>>> re.escape(motif)
'\\\\section'
>>> re.escape('\n [hello world!]')
'\\\n\\ \\[hello\\ world\\!\\]'
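
As a rough sketch of how the script from the question might look with this fix applied (written for Python 3; the helper name find_literal is made up for illustration, and the raw strings stand in for values that would normally arrive via sys.argv):

```python
import re


def find_literal(motif, txt):
    """Return every occurrence of the literal text `motif` in `txt`."""
    # re.escape() neutralises regex metacharacters, so a user-supplied
    # backslash is matched as a backslash instead of starting an escape.
    pattern = re.compile(re.escape(motif))
    return pattern.findall(txt)


# Raw strings here stand in for sys.argv[1] and sys.argv[2].
matches = find_literal(r"\section", r"abcd\sectiondefghi")
print(matches)
```

This prints ['\\section'], the same result as the demo above.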


Answer 2:

One way to do this is using an argument parser, like optparse or argparse.

Your code would look something like this:

import re
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-s", "--string", dest="string",
                  help="The string to parse")
parser.add_option("-r", "--regexp", dest="regexp",
                  help="The regular expression")
parser.add_option("-a", "--action", dest="action", default='findall',
                  help="The action to perform with the regexp")

(options, args) = parser.parse_args()

print(getattr(re, options.action)(re.escape(options.regexp), options.string))

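An example of me using it (this output is what you get when the pattern is passed to re as-is, without the re.escape() call; with re.escape() the parentheses and \S are matched literally and the result is []):

> code.py -s "this is a string" -r "this is a (\S+)"
['string']

Using your example (here the re.escape() call is what makes "\section" match literally):

> code.py -s "abcd\sectiondefghi" -r "\section"
['\\section']
# remember, this is a Python list containing a string, so the extra \ is okay.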
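
For comparison, here is a minimal sketch of the same idea with argparse, the other parser named above. It is written for Python 3 and is not the answer's original code: the --literal flag and the function names build_parser and run are assumptions introduced for illustration, so that one script can treat the pattern either as a regular expression or as literal text.

```python
import argparse
import re


def build_parser():
    parser = argparse.ArgumentParser(description="Search a string for a pattern.")
    parser.add_argument("-s", "--string", required=True,
                        help="The string to search")
    parser.add_argument("-r", "--regexp", required=True,
                        help="The pattern to look for")
    # Hypothetical flag: escape the pattern so it is matched literally.
    parser.add_argument("-l", "--literal", action="store_true",
                        help="Treat the pattern as literal text")
    return parser


def run(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and return the matches."""
    options = build_parser().parse_args(argv)
    pattern = re.escape(options.regexp) if options.literal else options.regexp
    return re.findall(pattern, options.string)


# As a script, the last line would simply be: print(run())
print(run(["-s", "this is a string", "-r", r"this is a (\S+)"]))
print(run(["-s", r"abcd\sectiondefghi", "-r", r"\section", "--literal"]))
```

The first call prints ['string'] and the second prints ['\\section'].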


Answer 3:

So just to be clear, is the thing you search for ("\section" in your example) supposed to be a regular expression or a literal string? If the latter, the re module isn't really the right tool for the task; given a search string needle and a target string haystack, you can do:

# is it in there
needle in haystack

# how many copies are there
n = haystack.count(needle)
# where is it
ix = haystack.find(needle)

all of which are more efficient than the regexp-based version.
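
Put together with the strings from the question, the three operations look roughly like this (a small sketch for Python 3; the variable names found, copies and index are just for illustration):

```python
needle = r"\section"
haystack = r"abcd\sectiondefghi"

# Is it in there?
found = needle in haystack

# How many non-overlapping copies are there?
copies = haystack.count(needle)

# Where does the first one start? (-1 would mean "not found")
index = haystack.find(needle)

print(found, copies, index)
```

This prints True 1 4, since the backslash sits right after the four characters "abcd".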

re.escape is still useful if you need to insert a literal fragment into a larger regexp at runtime, but if you end up doing re.compile(re.escape(needle)), there are usually better tools for the task.
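
A sketch of that "literal fragment inside a larger regexp" case, for Python 3. The LaTeX-style scenario and the name find_arguments are assumptions chosen to match the "\section" example, not something from the question:

```python
import re


def find_arguments(command, text):
    """Return the brace-delimited argument after each literal `command`."""
    # Only the user-supplied fragment is escaped; the rest is a real regexp
    # that captures everything between the following { and }.
    pattern = re.compile(re.escape(command) + r"\{([^}]*)\}")
    return pattern.findall(text)


titles = find_arguments(r"\section", r"\section{Intro} text \section{Methods}")
print(titles)
```

This prints ['Intro', 'Methods']. Without re.escape, the "\s" in the fragment would be read as "any whitespace character" and nothing would match.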

EDIT: I'm beginning to suspect that the real issue here is the shell's escaping rules, which have nothing to do with Python or raw strings. That is, if you type:

python test.py \\section abcd\sectiondefghi

into a Unix-style shell, the "\\section" part is converted to "\section" (and "abcd\sectiondefghi" to "abcdsectiondefghi") by the shell, before Python sees it. The simplest way to fix that is to tell the shell to skip unescaping, which you can do by putting each argument inside single quotes:

python test.py '\\section' 'abcd\sectiondefghi'

Compare and contrast:

$ python -c "import sys; print(','.join(sys.argv))" test.py \\section abcd\sectiondefghi
-c,test.py,\section,abcdsectiondefghi

$ python -c "import sys; print(','.join(sys.argv))" test.py '\\section' 'abcd\sectiondefghi'
-c,test.py,\\section,abcd\sectiondefghi

(explicitly using print on a joined string here to avoid repr adding even more confusion...)
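
If you want to see the same effect without leaving Python, the standard shlex module splits a command line using POSIX-style quoting rules. This is only an approximation of what a real shell does, but it is close enough to sketch the difference between the two invocations (Python 3; the variable names are for illustration):

```python
import shlex

unquoted = r"python test.py \\section abcd\sectiondefghi"
quoted = r"python test.py '\\section' 'abcd\sectiondefghi'"

for command_line in (unquoted, quoted):
    # shlex.split() applies backslash and quote processing much like the
    # shell does before the program ever sees its arguments.
    argv = shlex.split(command_line)
    print(",".join(argv[1:]))
```

This prints test.py,\section,abcdsectiondefghi for the unquoted line and test.py,\\section,abcd\sectiondefghi for the single-quoted one.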