Question:

I'd like to create a (PCRE) regular expression to match all commonly used numbered lists, and I'd like to share my thoughts and gather input on ways to do this.

I've defined 'lists' as the set of canonical Anglo-Saxon conventions, i.e.

Numbers

1 2 3
1. 2. 3.
1) 2) 3)
(1) (2) (3)
1.1 1.2 1.2.1
1.1. 1.2. 1.3.
1.1) 1.2) 1.3)
(1.1) (1.2) (1.3)

Letters

a b c
a. b. c.
a) b) c)
(a) (b) (c) 
A B C
A. B. C. 
A) B) C)
(A) (B) (C)

Roman numerals

i ii iii
i. ii. iii.
i) ii) iii)
(i) (ii) (iii)
I II III
I. II. III.
I) II) III)
(I) (II) (III)

I'd like to know how complete this set of lists is, whether other numbering conventions should be included, and whether any of these ought to be removed.

Here's a regular expression I've created to solve this problem (in Python):

import re

numex = (r'(?:\d{1,3}'           # 1, 2, 3
         r'(?:\.\d{1,3}){0,4}'   # 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'          # A. B. C.
         r'|[ivxcl]{1,6})')      # i, iii, ...

rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I) # re.U?

rex.match("123. Some paragraph")

I'd like to know how adequate this regex is for this problem, and whether there are alternative solutions (regex or otherwise).

Incidentally, for my particular use-case, I wouldn't expect list numbers of more than 25-50.

Thank you for reading.

Brian

Answer 1:

I'd change at least one thing: add word boundary anchors around your regex. Otherwise it will match every single letter in any text:

rex = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I|re.M)

This helps a little, but of course any one- or two-letter word will still be matched.
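As a rough illustration of the difference, the sketch below runs both patterns over a plain sentence. It assumes the question's `numex` with its missing closing parenthesis restored; the names `loose` and `bounded` are made up for this example.

```python
import re

# The question's pattern, with the closing parenthesis restored (assumption).
numex = (r'(?:\d{1,3}(?:\.\d{1,3}){0,4}'  # 1, 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'                   # A. B. C.
         r'|[ivxcl]{1,6})')               # i, iii, ...

# Without word boundaries: every run of one or two letters is a hit.
loose = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)

# With word boundaries: only whole one- or two-letter words are hits.
bounded = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I | re.M)

text = "Hello to you"
loose_hits = loose.findall(text)      # ['He', 'll', 'o', 'to', 'yo', 'u']
bounded_hits = bounded.findall(text)  # ['to']
print(loose_hits, bounded_hits)
```

The word "to" survives the boundary check, which is the remaining weakness described above.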

You might want to anchor the search at the start of the line; after all, these characters should be the first thing on the line (except maybe whitespace). A negative lookbehind won't work in Python because Python's re module doesn't support variable-length lookbehind, so you could add this outside the matching parentheses:

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I|re.M)

Of course, now you must look at the match object's group(1) to only get the actual match and not the leading whitespace.

You will still match too much (e.g. sentences starting with "I thought so" or "It was a dark and stormy night"), but your rules allow this, and I think you're aware of it.
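Here is a minimal sketch of the anchored pattern in use, showing both the group(1) extraction and the remaining false positives. It assumes the question's `numex` with the closing parenthesis restored; `list_marker` is a hypothetical helper, not part of the original code.

```python
import re

# The question's pattern, with the closing parenthesis restored (assumption).
numex = (r'(?:\d{1,3}'            # 1, 2, 3
         r'(?:\.\d{1,3}){0,4}'    # 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'           # A. B. C.
         r'|[ivxcl]{1,6})')       # i, iii, ...

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I | re.M)

def list_marker(line):
    """Return the list marker at the start of line, or None (hypothetical helper)."""
    m = rex.match(line)
    # group(1) excludes the leading whitespace consumed by ^\s*
    return m.group(1) if m else None

print(list_marker("  1.2) Some text"))     # 1.2)
print(list_marker("(iv) item"))            # (iv)
print(list_marker("123. Some paragraph"))  # 123.
print(list_marker("Hello world"))          # None
print(list_marker("It was a dark and stormy night"))  # It  (false positive)
```

The last call shows the kind of over-matching mentioned above: "It" is a two-letter word at the start of a line, so the rules accept it as a marker.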

Answer 2:

Here's a Wikified solution:

numex = r"""(?:
      \d{1,3}                 # 1, 2, 3
          (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
    | [B-H] | [J-Z]           # B - Z except I (capital letters cap at 26)
    | [AI](?!\s)              # Note: "A" and "I" can properly start non-lists
    | [a-z]                   # a - z
    | [ivxcl]{1,6}            # Roman ii, etc.
    | [IVXCL]{1,6}            # Roman IV, etc.
    )
    """

rex = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)'
                 % (numex, numex), re.X)

Additions, changes and suggestions most welcome.
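For anyone who wants to try it, here is a small sketch that exercises the pattern on a few lines. It assumes `numex` carries no `^` of its own, since the outer pattern already anchors at the start of the line and an inner `^` would stop markers such as "(iv)" from matching. `split_item` is a hypothetical helper added for the example.

```python
import re

numex = r"""(?:
      \d{1,3}                 # 1, 2, 3
          (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
    | [B-H] | [J-Z]           # capitals except A and I
    | [AI](?!\s)              # "A" and "I" only when not followed by whitespace
    | [a-z]                   # a - z
    | [ivxcl]{1,6}            # Roman ii, etc.
    | [IVXCL]{1,6}            # Roman IV, etc.
    )
    """

rex = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)' % (numex, numex), re.X)

def split_item(line):
    """Return (marker, text) for a list line, or None (hypothetical helper)."""
    m = rex.match(line)
    return (m.group(1), m.group(2)) if m else None

print(split_item("1. First item"))  # ('1.', 'First item')
print(split_item("(iv) Fourth"))    # ('(iv)', 'Fourth')
print(split_item("1.2.1 Nested"))   # ('1.2.1', 'Nested')
print(split_item("A dark night"))   # None
print(split_item("I thought so"))   # ('I', 'thought so')
```

Note the last result: the `[AI](?!\s)` guard rejects a bare "A", but a bare "I" still gets through because the capital Roman alternative `[IVXCL]{1,6}` also accepts it. Dropping `I` from that alternative would close the gap at the cost of missing the Roman numeral I.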