Regex for non-consecutive upper-case words PART DE

Posted 2019-08-24 03:52

Question:

Many thanks to all those who answered part 1 of this question (see here).

The regex that worked for me was

(?<![A-Z]\s)\b[A-Z]+\b(?!\s[A-Z])

The question now is how to do the inverse, i.e. given the string

This is a different sentence WITH a few CAPITAL WORDS here AND THERE ACROSS multiple LINES.

How can I match "CAPITAL WORDS" and "AND THERE ACROSS" but not "WITH" or "LINES"? The latter two are isolated, with lower-case words on either side, or they sit at the start or the end of a sentence.
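To make the goal concrete, here is a minimal sketch that runs the part 1 pattern on the sample sentence and states the output wanted from the inverse pattern. The variable names (`sentence`, `isolated`, `wanted`) are only illustrative and do not come from the original thread.

```python
import re

sentence = ("This is a different sentence WITH a few CAPITAL WORDS here "
            "AND THERE ACROSS multiple LINES.")

# Pattern from part 1: an upper-case word with no upper-case neighbour.
isolated = re.compile(r'(?<![A-Z]\s)\b[A-Z]+\b(?!\s[A-Z])')

print(isolated.findall(sentence))  # ['WITH', 'LINES']

# Target for the inverse pattern (stated as data, not yet produced by a regex).
wanted = ['CAPITAL WORDS', 'AND THERE ACROSS']
```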

I tried changing the lookarounds from negative to positive and altering the [A-Z] to [a-z], but again failed spectacularly.

Any help would be much appreciated once again.

Answer 1:

At least two consecutive upper-case words:

 [A-Z]{2,}(?:\s+[A-Z]{2,})+

 [A-Z]{2,}           # first word (at least two letters)
 (?:                 # do not capture this group
    \s+[A-Z]{2,}     #   whitespace followed by another word
 )+                  # one or more repetitions of the group

In [52]: re.findall(r'[A-Z]{2,}(?:\s+[A-Z]{2,})+', 'CAPITAL Words This is a different sentence WITH a few CAPITAL\nWORDS here AND THERE ACROSS multiple LINES.')
Out[52]: ['CAPITAL\nWORDS', 'AND THERE ACROSS']
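One caveat: the pattern above has no word boundaries, so it can also match the upper-case part of a mixed-case token (for example "AND THE" inside "AND THEre"). The sketch below is a possible variant, not part of the original answer, that adds `\b` anchors and uses `re.VERBOSE` so the comments can live inside the pattern. The name `consecutive` is only illustrative.

```python
import re

# Variant of the answer's pattern with \b anchors added (an assumption about
# the desired behaviour, not something the original answer specifies).
consecutive = re.compile(r"""
    \b[A-Z]{2,}\b          # first word: two or more upper-case letters
    (?:                    # non-capturing group
        \s+\b[A-Z]{2,}\b   # whitespace, then another upper-case word
    )+                     # repeated one or more times
""", re.VERBOSE)

text = ("This is a different sentence WITH a few CAPITAL\nWORDS here "
        "AND THERE ACROSS multiple LINES.")
print(consecutive.findall(text))  # ['CAPITAL\nWORDS', 'AND THERE ACROSS']

# A mixed-case token no longer contributes a partial match.
print(consecutive.findall("AND THEre"))  # []
```

Whether the stricter behaviour is wanted depends on the input; for text made up only of whole upper-case or whole lower-case words, both versions return the same matches.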