How to make a group for each word in a sentence?

Posted 2019-06-25 13:06

Question:

This may be a silly question but...

Say you have a sentence like:

The quick brown fox

Or you might get a sentence like:

The quick brown fox jumped over the lazy dog

The simple regexp (\w*) finds the first word "The" and puts it in a group.

For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.

Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.

I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.

Thanks for any insight you may have.

Edit: For completeness's sake here's my actual use case and how I solved it using your help. Thanks again.

>>> import re
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)
>>> print(s)
5 test1 5 test2 5 test3 5 test4 5 test5
>>> words = re.findall(r'\d+\s(\w+)', s)
>>> print(words)
['test1', 'test2', 'test3', 'test4', 'test5']

Answer 1:

I don't believe that it is possible. A regex pairs each capture with a set of parentheses in the pattern, so the number of groups is fixed when the pattern is written. If you list only one repeated group, like '((\w+)\s+){0,99}', each repetition overwrites the same first and second group; no new groups are created for later matches, and only the text from the last repetition is kept.
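
To illustrate the point, here is a minimal sketch (the variable names are only for illustration) showing that a repeated capture group in Python's re module keeps just the text from its last repetition:

```python
import re

sentence = "The quick brown fox"

# One capturing group inside a repeated non-capturing group.
# The group matches four times, but each repetition overwrites the previous one.
m = re.match(r"(?:(\w+)\s*)+", sentence)

last_only = m.groups()  # a 1-tuple: only the final repetition survives
print(last_only)        # ('fox',)
```

The overall match still covers the whole sentence; it is only the per-word captures that are lost.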

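You could use str.split with an explicit separator, but that splits only on one fixed string, not on a class of characters like whitespace. (Called with no arguments, str.split() does split on any run of whitespace.)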

Instead, you can use re.split, which splits on a regular expression. A pattern of '\s' matches a single whitespace character; you probably want '\s+' so that each run of whitespace is treated as one separator.

>>> import re
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> re.split(r'\s+', 'The   quick brown\t fox')
['The', 'quick', 'brown', 'fox']
>>>
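
One caveat worth sketching, assuming the input may start or end with whitespace (the sample string here is made up): re.split then returns empty strings at the edges, which stripping the input first avoids.

```python
import re

text = "  The   quick brown\t fox \n"

# Splitting on whitespace runs leaves empty strings at the edges
# when the input begins or ends with whitespace.
raw_parts = re.split(r"\s+", text)
print(raw_parts)    # ['', 'The', 'quick', 'brown', 'fox', '']

# Stripping the string first removes those empty edge entries.
clean_parts = re.split(r"\s+", text.strip())
print(clean_parts)  # ['The', 'quick', 'brown', 'fox']
```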


Answer 2:

You can also use the function findall in the module re

>>> import re
>>> re.findall(r"\w+", "The quick brown fox")
['The', 'quick', 'brown', 'fox']
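
As a rough sketch of how this extends to patterns with a capturing group (the sample string is shortened from the one in the question): findall returns the group's text rather than the whole match, and re.finditer yields match objects if positions are also needed.

```python
import re

s = "5 test1 5 test2 5 test3"

# With no capturing group, findall returns each whole match.
whole = re.findall(r"\d+\s\w+", s)
print(whole)  # ['5 test1', '5 test2', '5 test3']

# With one capturing group, findall returns only that group's text.
names = re.findall(r"\d+\s(\w+)", s)
print(names)  # ['test1', 'test2', 'test3']

# finditer yields match objects, so the start offset of each group is available too.
spans = [(m.group(1), m.start(1)) for m in re.finditer(r"\d+\s(\w+)", s)]
print(spans)  # [('test1', 2), ('test2', 10), ('test3', 18)]
```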


Answer 3:

Why use a regex when str.split() does the same thing?

>>> "The quick brown fox".split()
['The', 'quick', 'brown', 'fox']
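
A small sketch of how the no-argument form differs from splitting on an explicit separator, using a made-up sentence with irregular spacing:

```python
sentence = "The   quick brown\t fox"

# No argument: split on any run of whitespace (spaces, tabs, newlines).
words = sentence.split()
print(words)     # ['The', 'quick', 'brown', 'fox']

# Explicit separator: split on exactly that string, so repeated spaces
# produce empty strings and the tab stays attached to 'brown'.
by_space = sentence.split(" ")
print(by_space)  # ['The', '', '', 'quick', 'brown\t', 'fox']
```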


Answer 4:

Regular expressions can't capture into an unknown number of groups. But there is hope in your case: look into the 'split' method, which should do what you need.