I have a string that I want to cut into substrings, each consisting of a number-containing word surrounded by up to 4 words on either side. If the substrings overlap, they should be combined.
Sampletext = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
re.findall('(\s[*\s]){1,4}\d(\s[*\s]){1,4}', Sampletext)
desired output = ['the way I know 54 how to take praise', 'to take praise for 65 excellent questions 34 thank you for asking']
Overlapping Matches: Use Lookaheads
To find matches that overlap, wrap the whole pattern in a lookahead and capture inside it:
import re
subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# The lookahead consumes no characters, so matches may overlap; group 1 holds the matched text.
for match in re.finditer(r"(?=((?:\b\w+\b ){4}\d+(?: \b\w+\b){4}))", subject):
    print(match.group(1))
What is a Word?
The output depends on your definition of a word. Here a word is \w+, which allows digits, so a number such as 34 can also count as one of the four words around another number. This produces the following output.
Output (allowing digits in words)
the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank
for 65 excellent questions 34 thank you for asking
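To see why the lookahead matters, here is a small sketch that runs the same pattern with and without it on the sample string. The variable names (core, plain, overlapping) are only for illustration. Without the lookahead each match consumes its characters, so the window around 65 is skipped because it starts inside the first match.

```python
import re

subject = ("by the way I know 54 how to take praise for 65 excellent "
           "questions 34 thank you for asking appreciated.")

# Same pattern as above, without the surrounding lookahead and capture group.
core = r"(?:\b\w+\b ){4}\d+(?: \b\w+\b){4}"

# Consuming matches: the search resumes after the end of each match.
plain = re.findall(core, subject)

# Lookahead matches: zero-width, so the search resumes one character later
# and overlapping windows are all reported through group 1.
overlapping = re.findall(r"(?=(" + core + r"))", subject)

print(len(plain), plain)
print(len(overlapping), overlapping)
```

The first call finds two windows and the second finds three, the same three lines listed above.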
Option 2: No digits in Words
subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
for match in re.finditer(r"(?=((?:\b[a-z]+\b ){4}\d+(?: \b[a-z]+\b){4}))", subject, re.IGNORECASE):
    print(match.group(1))
Output 2
the way I know 54 how to take praise
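Only one window survives here because 34 and 65 no longer qualify as words, so neither of them has four words on both sides. A quick sketch of the two word definitions, using made-up tokens and hypothetical helper names:

```python
import re

def is_word_with_digits(token):
    # Option 1 definition: \w+ accepts letters, digits and underscore.
    return re.fullmatch(r"\w+", token) is not None

def is_word_letters_only(token):
    # Option 2 definition: ASCII letters only, case-insensitive.
    return re.fullmatch(r"[a-z]+", token, re.IGNORECASE) is not None

tokens = ["thank", "34", "B2B"]
with_digits = [t for t in tokens if is_word_with_digits(t)]
letters_only = [t for t in tokens if is_word_letters_only(t)]

print(with_digits)   # all three tokens
print(letters_only)  # only "thank"
```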
Option 3: extending to four uninterrupted non-digit words
Based on your comments, this option extends to the left and right of the pivot number until four uninterrupted non-digit words are matched on each side. Commas between words are tolerated.
subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated. One Two Three Four 55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF"
for match in re.finditer(r"(?=((?:\b[a-z]+[ ,]+){4}(?:\d+ (?:[a-z]+ ){1,3}?)*?\d+.*?(?:[ ,]+[a-z]+){4}))", subject, re.IGNORECASE):
    print(match.group(1))
Output 3
the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank you for asking
One Two Three Four 55 Extend 66 a b c d
AA BB CC DD 71 HH DD, JJ FF
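If a regex is not a hard requirement, the same windows can be built by splitting on whitespace. This is a minimal sketch of a different technique, not the regex above. The function number_windows and its width parameter are hypothetical names. It assumes a word is any whitespace-separated token, and it chains two numbers into one window when fewer than width words separate them. Unlike the regexes, it accepts fewer than four words at the edges of the string, which is the "up to 4" from the question.

```python
def number_windows(text, width=4):
    words = text.split()
    # Indices of the words that contain at least one digit.
    hits = [i for i, w in enumerate(words) if any(c.isdigit() for c in w)]

    groups = []  # each entry is [first_number_index, last_number_index]
    for i in hits:
        # Chain onto the previous group when fewer than `width` words
        # lie between this number and that group's last number.
        if groups and i - groups[-1][1] - 1 < width:
            groups[-1][1] = i
        else:
            groups.append([i, i])

    # Pad each group with up to `width` words on both sides.
    return [" ".join(words[max(first - width, 0): last + width + 1])
            for first, last in groups]


subject = ("by the way I know 54 how to take praise for 65 excellent "
           "questions 34 thank you for asking appreciated.")
print(number_windows(subject))
```

On the sample string this returns the two strings listed as the desired output in the question.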