I am trying to match five substrings in each block of text (there are 100 blocks total).
I am matching 99% of the blocks of text, but with a few errors regarding groups 3 and 4.
Here is a demo link: https://regex101.com/r/cW2Is3/4
Group 3 is "parts of speech", and group 4 is an English translation.
In the first block of text, "det, pro" should all be in group 3, and then "the; him, her, it, them" should be in group 4.
The same issue occurs again in the third block of text.
Group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".
This is my pattern:
([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*
Here you go...
Demo Link
I have done my best to optimize the pattern. I shaved nearly 10,000 steps off of your pattern and reached 100 matches as desired.
^ is used to identify the start of each block (Efficiency / Accuracy)
\d is used instead of [0-9] (Brevity)
\s is replaced with a literal space where applicable (Brevity)
\w for capture group 3. (Efficiency) *Could be replaced with [\w()] for brevity with a loss of efficiency
\x{2022} (Personal preference)
[-\dn] (Efficiency / Accuracy)

When you have to describe a long string with many parts, the first reflex should be to use free-spacing mode (the x modifier) and named groups. Even if named groups aren't very useful in a replacement context, they help make the pattern readable and easier to debug:
demo
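As a minimal sketch of the free-spacing and named-group idea, here is a version in Python (`re.VERBOSE`) rather than PCRE. The sample text, the block layout, and the closed list of tags (adj, det, nm, pro) are assumptions made for illustration, based only on the values mentioned in the question; the real data will need more tags and probably more flexibility.

```python
import re

# Hypothetical sample: two blocks in the layout assumed from the question.
SAMPLE = (
    "1 o det, pro the; him, her, it, them\n"
    "\u2022 first example sentence\n"
    "1 | 100\n"
    "3 um adj, det, nm, pro a, an, one\n"
    "\u2022 second example sentence\n"
    "3 | 300\n"
)

# Free-spacing pattern: whitespace is ignored, so literal spaces are
# written as [ ]. The tag list is an assumed, incomplete set.
ENTRY = re.compile(
    r"""
    ^(?P<rank>\d+) [ ]+                      # entry number at line start
    (?P<word>\w+(?:,[ ]\w+)?) [ ]+           # headword, optional variant
    (?P<pos>
        (?:adj|det|nm|pro)                   # first part-of-speech tag
        (?:,[ ](?:adj|det|nm|pro))*          # further tags after a comma
    ) [ ]+
    (?P<gloss>[^\n]+) \n                     # English translation
    \u2022 [ ]+ (?P<example>[^\n]+) \n       # bullet line with the example
    \d+ [ ] \|                               # trailing number and pipe
    """,
    re.VERBOSE | re.MULTILINE,
)

entries = [m.groupdict() for m in ENTRY.finditer(SAMPLE)]
for entry in entries:
    print(entry["rank"], "|", entry["pos"], "|", entry["gloss"])
```

Because the part-of-speech group only accepts known tags separated by a comma and a space, it cannot spill into the translation, which is the failure described for groups 3 and 4.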
There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the beginning and add flexibility when you encounter cases that don't match.
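One way to work like this, sketched in Python: start with a strict pattern, list the lines it rejects, and loosen only the part that caused the failure. The header lines, both patterns, and the helper name `unmatched` are hypothetical and exist only to illustrate the loop.

```python
import re

# Strict first draft: exactly one lowercase tag (an assumption for the demo).
STRICT = re.compile(r"\d+ \w+ [a-z]+ [^\n]+")
# Loosened after inspecting the failures: a comma-separated list of tags.
LOOSER = re.compile(r"\d+ \w+ [a-z]+(?:, [a-z]+)* [^\n]+")

# Made-up header lines in the assumed "number word tags translation" layout.
HEADERS = [
    "2 de prep of, from",
    "1 o det, pro the; him, her, it, them",
]


def unmatched(pattern, lines):
    """Return the lines that the pattern does not match in full."""
    return [line for line in lines if not pattern.fullmatch(line)]


print(unmatched(STRICT, HEADERS))  # lines that still need attention
print(unmatched(LOOSER, HEADERS))  # empty once the pattern is flexible enough
```

A strict pattern fails loudly on the multi-tag line instead of silently putting text in the wrong group, so each rejected line shows where flexibility is needed.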