I am trying to match five substrings in each block of text (there are 100 blocks total).
I am matching 99% of the blocks of text, but with a few errors regarding groups 3 and 4.
Here is a demo link: https://regex101.com/r/cW2Is3/4
Group 3 is "parts of speech", and group 4 is an English translation.
In the first block of text, "det, pro" should all be in group 3, and "the; him, her, it, them" should be in group 4.
The same issue occurs again in the third block of text.
Group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".
This is my pattern:
([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*
Here you go...
/^(\d+) +(\w+) +([acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*) +([\S\s]+?)\n\x{2022} +([\S\s]+?)\n\d+ \| [-\dn]+\s*/gum
Demo Link
I have done my best to optimize the pattern. I shaved nearly 10,000 steps off of your pattern and reached 100 matches as desired.
- The starting anchor ^ is used to identify the start of each block (Efficiency / Accuracy)
- \d is used instead of [0-9] (Brevity)
- \s is replaced with a literal space where applicable (Brevity)
- A character class of specific letters and parentheses is used in place of \w for capture group 3 (Efficiency). It could be replaced with [\w()] for brevity, with a loss of efficiency.
- The bullet is specified using the literal \x{2022} (Personal preference)
- A character class, [-\dn], is used on the trailing characters of each block (Efficiency / Accuracy)
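For anyone who wants to try the pattern above outside of regex101, here is a minimal sketch of a Python port. It is not part of the original answer: the two sample blocks are made up to mimic the layout described in the question, and the bullet is written as \u2022 because Python's re module does not accept the \x{2022} syntax.

```python
import re

# Python port of the pattern above. Python's re module has no \x{2022}
# syntax, so the bullet is written as \u2022; str patterns are Unicode-aware
# by default, so only the multiline flag is needed.
PATTERN = re.compile(
    r"^(\d+) +(\w+) +([acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*) +"
    r"([\S\s]+?)\n\u2022 +([\S\s]+?)\n\d+ \| [-\dn]+\s*",
    re.MULTILINE,
)

# Hypothetical sample blocks, invented to mimic the layout in the question.
# \u2022 is the bullet and \u2013 is the en dash between the two sentences.
SAMPLE = (
    "1 le det, pro the; him, her, it, them\n"
    "\u2022 vive la politique \u2013 long live politics\n"
    "1 | 2359662\n"
    "3 un adj, det, nm, pro a, an, one\n"
    "\u2022 il a un chien \u2013 he has a dog\n"
    "3 | 1371925\n"
)

matches = list(PATTERN.finditer(SAMPLE))
for m in matches:
    # Group 3 is the parts of speech, group 4 is the English translation.
    print(m.group(1), "|", m.group(3), "|", m.group(4))
```

On this made-up input, group 3 stops at the last comma-separated part of speech because the translation is separated from it by a space rather than a comma.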
When you have to describe a long string with many parts, the first reflex is to use the free-spacing mode (x modifier) and named groups (even if named groups aren't very useful in a replacement context, they help to make the pattern readable and easier to debug):
~^
(?<No> [0-9]+ ) \h+
(?<word> \pL+ ) \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* ) \h+
(?<wd_tr> [^•]* [^•\s] ) \h* \R
• \h*
(?<sent_fr> [^–]* [^\s–] ) \s* – \s*
(?<sent_eng> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+ ) \h* \| \h*
(?<num2> .*\S )
~xum
demo
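The same idea carries over to other regex flavours. As a rough sketch, here is how the free-spacing pattern might look in Python with re.VERBOSE. The translation is an assumption, not the original author's code: Python's re module has no \h, \R or \pL, so they are approximated by [ \t], \n (assuming newlines are already normalized) and [^\W\d_]. The sample text is hypothetical.

```python
import re

# Free-spacing (verbose) pattern with named groups.
# Assumed substitutions for PCRE-only escapes:
#   \h  -> [ \t]      (horizontal whitespace)
#   \R  -> \n         (input assumed to use \n line endings)
#   \pL -> [^\W\d_]   (any Unicode letter)
BLOCK = re.compile(
    r"""
    ^
    (?P<No>       [0-9]+ ) [ \t]+
    (?P<word>     [^\W\d_]+ ) [ \t]+
    (?P<type>     (?:[^\W\d_]|[()])+ (?: , [ \t]* (?:[^\W\d_]|[()])+ )* ) [ \t]+
    (?P<wd_tr>    [^\u2022]* [^\u2022\s] ) [ \t]* \n
    \u2022 [ \t]*
    (?P<sent_fr>  [^\u2013]* [^\s\u2013] ) \s* \u2013 \s*
    (?P<sent_eng> .* (?: \n .*)*? ) [ \t]* \n
    (?P<num1>     [0-9]+ ) [ \t]* \| [ \t]*
    (?P<num2>     .*\S )
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical sample blocks, invented to mimic the layout in the question.
SAMPLE = (
    "1 le det, pro the; him, her, it, them\n"
    "\u2022 vive la politique \u2013 long live politics\n"
    "1 | 2359662\n"
    "3 un adj, det, nm, pro a, an, one\n"
    "\u2022 il a un chien \u2013 he has a dog\n"
    "3 | 1371925\n"
)

# Each match becomes a dict keyed by group name, which is easy to inspect.
records = [m.groupdict() for m in BLOCK.finditer(SAMPLE)]
for rec in records:
    print(rec["No"], rec["type"], "->", rec["wd_tr"])
```

With named groups, a failing block can be debugged by shortening the pattern one named part at a time until it matches again.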
There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the beginning and add flexibility when you encounter cases that don't match.
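One possible way to apply this advice in practice is to list the stretches of input that a strict pattern fails to cover, then loosen the pattern only as far as those cases require. The sketch below is an illustration only: the helper name unmatched_chunks, both patterns, and the three sample lines are hypothetical and much simpler than the real data.

```python
import re


def unmatched_chunks(pattern, text):
    """Return the non-blank stretches of text that no match covers."""
    leftovers = []
    pos = 0
    for m in pattern.finditer(text):
        gap = text[pos:m.start()].strip()
        if gap:
            leftovers.append(gap)
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        leftovers.append(tail)
    return leftovers


# Hypothetical, simplified header lines: number, word, part(s) of speech, translation.
TEXT = "1 le det the\n2 de prep of\n3 un adj, det a"

# Strict first attempt: exactly one part of speech is allowed.
STRICT = re.compile(r"^(\d+) (\w+) ([a-z]+) (.+)$", re.MULTILINE)

# Relaxed version: a comma-separated list of parts of speech is allowed.
RELAXED = re.compile(r"^(\d+) (\w+) ([a-z]+(?:, ?[a-z]+)*) (.+)$", re.MULTILINE)

strict_leftovers = unmatched_chunks(STRICT, TEXT)
relaxed_leftovers = unmatched_chunks(RELAXED, TEXT)
print("strict pattern misses:", strict_leftovers)
print("relaxed pattern misses:", relaxed_leftovers)
```

The leftovers from the strict pattern show exactly which lines need the extra flexibility, so the pattern is never loosened more than the data demands.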