How to optionally add a comma and whitespace to a

Posted 2020-04-20 18:47

Question:

I am trying to match five substrings in each block of text (there are 100 blocks total).

I am matching 99% of the blocks of text, but with a few errors regarding groups 3 and 4.

Here is a demo link: https://regex101.com/r/cW2Is3/4

Group 3 is "parts of speech", and group 4 is an English translation.

In the first block of text, "det, pro" should all be in group 3, and "the; him, her, it, them" should be in group 4.

The same issue occurs again in the third block of text.
There, group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".

This is my pattern:

([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*

Answer 1:

Here it is...

/^(\d+) +(\w+) +([acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*) +([\S\s]+?)\n\x{2022} +([\S\s]+?)\n\d+ \| [-\dn]+\s*/gum

Demo Link

I have done my best to optimize the pattern. I shaved nearly 10,000 steps off of your pattern and reached 100 matches as desired.

  • Starting anchor ^ is used to identify start of each block (Efficiency / Accuracy)
  • \d is used instead of [0-9] (Brevity)
  • \s is replaced with a literal space where applicable (Brevity)
  • A character class of specific letters and parentheses was used in place of \w for capture group 3 (Efficiency). It could be replaced with [\w()] for brevity, at some cost in efficiency.
  • The bullet was specified using the literal \x{2022} (Personal preference)
  • A character class, [-\dn], is used for the trailing characters of each block. (Efficiency / Accuracy)
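
The key change for group 3 is the "one item, then zero or more comma-separated items" idiom. Below is a minimal sketch of that idiom in Python's re module, applied to the header line of a block only. The sample lines are hypothetical reconstructions based on the values quoted in the question (the real data is only in the linked demo), and the names TAG, HEADER and split_header are made up for illustration.

```python
import re

# Hypothetical header lines modeled on the values quoted in the question.
SAMPLES = [
    "1 le det, pro the; him, her, it, them",
    "3 un adj, det, nm, pro a, an, one",
    "5 et conj and",
]

# One part-of-speech tag: only the letters (and parentheses) used by the tags.
TAG = r"[acdefijlmnoprtv()]+"

# Group 3 is one tag followed by zero or more ", tag" repeats;
# the space after the comma is optional.
HEADER = re.compile(rf"^(\d+) +(\w+) +({TAG}(?:, ?{TAG})*) +(.+)$")


def split_header(line):
    """Return (number, word, parts_of_speech, translation), or None if no match."""
    m = HEADER.match(line)
    return m.groups() if m else None


for sample in SAMPLES:
    print(split_header(sample))
```

Because the repeat requires a comma immediately after each tag, group 3 stops at the first tag that is followed by a space, and the rest of the line falls into group 4.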


Answer 2:

When you have to describe a long string with many parts, the first reflex should be to use free-spacing mode (the x modifier) and named groups (even if named groups aren't very useful in a replacement context, they make the pattern more readable and easier to debug):

~^
(?<No> [0-9]+ )  \h+
(?<word> \pL+ )  \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* )  \h+
(?<wd_tr> [^•]* [^•\s] )  \h* \R

• \h*
(?<sent_fr> [^–]* [^\s–] )   \s* – \s*
(?<sent_eng> .* (?:\R .*)*? )  \h* \R

(?<num1> [0-9]+ )  \h* \| \h*
(?<num2> .*\S )
~xum

demo
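
For readers working outside PCRE, here is a rough sketch of the same free-spacing, named-group layout in Python. It is an adaptation under assumptions, not the pattern above verbatim: Python's re module has no \h, \R or \pL, so [ \t], \n and [^\W\d_] stand in for them, and named groups are written (?P<name>...). The sample text is hypothetical, since the real blocks are only in the linked demo.

```python
import re

# Hypothetical two-block sample modeled on the values quoted in the question.
SAMPLE = (
    "1 le det, pro the; him, her, it, them\n"
    "• Je le vois – I see him\n"
    "1 | 2500\n"
    "3 un adj, det, nm, pro a, an, one\n"
    "• un jour – one day\n"
    "3 | 1800\n"
)

# re.VERBOSE ignores whitespace outside character classes, so the pattern
# can be laid out one field per line, as in free-spacing mode.
BLOCK = re.compile(r"""
    ^
    (?P<no>       [0-9]+ )                          [ \t]+
    (?P<word>     [^\W\d_]+ )                       [ \t]+
    (?P<type>     (?: [^\W\d_] | [()] )+
                  (?: , [ \t]* (?: [^\W\d_] | [()] )+ )* )   [ \t]+
    (?P<wd_tr>    [^•]* [^•\s] )                    [ \t]* \n
    • [ \t]*
    (?P<sent_fr>  [^–]* [^\s–] )                    \s* – \s*
    (?P<sent_eng> .* (?: \n .* )*? )                [ \t]* \n
    (?P<num1>     [0-9]+ )                          [ \t]* \| [ \t]*
    (?P<num2>     .* \S )
""", re.VERBOSE | re.MULTILINE)


def parse_blocks(text):
    """Return one dict of named fields per matched block."""
    return [m.groupdict() for m in BLOCK.finditer(text)]


for row in parse_blocks(SAMPLE):
    print(row["no"], row["type"], "|", row["wd_tr"])
```

Each field can then be read by name (for example row["type"]) instead of by position, which makes it easier to see which part of the pattern misbehaves when a block fails to match.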

There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the beginning and add flexibility when you encounter cases that don't match.