How to optionally add a comma and whitespace to a

I am trying to match five substrings in each block of text (there are 100 blocks total).

I am matching 99% of the blocks of text, but with a few errors regarding groups 3 and 4.

Here is a demo link: https://regex101.com/r/cW2Is3/4

Group 3 is "parts of speech", and group 4 is an English translation.

In the first block of text, "det, pro" should all be in group 3, and "the; him, her, it, them" should be in group 4.

The same issue occurs again in the third block of text.
There, group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".

This is my pattern:

([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*
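A reduced reproduction of the problem, sketched in Python's re module rather than PHP. Python does not support \N or \H, so they are replaced here by the rough equivalents [^\n] and [^ \t]. The two header lines are hypothetical reconstructions based on the groups described above (the headwords "le" and "un" are assumptions), and only the first line of each block is modelled.

```python
import re

# Header part of the question's pattern, adapted for Python's re module:
# \N -> [^\n] and \H -> [^ \t] (assumed close enough for this demonstration).
HEADER = re.compile(r"([0-9]+)\s+(\w+(?:, \w+)?)\s+([^\n]+?)\s+([^ \t].+)")

# Hypothetical header lines, rebuilt from the groups described in the question.
samples = [
    "1 le det, pro the; him, her, it, them",
    "3 un adj, det, nm, pro a, an, one",
]


def split_header(line):
    """Return (group 3, group 4) as captured by the adapted pattern."""
    m = HEADER.match(line)
    return m.group(3), m.group(4)


for line in samples:
    print(split_header(line))
```

In this sketch the lazy group 3 stops at the first whitespace it can, so it captures only the first part of speech and its trailing comma. The rest of the list spills into group 4. This is probably what happens in the affected blocks.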

Tags: php regex optional substring capture-group

2 Answers

神经病院院长

Post #2 · 2020-04-20 18:47

Voici...

/^(\d+) +(\w+) +([acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*) +([\S\s]+?)\n\x{2022} +([\S\s]+?)\n\d+ \| [-\dn]+\s*/gum

Demo Link

I have done my best to optimize the pattern. I shaved nearly 10,000 steps off of your pattern and reached 100 matches as desired.

The starting anchor ^ is used to identify the start of each block (efficiency / accuracy).
\d is used instead of [0-9] (brevity).
\s is replaced with a literal space where applicable (brevity).
A character class of specific letters and parentheses was used in place of \w for capture group 3 (efficiency). It could be replaced with [\w()] for brevity, at some cost in efficiency.
The bullet is specified with the escape sequence \x{2022} (personal preference).
A character class, [-\dn], is used for the trailing characters of each block (efficiency / accuracy).
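The group 3 change can be tried on its own. The following is a minimal sketch in Python's re module (not PHP), using only the header-line part of the pattern above. The two sample lines are hypothetical reconstructions from the question's description, not the real data, and the trailing (.+)$ is a simplification for this single-line test.

```python
import re

# Group 3 fragment from the answer: one tag, then any number of ", tag" repeats.
TAGS = r"[acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*"

# Header-only pattern (assumption: the full pattern continues past this line).
HEADER = re.compile(rf"^(\d+) +(\w+) +({TAGS}) +(.+)$", re.MULTILINE)

# Hypothetical header lines, rebuilt from the groups described in the question.
samples = [
    "1 le det, pro the; him, her, it, them",
    "3 un adj, det, nm, pro a, an, one",
]


def split_header(line):
    """Return (parts of speech, translation) for one header line."""
    m = HEADER.match(line)
    return m.group(3), m.group(4)


for line in samples:
    print(split_header(line))
```

A repeat is only consumed when a comma directly follows the previous tag. The list therefore ends at the first tag that is followed by a space, and the translation begins after that space.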


姐就是有狂的资本

Post #3 · 2020-04-20 18:53

When you have to describe a long string with many parts, a good first reflex is to use free-spacing mode (the x modifier) and named groups. Even if named groups aren't very useful in a replacement context, they make the pattern more readable and easier to debug:

~^
(?<No> [0-9]+ )  \h+
(?<word> \pL+ )  \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* )  \h+
(?<wd_tr> [^•]* [^•\s] )  \h* \R

• \h*
(?<sent_fr> [^–]* [^\s–] )   \s* – \s*
(?<sent_eng> .* (?:\R .*)*? )  \h* \R

(?<num1> [0-9]+ )  \h* \| \h*
(?<num2> .*\S )
~xum

demo
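For readers who want to experiment outside PHP, here is a rough port of the same free-spacing layout to Python's re module. It is a sketch under assumptions: \h becomes [ \t], \R becomes \r?\n, \pL is approximated by [^\W\d_], and named groups use Python's (?P<name>...) syntax. The sample block is invented to fit the format and is not taken from the real data.

```python
import re

# Free-spacing (re.VERBOSE) port of the pattern, one named group per field.
BLOCK = re.compile(
    r"""
    ^
    (?P<No>       [0-9]+ )                                   [ \t]+
    (?P<word>     [^\W\d_]+ )                                [ \t]+
    # a letter or a parenthesis, approximating [\pL()]
    (?P<type>     (?: [^\W\d_] | [()] )+
                  (?: , [ \t]* (?: [^\W\d_] | [()] )+ )* )   [ \t]+
    (?P<wd_tr>    [^•]* [^•\s] )                             [ \t]* \r?\n

    • [ \t]*
    (?P<sent_fr>  [^–]* [^\s–] )                             \s* – \s*
    (?P<sent_eng> .* (?: \r?\n .* )*? )                      [ \t]* \r?\n

    (?P<num1>     [0-9]+ )                                   [ \t]* \| [ \t]*
    (?P<num2>     .*\S )
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical block: header line, bullet line, trailing "number | number" line.
sample = (
    "1 le det, pro the; him, her, it, them\n"
    "• il y a du pain – there is bread\n"
    "1 | 123\n"
)

match = BLOCK.search(sample)
fields = match.groupdict()
for name, value in fields.items():
    print(f"{name}: {value!r}")
```

Because each field has a name, a wrong capture shows up as a wrong value under a specific key. That is easier to trace back to one line of the pattern than a numbered group.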

There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is be as restrictive as possible at the start and add flexibility when you encounter cases that don't match.
