How do I remove specific whitespace inside an XML tag?

Posted 2019-08-13 03:04

Question:

I have a file with some XML tags that follow specific patterns (Name and Props are placeholders):

<Name id="mod:Name"/>
<Prop1 Name id="mod:object.Prop1 Name"/>
<Prop1 Prop2 Name id="mod:object.Prop1 Prop2 Name"/>
<Prop1 Prop2 Prop3 Name id="mod:object.Prop1 Prop2 Prop3 Name"/>

I am looking for a regex that removes the whitespace from the portion before the id="..." attribute.

This is how the result should look:

<Name id="mod:Name"/>
<Prop1Name id="mod:object.Prop1 Name"/>
<Prop1Prop2Name id="mod:object.Prop1 Prop2 Name"/>
<Prop1Prop2Prop3Name id="mod:object.Prop1 Prop2 Prop3 Name"/>

I have seen the (\S+)\s(?=\S+\s+) example, with the substitution being just \1, but that removes all the spaces except the last one and doesn't leave a space before the id=:

<Name id="mod:Name"/>
<Prop1Name id="mod:object.Prop1 Name"/>
<Prop1Prop2Name id="mod:object.Prop1Prop2 Name"/>
<Prop1Prop2Prop3Name id="mod:object.Prop1Prop2Prop3 Name"/>

I tried something like this:

^((\S+)*)\s((\S+)*)\s((\S+)*)\s((\S+)*)\s(?=id)

But that gave me catastrophic backtracking.

Not sure if it helps, but Sublime Text uses the Boost regex engine.

This is my first question on The Stack, so any improvements to the question would be welcome.

Thank you

This seems to work:

^(?|((\S+))\s|((\S+)\s(\S+))\s|((\S+)\s(\S+)\s(\S+)\s))(id=.*)

with a replacement of $2$3$4 $5

Thanks for the advice.

Answer 1:

A regex that removes all whitespace before the id attribute is:

(?:<\w+|(?!^)\G)\K\s+(\w+)(?=[^<>]*\bid=")

Replace with $1.

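The regex uses two operators. \G matches at the start of the input or at the end of the previous successful match; the (?!^) negative lookahead in front of it excludes the start-of-input case, so only the end of the previous match remains. \K discards the text matched so far, so that text is not part of the reported match and is not replaced.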

Breakdown:

  • (?:<\w+|(?!^)\G)\K - match either < followed by one or more word characters (letters, digits, or underscore) or the end of the previous successful match, then drop the text matched so far from the match
  • \s+ - match one or more whitespace characters
  • (\w+) - match and capture into Group 1 one or more word characters (the $1 backreference in the replacement restores this consumed text)
  • (?=[^<>]*\bid=") - succeed only if id=" follows as a whole word (\b is a word boundary) inside the same tag (the [^<>]* matches zero or more characters other than < and >, so the lookahead cannot cross a tag boundary).
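
Python's standard re module supports neither \G nor \K, so the pattern above cannot be used there unchanged. As a rough sketch of what the two operators do, the loop below emulates them with plain re calls: Pattern.match(text, pos) only succeeds exactly at pos, which stands in for \G, and copying the prefix through untouched stands in for \K. The function name collapse_tag_name and the sample line are made up for illustration; this is not how Boost implements either operator.

```python
import re

# First alternative of the group: "<" followed by word characters.
TAG_START = re.compile(r"<\w+")
# The part of the pattern after \K: whitespace, a captured word, and the
# lookahead that requires id=" later inside the same tag.
NEXT_WORD = re.compile(r'\s+(\w+)(?=[^<>]*\bid=")')


def collapse_tag_name(text):
    """Remove whitespace between the words that precede id=" in each tag."""
    out = []
    pos = 0
    while True:
        start = TAG_START.search(text, pos)
        if start is None:
            out.append(text[pos:])  # no more tags: copy the remainder
            break
        # Stands in for \K: everything up to and including "<word" is kept.
        out.append(text[pos:start.end()])
        pos = start.end()
        # Stands in for \G: match() only succeeds exactly at pos, so each
        # hit must begin where the previous one ended.
        while True:
            hit = NEXT_WORD.match(text, pos)
            if hit is None:
                break
            out.append(hit.group(1))  # same effect as replacing with $1
            pos = hit.end()
    return "".join(out)


print(collapse_tag_name('<Prop1 Prop2 Name id="mod:object.Prop1 Prop2 Name"/>'))
# prints: <Prop1Prop2Name id="mod:object.Prop1 Prop2 Name"/>
```

The inner loop stops at the word id itself, because no further id=" follows it inside the tag, so the space before id= and the spaces inside the attribute value are left alone.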

A faster alternative (replace with an empty string):

(?:<|(?!^)\G)\w+\K\s+(?!id=)

This regex matches < or the end of the previous successful match, then one or more word characters. \K then drops everything matched so far, so the reported match consists only of the one or more whitespace characters that follow, and only when they are not followed by id= (the negative lookahead (?!id=)). Replacing the match with an empty string removes that whitespace.
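
Both patterns depend on \G and \K, which Boost (and so Sublime Text) provides but some engines, Python's standard re among them, do not. For such engines, one possible stand-in is a different technique from the one in this answer: match the whole span from < up to the whitespace before id=" and strip the whitespace from that span in a replacement callback. The sketch below assumes every tag of interest carries an id="..." attribute; the names TAG_NAME and strip_name_spaces are hypothetical.

```python
import re

# Matches from "<" up to, but not including, the whitespace before id=".
# The lazy [^<>]*? stops at the first id=" inside the tag and can never
# run past the end of the tag.
TAG_NAME = re.compile(r'<[^<>]*?(?=\s+id=")')


def strip_name_spaces(text):
    """Delete all whitespace in the part of each tag that precedes id="."""
    return TAG_NAME.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


sample = "\n".join([
    '<Name id="mod:Name"/>',
    '<Prop1 Name id="mod:object.Prop1 Name"/>',
    '<Prop1 Prop2 Name id="mod:object.Prop1 Prop2 Name"/>',
    '<Prop1 Prop2 Prop3 Name id="mod:object.Prop1 Prop2 Prop3 Name"/>',
])
print(strip_name_spaces(sample))
```

On the four sample lines from the question this produces the desired result: the tag names are collapsed, while the space before id= and the spaces inside the attribute values are untouched.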