Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.
Say I have this string:
some text <tag link="fo>o"> other text
I want to match the whole tag, but if I use <[^>]+> it only matches <tag link="fo>.
How can I make sure that a > inside quotes is ignored?
I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
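For reference, the behaviour described above can be reproduced with a few lines of JavaScript. This is only a minimal sketch; the variable names are made up for illustration.

```javascript
// [^>]+ stops at the first ">", even when that ">" sits inside a quoted attribute value.
const input = 'some text <tag link="fo>o"> other text';
const naiveMatch = input.match(/<[^>]+>/);

console.log(naiveMatch[0]); // <tag link="fo>
```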
Regular Expression:
<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>
Online demo:
http://regex101.com/r/yX5xS8
Full Explanation:
I know this regex might be a headache to look at, so here is my explanation:
< # Open HTML tags
[^>]*? # Lazy Negated character class for closing HTML tag
(?: # Open Outside Non-Capture group
(?: # Open Inside Non-Capture group
('|") # Capture group for quotes, backreference group 1
[^'"]*? # Lazy Negated character class for quotes
\1 # Backreference 1
) # Close Inside Non-Capture group
[^>]*? # Lazy Negated character class for closing HTML tag
)* # Close Outside Non-Capture group
> # Close HTML tags
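As a quick check, the pattern can be tried against the string from the question. The sketch below assumes JavaScript (any engine with backreferences should behave the same way); `tagPattern` and `sample` are illustrative names only.

```javascript
// The pattern from this answer, written as a JavaScript regex literal.
const tagPattern = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

const sample = 'some text <tag link="fo>o"> other text';
const found = sample.match(tagPattern);

console.log(found[0]); // <tag link="fo>o">
```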
This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.
Regular expression
<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>
Demo
http://regex101.com/r/jO1oQ1
Explanation
< # start of HTML tag
[^'">]* # any run of characters other than ', " or >
( # outer group
( # inner group
"[^"]*" # "..."
| # or
'[^']*' # '...'
) #
[^'">]* # any run of characters other than ', " or >
)* # zero or more of outer group
> # end of HTML tag
This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?: in all places. (Using plain ( makes the regex shorter and a little more readable.)
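The differences claimed above can be checked with a short JavaScript sketch. It uses the non-capturing form of this answer's pattern next to the pattern from the earlier answer; `firstMatch` is a hypothetical helper written just for this comparison.

```javascript
// Non-capturing form of the pattern from this answer.
const quoteAware = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;
// Pattern from the earlier answer, for comparison.
const earlier = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

// Hypothetical helper: returns the matched text, or null when nothing matches.
const firstMatch = (re, text) => {
  const m = text.match(re);
  return m ? m[0] : null;
};

console.log(firstMatch(quoteAware, 'some text <tag link="fo>o"> other text')); // <tag link="fo>o">

// A single quote inside a double-quoted value, followed by ">".
console.log(firstMatch(quoteAware, `<a title="it's >">`)); // <a title="it's >">
console.log(firstMatch(earlier, `<a title="it's >">`));    // <a title="it's >

// An unterminated quote: rejected here, accepted by the earlier pattern.
console.log(firstMatch(quoteAware, "<a href='>")); // null
console.log(firstMatch(earlier, "<a href='>"));    // <a href='>
```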
(<.+?>[^<]+>)|(<.+?>)
You can write two regexes and combine them with |.
In this case:
(<.+?>[^<]+>) # matches the tag in: some text <tag link="fo>o"> other text
(<.+?>) # matches the tag in: some text <tag link="foo"> other text
If the first alternative matches, the second one is not tried, so make sure you put the special case first.
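To see which alternative fires, you can look at the capture groups. This is a minimal JavaScript sketch using the two sample strings from above, plus one extra input (my own, not from the question) where the first alternative takes more than the tag.

```javascript
const combined = /(<.+?>[^<]+>)|(<.+?>)/;

// ">" inside the attribute value: the first alternative matches.
const special = 'some text <tag link="fo>o"> other text'.match(combined);
console.log(special[1]); // <tag link="fo>o">
console.log(special[2]); // undefined

// No ">" inside the value: the first alternative fails, the second matches.
const plain = 'some text <tag link="foo"> other text'.match(combined);
console.log(plain[1]); // undefined
console.log(plain[2]); // <tag link="foo">

// Text that contains a ">" after the tag is swallowed by the first alternative.
const overreach = '<b>1 > 0</b>'.match(combined);
console.log(overreach[0]); // <b>1 >
```

The last case shows that this approach does not look at quotes at all, so it only suits input where a stray > cannot appear in the text following a tag.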
If you want this to work with escaped double quotes, try:
/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g
For example:
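const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
// `xml` is the string being scanned. Returns the index of the next ">" that is
// followed by an even number of unescaped double quotes, or -1 if there is none.
const nextGtMatch = () => {
  const match = gtExp.exec(xml);
  return match ? match.index : -1;
};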
And if you're parsing through a bunch of XML, you'll want to set .lastIndex to the position where the search should start:
gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
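Put together as something you can run, the idea might look like the sketch below. `findTagEnd` and the sample string are made up for illustration; the regex is the one given above.

```javascript
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first ">" at or after fromIndex that is
// followed by an even number of unescaped double quotes, or -1 if none exists.
function findTagEnd(xml, fromIndex) {
  gtExp.lastIndex = fromIndex; // exec() on a /g regex starts searching here
  const match = gtExp.exec(xml);
  return match ? match.index : -1;
}

// The attribute values contain ">" characters and escaped double quotes.
const xml = '<tag link="fo>o" title="say \\"hi\\" >"> other text';
const attrEndIndex = findTagEnd(xml, 0);

console.log(xml.slice(0, attrEndIndex + 1));
// <tag link="fo>o" title="say \"hi\" >">
```

Note that the lookahead runs to the end of the input ($), so it relies on the double quotes being balanced in everything that follows the tag, not just inside the tag itself.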