RegEx: Don't match a certain character if it's inside quotes

Posted 2019-04-07 04:50

Question:

Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.

Say I have this string:

some text <tag link="fo>o"> other text

I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.
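
A minimal sketch of the problem in JavaScript (the language and the variable names are assumptions for illustration; the question itself does not name a regex flavor):

```javascript
// Sample string from the question.
const input = 'some text <tag link="fo>o"> other text';

// Naive pattern: "<", then one or more characters that are not ">", then ">".
const naive = /<[^>]+>/;

// The match stops at the first ">", even though it sits inside the quotes.
const naiveMatch = input.match(naive)[0];
console.log(naiveMatch); // <tag link="fo>
```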

How can I make sure that a > inside quotes is ignored?

I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.

Answer 1:

Regular Expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Full Explanation:

I know this regex might be a headache to look at, so here is my explanation:

<                      # Open HTML tags
    [^>]*?             # Lazy Negated character class for closing HTML tag
    (?:                # Open Outside Non-Capture group
        (?:            # Open Inside Non-Capture group
            ('|")      # Capture group for quotes, backreference group 1
            [^'"]*?    # Lazy Negated character class for quotes
            \1         # Backreference 1
        )              # Close Inside Non-Capture group
        [^>]*?         # Lazy Negated character class for closing HTML tag
    )*                 # Close Outside Non-Capture group
>                      # Close HTML tags

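As a quick check, the pattern can be run against the string from the question. This is only a sketch in JavaScript (an assumption; the answer itself just links a regex101 demo), and the variable names are illustrative:

```javascript
// Sample string from the question.
const input = 'some text <tag link="fo>o"> other text';

// The pattern from this answer, written as a JavaScript regex literal.
// Group 1 captures the opening quote so that \1 can require the same closing quote.
const tagRe = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

const tagMatch = tagRe.exec(input);
console.log(tagMatch[0]);    // <tag link="fo>o">
console.log(tagMatch.index); // 10
```
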

Answer 2:

This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the lazy quantifier *?.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any run of characters other than ', " or >
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any run of characters other than ', " or >
    )*               #   zero or more of outer group
>                    # end of HTML tag

This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.

It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?:, in all places. (Just using ( makes the regex shorter, and a little bit more readable).
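
A minimal JavaScript sketch of the non-capturing variant described above, tried on the question's string and on two inputs of my own (the extra inputs and all variable names are assumptions for illustration):

```javascript
// Same pattern as above with every "(" replaced by "(?:" so nothing is captured.
const tagRe = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;

// The string from the question: the ">" inside the quotes is skipped.
const fromQuestion = 'some text <tag link="fo>o"> other text'.match(tagRe)[0];
console.log(fromQuestion); // <tag link="fo>o">

// A single quote and a ">" inside a double-quoted value are both allowed.
const mixedQuotes = '<a title="it\'s > ok">'.match(tagRe)[0];
console.log(mixedQuotes); // <a title="it's > ok">

// An unterminated quote means there is no match at all.
const unterminated = tagRe.test("<a href='>");
console.log(unterminated); // false
```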



Answer 3:

(<.+?>[^<]+>)|(<.+?>)

You can write two regexes and join them with '|'. In this case:

(<.+?>[^<]+>)   # matches the tag in:  some text <tag link="fo>o"> other text
(<.+?>)         # matches the tag in:  some text <tag link="foo"> other text

If the first alternative matches, the second one is not tried, so make sure you put the special case first.
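
A rough JavaScript sketch of how the alternation behaves on the two strings above. The third input is my own assumption, added to show that the first alternative also fires whenever any later > appears before the next <, so this approach seems to fit only inputs without a stray > in the text:

```javascript
// The two alternatives joined with "|"; the special case comes first.
const tagRe = /(<.+?>[^<]+>)|(<.+?>)/;

// ">" inside the quotes: the first alternative matches.
const quotedGt = 'some text <tag link="fo>o"> other text'.match(tagRe)[0];
console.log(quotedGt); // <tag link="fo>o">

// No extra ">": the first alternative fails and the second one matches.
const plainTag = 'some text <tag link="foo"> other text'.match(tagRe)[0];
console.log(plainTag); // <tag link="foo">

// Hypothetical input with a ">" in the text after the tag: the match runs too far.
const strayGt = 'some text <tag link="foo"> a > b'.match(tagRe)[0];
console.log(strayGt); // <tag link="foo"> a >
```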



Answer 4:

If you want this to work with escaped double quotes, try:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

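// Matches a ">" followed by an even number of unescaped double quotes up to the
// end of the input, i.e. a ">" that is not inside a "..." string.
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
// `xml` is the string being scanned. Returns the index of the next match, or -1.
const nextGtMatch = () => {
    const match = gtExp.exec(xml);
    return match ? match.index : -1;
};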

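And if you're parsing through a bunch of XML, you'll want to set .lastIndex before each call, so that the search resumes from your current position (exec honors .lastIndex because the regex has the g flag).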

gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
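
Putting the pieces together, a self-contained sketch might look like the following. The helper name nextUnquotedGt, its parameters, and the sample strings are assumptions for illustration and are not part of the snippets above:

```javascript
// ">" followed by an even number of unescaped double quotes up to the end of input.
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first ">" outside double quotes,
// searching from position `from`; returns -1 when there is none.
function nextUnquotedGt(xml, from = 0) {
    gtExp.lastIndex = from; // exec() starts here because of the g flag
    const match = gtExp.exec(xml);
    return match ? match.index : -1;
}

// The string from the question: the ">" at index 23 is inside quotes.
const fromQuestion = nextUnquotedGt('some text <tag link="fo>o"> other text');
console.log(fromQuestion); // 26

// Two tags: search once from the start, then resume after the first tag.
const xml = '<a b="1"> <c d="x>y">';
const firstEnd = nextUnquotedGt(xml);
const secondEnd = nextUnquotedGt(xml, firstEnd + 1);
console.log(firstEnd, secondEnd); // 8 20

// An escaped quote (\") does not close the attribute value.
const escaped = nextUnquotedGt('<tag link="a\\">b">');
console.log(escaped); // 17
```

Note that the lookahead scans to the end of the input for every candidate >, so on large documents this is likely slower than a hand-written loop.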