What is the regular expression for CDATA?

Hi, I have an example CDATA section here:

<![CDATA[asd[f]]]>

and

<tag1><![CDATA[asd[f]]]></tag1><tag2><![CDATA[asd[f]]]></tag2>

The CDATA regex I have is not able to recognize this:

"<![CDATA["([^\]]|"]"[^\]]|"]]"[^>])*"]]>"

This does not work either:

"<![CDATA["[^\]]*[\]]{2,}([^\]>][^\]]*[\]]{2,})*">"

Will someone please give me a regex for <![CDATA[asd[f]]]>? I need to use it in Lex/Flex.

I have answered this question, please vote on my answer, thanks.

Tags: xml regex parsing cdata lex

6 Answers

萌系小妹纸

#2 · 2020-06-17 09:32

This is the solution. We need a start state (an exclusive start condition in flex) so that whatever is between <![CDATA[ and ]]> does not get matched against the other rules.

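%option noyywrap
%x CDATA

%%
"<![CDATA["     { /* enter the exclusive CDATA state */ BEGIN(CDATA); printf("Entering CDATA\n"); }
<CDATA>"]]>"    { /* terminator: go back to the normal state */
    printf("End of CDATA\n");
    BEGIN(INITIAL);
}
<CDATA>[^\]]+   { printf("In CDATA: %s\n", yytext); }
<CDATA>"]"      { /* a lone ] that is not the start of ]]> */ printf("In CDATA: %s\n", yytext); }

%%
int main(void)
{
    yylex();
    return 0;
}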


我想做一个坏孩纸

#3 · 2020-06-17 09:33

I believe this other SO answer may be of some help, even though it grabs HTML content and targets .NET.

There are other answers with various options for grabbing CDATA in that same question.

CHAD's answer:

<!\[CDATA\[(.*?)\]\]>

Matching against:

<![CDATA[asd[f]]]>

retrieves:

asd[f]

According to FlexRegEx anyways.


Deceive 欺骗

#4 · 2020-06-17 09:34

One note: a search for CDATA should rule out comments as well, because a CDATA-like sequence could be embedded inside one.
/<!(?:\[CDATA\[(.*?)\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])>/sg
This can be done by checking, for each match returned by a global search, whether group 1 is defined.
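
The group-1 check can be sketched in Python as follows. This is only an illustration under a few assumptions: the pattern is transcribed to Python's `re` syntax with `re.DOTALL` standing in for the `/s` flag, and the function name `cdata_sections` and the sample string are made up for the example.

```python
import re

# Python transcription of the pattern above; re.DOTALL plays the role of /s.
PATTERN = re.compile(
    r'<!(?:\[CDATA\[(.*?)\]\]|--.*?--|\[[A-Z][A-Z ]*\[.*?\]\])>',
    re.DOTALL,
)

def cdata_sections(text):
    """Return CDATA contents, skipping anything that sits inside a comment."""
    found = []
    for match in PATTERN.finditer(text):
        # Group 1 is only set when the CDATA alternative matched;
        # comments and other <![...[ sections leave it as None.
        if match.group(1) is not None:
            found.append(match.group(1))
    return found

sample = '<!-- <![CDATA[hidden]]> --><a><![CDATA[asd[f]]]></a>'
print(cdata_sections(sample))
```

Because the comment is consumed as a single match, the CDATA-like text inside it is never seen as a section of its own.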


在下西门庆

#5 · 2020-06-17 09:47

<!\[CDATA\[\s*(?:.(?<!\]\]>)\s*)*\]\]>

This is the previous answer, slightly modified.


走好不送

#6 · 2020-06-17 09:53

The problem is that this is rather awkward to match with the sort of regular expressions used in lex. If you had an engine that supported Perl-style extensions (non-greedy quantifiers and lookahead), you could use either of these:

<!\[CDATA\[(.*?)\]\]>

<!\[CDATA\[((?:[^]]|\](?!\]>))*)\]\]>

(The first uses non-greedy quantifiers, the second uses negative lookahead constraints. OK, it uses non-capturing parens too, but you can use capturing ones there instead; that's not so important.)
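
As a quick check, both patterns can be tried in an engine that supports these features. The sketch below uses Python's `re` module purely as a stand-in (the names `LAZY`, `LOOKAHEAD`, and `sample` are illustrative); it says nothing about lex or flex, which accept neither construct.

```python
import re

# The two patterns from above, unchanged.
LAZY = re.compile(r'<!\[CDATA\[(.*?)\]\]>')
LOOKAHEAD = re.compile(r'<!\[CDATA\[((?:[^]]|\](?!\]>))*)\]\]>')

# The second sample from the question.
sample = '<tag1><![CDATA[asd[f]]]></tag1><tag2><![CDATA[asd[f]]]></tag2>'

lazy_result = LAZY.findall(sample)
lookahead_result = LOOKAHEAD.findall(sample)
print(lazy_result)
print(lookahead_result)
```

On this input both patterns capture the same content for each of the two sections.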

It's probably easier to handle this by using a similar strategy to the way C-style comments are handled in lex, by having one rule to match the start of the CDATA (on <![CDATA[) and put the lexer into a separate state that it leaves on seeing ]]>, while collecting all the characters in-between. This is instructive on the topic (and it seems that this is an area where flex and lex differ) and it covers all the strategies that you can take to make this work.
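
The two-state idea can be illustrated outside lex as well. The following is a minimal Python sketch, not flex code: the function `scan` and the token names `TEXT` and `CDATA` are hypothetical, and `str.find` stands in for the character-by-character matching a generated scanner would do.

```python
OPEN = '<![CDATA['
CLOSE = ']]>'

def scan(text):
    """Split text into ('TEXT', s) and ('CDATA', s) tokens using two states."""
    tokens = []
    pos = 0
    in_cdata = False          # False plays the role of the INITIAL state
    while pos < len(text):
        if not in_cdata:
            start = text.find(OPEN, pos)
            end = len(text) if start == -1 else start
            if end > pos:
                tokens.append(('TEXT', text[pos:end]))
            if start == -1:
                break
            pos = start + len(OPEN)
            in_cdata = True   # like BEGIN(CDATA)
        else:
            end = text.find(CLOSE, pos)
            if end == -1:
                raise ValueError('unterminated CDATA section')
            tokens.append(('CDATA', text[pos:end]))
            pos = end + len(CLOSE)
            in_cdata = False  # like BEGIN(INITIAL)
    if in_cdata:
        raise ValueError('unterminated CDATA section')
    return tokens

print(scan('<tag1><![CDATA[asd[f]]]></tag1>'))
```

Everything between the opener and the first ]]> is collected verbatim, so brackets inside the section never need special handling.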

Note that the cause of all these problems is that it's very difficult to write a rule with simple regular expressions that expresses the fact that a greedy regular expression must only match a ] if it is not followed by ]>. It's much easier to do if you've only got a two-character (or single-character!) end-of-interesting-section sequence, because you don't need such an elaborate state machine.
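
For comparison, here is one way the "elaborate state machine" might be spelled out using only character classes, alternation, and repetition. Treat it as a sketch: the pattern is my own construction rather than something from the answers here, it is checked below with Python's `re` only, and it has not been run through flex.

```python
import re

# Body alternatives: a non-] character, a single ] followed by a non-],
# or a run of two or more ] followed by something other than ] or >.
# The terminator then accepts a run of two or more ] followed by >.
CDATA = re.compile(r'<!\[CDATA\[(?:[^\]]|\][^\]]|\]\]+[^\]>])*\]\]+>')

single = '<![CDATA[asd[f]]]>'
double = '<tag1><![CDATA[asd[f]]]></tag1><tag2><![CDATA[asd[f]]]></tag2>'

single_ok = CDATA.fullmatch(single) is not None
double_matches = CDATA.findall(double)
# The match must stop at the first ]]> rather than run on to a later one.
first_only = CDATA.search('<![CDATA[a]]>b]]>').group(0)

print(single_ok, double_matches, first_only)
```

The difference from the first attempt in the question is that a run of closing brackets is consumed as a whole, so the extra ] in asd[f]]]> ends up in the terminator's run instead of breaking the match.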


chillily

#7 · 2020-06-17 09:55

Easy enough, it should be this:

<!\[CDATA\[.*?\]\]>

At least it works on regexpal.com
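
One caveat worth checking in whichever engine is used: `.` often does not match a newline by default, so a CDATA section that spans lines can be missed. The snippet below shows this in Python, where the `re.DOTALL` flag changes that behaviour; the sample text is made up for the example.

```python
import re

PATTERN = r'<!\[CDATA\[.*?\]\]>'
text = '<![CDATA[line one\nline two]]>'

# Without DOTALL the dot stops at the newline, so nothing matches.
default_matches = re.findall(PATTERN, text)
# With DOTALL the dot also matches newlines and the section is found.
dotall_matches = re.findall(PATTERN, text, re.DOTALL)

print(default_matches)
print(dotall_matches)
```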

