PHP RegExp for nested Div tags

I need a regexp I can use with PHP's preg_match_all() to extract the content inside div tags. The divs look like this:

<div id="t1">Content</div>

So far I've come up with this regexp, which matches every div with id="t[number]":

/<div id="t(\d)">(.*?)<\/div>/

The problem arises when the content itself contains nested divs, because the lazy (.*?) stops at the first closing tag:

<div id="t1">Content <div>more stuff</div></div>

Any ideas on how I make my regexp work with nested tags?

Thanks

Tags: php regex tags nested

4 Answers

Viruses.

#2 · 2019-02-07 13:57

I think it would be better to use a DOM tool.

Fickle 薄情

#3 · 2019-02-07 14:00

As I recently found out, regex can't do that.

See the related question "Matching pair tag with regex".

I ended up using XPath, and it works like a charm.
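For illustration, here is a minimal sketch of what the XPath route could look like with PHP's bundled DOM extension. The sample markup and the variable names ($text, $results) are assumptions made for the example, not taken from the answer above.

```php
<?php
// Sample input (assumed for this sketch): two numbered divs, one with a nested div.
$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz <div id="t2">yes</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($text);              // libxml wraps the fragment in <html><body>
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = [];

// XPath 1.0 has no regex support, so select ids starting with "t"
// and check the "t + digits" shape in PHP.
foreach ($xpath->query('//div[starts-with(@id, "t")]') as $div) {
    $id = $div->getAttribute('id');
    if (preg_match('/^t\d+$/', $id)) {
        $results[$id] = $dom->saveHTML($div); // outer HTML, nested divs included
    }
}

print_r($results);
```

Because the parser tracks opening and closing tags itself, the nesting depth does not matter here.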


Emotional °昔

#4 · 2019-02-07 14:09

Try a parser instead:

require_once "simple_html_dom.php"; // third-party Simple HTML DOM parser

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz  <div id="t2">yes</div>';
$html = str_get_html($text);

// Visit every div and print those whose id is "t" followed only by digits.
foreach ($html->find('div') as $e) {
    if (isset($e->attr['id']) && preg_match('/^t\d+$/', $e->attr['id'])) {
        echo $e->outertext . "\n";
    }
}

Output:

<div id="t1">Content <div>more stuff</div></div>
<div id="t2">yes</div>

Download the parser here: http://simplehtmldom.sourceforge.net/

Edit: Mostly for my own amusement, I tried to do it with a regex. Here's what I came up with:

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div>
      baz <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>';
if(preg_match_all('#<div\s+id="t\d+">[^<>]*(<div[^>]*>(?:[^<>]*|(?1))*</div>)[^<>]*</div>#si', $text, $matches)) {
    print_r($matches[0]);
}

Output:

Array
(
    [0] => <div id="t1">Content <div>more stuff</div></div>
    [1] => <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>
)

And a small explanation:

<div\s+id="t\d+">  # match an opening 'div' with an id that starts with 't' and some digits
[^<>]*             # match zero or more chars other than '<' and '>'
(                  # open group 1
  <div[^>]*>       #   match an opening 'div'
  (?:              #   open a non-capturing group
    [^<>]*         #     match zero or more chars other than '<' and '>'
    |              #     OR
    (?1)           #     recursively match what is defined by group 1
  )*               #   close the non-capturing group and repeat it zero or more times
  </div>           #   match a closing 'div'
)                  # close group 1
[^<>]*             # match zero or more chars other than '<' and '>'
</div>             # match a closing 'div'
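Note that the pattern as written expects exactly one div directly inside the outer div, so an outer div with no child div, or with two sibling child divs, is not matched. A hedged variant that recurses on the content instead could look like the sketch below. The pattern, the sample text and the variable names are assumptions for illustration. It still only copes with div tags and well-formed input.

```php
<?php
// Sample input (assumed): t1 has one child div, t2 has none, t3 has two siblings.
$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> '
      . 'baz <div id="t2">yes</div> <div id="t3">a<div>b</div>c<div>d</div></div>';

// Group 1 is the content of a div: any mix of plain text and complete
// child divs, where each child's content is matched by recursing into group 1.
$pattern = '#<div\s+id="t\d+"[^>]*>((?:[^<>]++|<div\b[^>]*>(?1)</div>)*)</div>#i';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);   // whole matched divs
print_r($matches[1]);   // their inner content
```

The possessive [^<>]++ keeps the text runs from being split up during backtracking, which helps avoid runaway matching on long inputs.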

Now perhaps you understand why people try to dissuade you from using regex for this. As already noted, it will not help if the HTML is improperly formed: the regex will make a bigger mess of the output than an HTML parser, I assure you. Also, the regex will probably make your eyes bleed, and your colleagues (or the people who will maintain your software) may come looking for you after seeing what you did. :)

Your best bet is to first clean up your input (using Tidy or similar), and then use a parser to get the info you want.
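As a rough sketch of the clean-up step: the helper below uses PHP's tidy extension when it is loaded and otherwise falls back to libxml's own error recovery. The function name repairHtml and the sample input are hypothetical, and the chosen tidy option is only one reasonable setting.

```php
<?php
// Hypothetical helper: returns a repaired copy of an HTML fragment.
function repairHtml(string $html): string
{
    // Prefer the tidy extension when it is installed.
    if (function_exists('tidy_repair_string')) {
        $fixed = tidy_repair_string($html, ['show-body-only' => true], 'utf8');
        if (is_string($fixed) && $fixed !== '') {
            return $fixed;
        }
    }

    // Fallback: let libxml close unbalanced tags, then serialize the body's children.
    // Note: loadHTML() assumes ISO-8859-1 unless the markup declares a charset.
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $out = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$broken = '<div id="t1">Content <div>more stuff</div>'; // outer </div> is missing
$clean = repairHtml($broken);
echo $clean, "\n";
```

The repaired string can then be handed to whichever parser you prefer.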

对你真心纯属浪费

#5 · 2019-02-07 14:11

If you believe the author of the comment linked below, there's at least one regex that does the trick, and he says it's faster than DOM methods. I agree with him.

http://www.php.net/manual/fr/regexp.reference.recursive.php#95568

$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";
