
Splitting up HTML code tags and content

Posted 2019-05-21 21:13

Question:

Does anyone who knows more about regular expressions than I do know how to split up HTML code so that all tags and all words are separated? For example:

<p>Some content <a href="www.test.com">A link</a></p>

is separated like this:

array = { [0]=>"<p>",
          [1]=>"Some",
          [2]=>"content",
          [3]=>"<a href='www.test.com'>",
          [4]=>"A",
          [5]=>"link",
          [6]=>"</a>",
          [7]=>"</p>" }

I've been using preg_split so far and have managed to split the string either by whitespace or by tags, but when I split by tags all the content ends up in one array element, and I need that to be split too.

Can anyone help me out?

Answer 1:

preg_split shouldn't be used in that case. Try preg_match_all:

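$text = '<p>Some content <a href="www.test.com">A link</a></p>';
// Match either a whole tag (<...>) or a run of characters that are not <, > or whitespace.
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
// The full matches are collected in $tokens[0].
print_r($tokens);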

Output:

Array
(
    [0] => Array
        (
            [0] => <p>
            [1] => Some
            [2] => content
            [3] => <a href="www.test.com">
            [4] => A
            [5] => link
            [6] => </a>
            [7] => </p>
        )

)

I assume you forgot to include the 'A' in 'A link' in your example.

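Be aware that if your HTML contains < or > characters that are not meant as the start or end of a tag (inside an attribute value, for example), this regex will split the input in the wrong places. Hence the usual warnings against parsing HTML with regex.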
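
To make that caveat concrete, here is a small sketch using the same pattern on a made-up input (the title attribute is my own example, not from the question) where an attribute value contains a literal >:

```php
<?php
// Hypothetical input: the attribute value contains a literal '>'.
$text = '<a title="a > b">x</a>';

// Same pattern as in the answer above.
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens[0]);

// The tag is cut at the first '>', so the tokens come out as:
//   <a title="a >
//   b"
//   x
//   </a>
```

The opening tag is broken in two and the rest of the attribute shows up as if it were a word, which is the kind of breakage the warning is about.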



Answer 2:

You could check out Simple HTML DOM Parser

Or look at the DOM parser in PHP
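
As a rough sketch of the second suggestion, assuming the PHP dom extension is enabled: the snippet below loads the example string with DOMDocument and walks the tree to rebuild the flat token list from the question. The tokenize() function is a hypothetical helper written for this illustration, not part of PHP, and the opening tags are re-serialized from the parsed attributes rather than copied from the input.

```php
<?php
// Hypothetical helper: walk a DOM node and append tag and word tokens.
function tokenize(DOMNode $node, array &$tokens): void
{
    if ($node instanceof DOMText) {
        // Split text on whitespace, dropping empty pieces.
        foreach (preg_split('/\s+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $tokens[] = $word;
        }
        return;
    }
    if ($node instanceof DOMElement) {
        // Rebuild the opening tag from the element name and its attributes.
        $open = '<' . $node->tagName;
        foreach ($node->attributes as $attr) {
            $open .= ' ' . $attr->name . '="' . htmlspecialchars($attr->value) . '"';
        }
        $tokens[] = $open . '>';
        foreach ($node->childNodes as $child) {
            tokenize($child, $tokens);
        }
        // Note: this also emits a closing tag for void elements such as <br>.
        $tokens[] = '</' . $node->tagName . '>';
    }
}

$text = '<p>Some content <a href="www.test.com">A link</a></p>';

$doc = new DOMDocument();
// The flags stop libxml from adding <html>/<body> wrappers and a doctype.
$doc->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$tokens = [];
foreach ($doc->childNodes as $child) {
    tokenize($child, $tokens);
}
print_r($tokens);
```

Comments and other node types are simply skipped here, and real-world markup may need extra handling (void elements, encodings, multiple top-level elements), so treat this as a starting point rather than a finished tokenizer.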



Answer 3:

Give Simple HTML DOM Parser a try. HTML is too irregular for regular expressions.



Answer 4:

I currently use Simple HTML DOM Parser in several applications and find it to be an excellent tool, even when compared against other HTML parsers written in other languages.

Why exactly are you splitting the HTML into the flat list of tokens you described? Wouldn't a tree-like structure of DOM elements be a better fit for your application?