Regex select all text between tags

Posted 2019-01-04 07:05

What is the best way to select all the text between two tags, e.g. the text between all the 'pre' tags on the page?

14 answers
够拽才男人
#2 · 2019-01-04 07:29

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) The selection has to be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is the expression applied to the content between the tags. Here it matches a letter, a digit, a newline, a space, or one of the special characters listed in the square brackets. The pipe character | simply means "OR".

+? The plus sign means one or more of the above, in any order. The question mark changes the default behavior from greedy to non-greedy (lazy), so the match stops at the first closing tag.

(?=(</pre>)) The selection has to be followed by the </pre> tag.
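As a rough illustration of the lookbehind/lookahead idea, here is a minimal Python sketch. It is an assumption-laden simplification: the explicit character list is replaced by `.` with `re.DOTALL`, and the sample string and the name `PRE_BODY` are made up for the example.

```python
import re

# Lookbehind and lookahead assert the tags without including them in the match.
# re.DOTALL lets "." match newlines, standing in for the explicit character list.
PRE_BODY = re.compile(r"(?<=<pre>).+?(?=</pre>)", re.DOTALL)

# Hypothetical sample input.
html = "<p>intro</p><pre>first\nblock</pre><pre>second</pre>"

bodies = PRE_BODY.findall(html)
print(bodies)  # ['first\nblock', 'second']
```

Because the tags are only asserted, each match is the bare content and nothing needs to be stripped afterwards.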


Depending on your use case, you might need to add modifiers such as i or m:

  • i - case-insensitive
  • m - multi-line search

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
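How the two modifiers behave depends on the regex engine. As a hedged sketch in Python (not the Sublime Text setup used above), `re.IGNORECASE` plays the role of i and `re.MULTILINE` the role of m; in Python the latter only changes where `^` and `$` may match. The sample text and variable names are invented for the example.

```python
import re

# Hypothetical input: one tag pair per line, in different letter cases.
text = "<PRE>a</PRE>\n<pre>b</pre>"
pattern = r"^<pre>(.+?)</pre>$"

# Without flags: "^" only matches at the very start, where the tag is uppercase.
plain = re.findall(pattern, text)

# With i and m: case is ignored and "^"/"$" match at every line boundary.
flagged = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

print(plain)    # []
print(flagged)  # ['a', 'b']
```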

JavaScript and lookbehind

The above example should work fine with languages such as PHP, Perl, and Java. JavaScript, however, only gained lookbehind support in ES2018, so on older engines we have to forget about using (?<=(<pre>)) and look for a workaround. Perhaps simply strip the opening tag (the first five characters for <pre>) from each match, as in Regex match text between tags.
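Another common workaround, which needs no lookbehind at all, is to match the tags and capture only the content in a group. The sketch below shows the idea in Python for consistency with the other examples here; the same pattern shape works in most engines. `PRE_GROUP` and the sample string are illustrative names, not part of the original answer.

```python
import re

# The tags are part of the overall match, but only group 1 is returned.
PRE_GROUP = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)

# Hypothetical sample input.
html = "<pre>alpha</pre> text <pre>beta\ngamma</pre>"

contents = PRE_GROUP.findall(html)
print(contents)  # ['alpha', 'beta\ngamma']
```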

Also look at the JavaScript regex documentation for non-capturing parentheses.

Fickle 薄情
#3 · 2019-01-04 07:32

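You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" ); Note that this is a negated character class: it matches any single character other than <, ', /, > and the letters of tagname, so on its own it does not extract the text between two tags.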

虎瘦雄心在
#4 · 2019-01-04 07:33

Use the pattern below to get the content between an element's tags. Replace [tag] with the name of the element you wish to extract the content from.

<[tag]>(.+?)</[tag]>

Sometimes tags have attributes, like an anchor tag with href. In that case, use the pattern below.

 <[tag][^>]*>(.+?)</[tag]>
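A small Python sketch of the attribute-tolerant pattern follows. The helper `tag_contents` and the sample markup are hypothetical. One deliberate deviation from the pattern above: a `\b` is added after the tag name so that `a` does not also match `<abbr>`.

```python
import re

def tag_contents(tag, html):
    """Return the text between <tag ...> and </tag> for each non-nested occurrence."""
    name = re.escape(tag)
    # \b is an addition to the pattern above: it keeps "a" from matching "<abbr>".
    pattern = re.compile(rf"<{name}\b[^>]*>(.+?)</{name}>", re.DOTALL)
    return pattern.findall(html)

# Hypothetical sample input.
html = '<a href="/one">First</a> and <abbr>no</abbr> <a class="x" href="/two">Second</a>'

links = tag_contents("a", html)
print(links)  # ['First', 'Second']
```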
Fickle 薄情
#5 · 2019-01-04 07:33

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as anything between <tag> and </tag> is not that tag itself, this will work:

preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array (
    [0] => full matched string
    [1] => tag name
    [2] => tag content
)
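The same backreference pattern can be tried in Python; this is only a sketch, with `re.DOTALL` added as an assumption so that content may span lines, and with an invented sample string. It also shows the limitation mentioned above: the inner `<b>` element is returned as raw text, not parsed.

```python
import re

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
ANY_TAG = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

# Hypothetical sample input.
html = "<h1>Title</h1><p class='x'>Body <b>bold</b> text</p>"

pairs = ANY_TAG.findall(html)  # list of (tag name, tag content) tuples
for name, content in pairs:
    print(name, "->", content)
```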

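A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and read its content, which might look something like this:

$doc = new DOMDocument();
$doc->loadHTML($html);                      // parse an HTML string; load() expects a path to an XML file
$nodes = $doc->getElementsByTagName('el');  // DOMNodeList of all matching elements
$value = $nodes->item(0)->nodeValue;        // text content of the first match (a property, not a method)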

And since this is a proper parser, it will be able to handle nested tags etc.
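For comparison, the parser approach can be sketched with Python's standard-library `html.parser`. This is not the DOMDocument API; the class `TagTextCollector` and the sample markup are hypothetical, and the sketch only collects text, tracking nesting depth so that a tag nested inside itself is handled.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside each outermost occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.parts = []     # text pieces of the element being read
        self.results = []   # finished text, one entry per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

# Hypothetical sample input with a div nested inside a div.
parser = TagTextCollector("div")
parser.feed("<div>outer <div>inner</div> tail</div><p>skip</p><div>second</div>")
parser.close()
print(parser.results)  # ['outer inner tail', 'second']
```

A regex with a lazy quantifier would stop at the first closing `</div>` here, which is the nesting problem the answer describes.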

Melony?
#6 · 2019-01-04 07:33

For multiple lines:

<htmltag>(.+)((\s)+(.+))+</htmltag>
男人必须洒脱
#7 · 2019-01-04 07:37

The closing tag may be on a different line from the opening tag, which is why \n needs to be added.

<PRE>(.|\n)*?<\/PRE>
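To see why the `\n` alternative matters, here is a short Python sketch with an invented two-line sample. The group is made non-capturing and wrapped in an outer capture for convenience, which is an adaptation of the pattern above; the last search uses `re.DOTALL`, which in Python gives the same result.

```python
import re

# Hypothetical input whose content spans two lines.
text = "<PRE>first line\nsecond line</PRE>"

# "." alone does not match the newline, so nothing is found.
dot_only = re.search(r"<PRE>.*?</PRE>", text)

# Alternation with \n, as in the pattern above (inner group made non-capturing).
dot_or_newline = re.search(r"<PRE>((?:.|\n)*?)</PRE>", text)

# re.DOTALL makes "." itself match newlines.
dotall = re.search(r"<PRE>(.*?)</PRE>", text, re.DOTALL)

print(dot_only)                                    # None
print(dot_or_newline.group(1) == dotall.group(1))  # True
```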