What is the best way to select all the text between two tags, e.g. the text between all the 'pre' tags on the page?

Tags: html regex html-parsing

14 answers

Reply #2 · 2019-01-04 07:29

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) The selection has to be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is the expression that matches the content. In this case, it selects a letter, a digit, a newline character, one of the special characters listed in the square brackets, or a space. The pipe character | simply means "OR".

+? The plus character selects one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'ungreedy', so the match stops at the first closing tag.

(?=(</pre>)) The selection has to be followed by the </pre> tag.
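To make the lookbehind/lookahead idea concrete, here is a minimal sketch in Python. The sample string is made up for illustration, and the long character list is swapped for `[\s\S]` (any character), which is a simplification rather than the exact pattern above.

```python
import re

# Hypothetical sample input, not taken from the original post.
html = "<p>intro</p><pre>first block</pre><p>mid</p><pre>second, block!</pre>"

# The lookbehind and lookahead keep the tags out of the match itself.
# [\s\S]+? lazily matches one or more characters of any kind.
pattern = re.compile(r"(?<=<pre>)[\s\S]+?(?=</pre>)")

matches = pattern.findall(html)
print(matches)
```

Because the pattern has no capturing groups, `findall` returns the full matches, i.e. only the text between the tags.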

Depending on your use case, you might need to add some modifiers, such as i or m:

i - case-insensitive
m - multi-line search (^ and $ match at every line break)

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
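As a rough illustration of what modifiers change, the sketch below uses Python's flag names (an assumption, since the answer does not name a language): `re.IGNORECASE` plays the role of i, and `re.DOTALL` lets `.` cross line breaks. The sample text is hypothetical.

```python
import re

# Hypothetical input with an upper-case tag and a line break inside it.
text = "<PRE>line one\nline two</PRE>"

# No flags: the lower-case tag name does not match, and '.' stops at '\n'.
plain = re.search(r"<pre>(.+?)</pre>", text)

# IGNORECASE accepts <PRE>; DOTALL lets '.' match the newline as well.
flagged = re.search(r"<pre>(.+?)</pre>", text, re.IGNORECASE | re.DOTALL)

print(plain)
print(flagged.group(1))
```

Note that in Python the multi-line flag (`re.MULTILINE`) only affects `^` and `$`; it does not make `.` match newlines.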

Older JavaScript engines do not support lookbehind

The above example should work fine in languages such as PHP, Perl and Java. JavaScript, however, only gained lookbehind in ES2018, so for older engines we have to forget about using (?<=(<pre>)) and look for a workaround. One option is to match the opening tag as part of the pattern and then strip it from each result, as in the question "Regex match text between tags".

Also see the JavaScript regex documentation on non-capturing parentheses.
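A capturing group gives the same result without any lookbehind, which is why it works in engines that lack it. The sketch below is written in Python for illustration only; the sample string is made up, and the same pattern shape should carry over to other regex flavors.

```python
import re

# Hypothetical sample input.
html = "<pre>alpha</pre> text <pre>beta</pre>"

# (?:...) groups without capturing; only the middle (...) is captured,
# so the tags are matched but left out of the extracted value.
pattern = re.compile(r"(?:<pre>)([\s\S]*?)(?:</pre>)")

contents = [m.group(1) for m in pattern.finditer(html)]
print(contents)
```

Reading group 1 replaces the "strip the tag from the result" step.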


Fickle 薄情

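Reply #3 · 2019-01-04 07:32

In Java you can use Pattern pattern = Pattern.compile( "<tagname>(.*?)</tagname>", Pattern.DOTALL ); and read group 1 of each match. A character class such as [^<'tagname'/>] only matches a single character that is not in the set, so it cannot select the text between tags.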


虎瘦雄心在

Reply #4 · 2019-01-04 07:33

Use the pattern below to get the content between an element's tags. Replace [tag] with the name of the element whose content you want to extract.

<[tag]>(.+?)</[tag]>

Sometimes tags have attributes, like an anchor tag with href. In that case, use the pattern below.

<[tag][^>]*>(.+?)</[tag]>
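The attribute-tolerant pattern can be wrapped in a small helper. This is a sketch in Python under a few assumptions: the function name `between_tags` and the sample markup are hypothetical, and a `\b` word boundary is added so that a short tag name such as p does not also match pre.

```python
import re

def between_tags(html: str, tag: str) -> list[str]:
    """Return the text inside each <tag ...>...</tag> pair (no nesting support)."""
    name = re.escape(tag)
    # \b ends the tag name; [^>]* skips over any attributes.
    pattern = re.compile(rf"<{name}\b[^>]*>(.+?)</{name}>", re.DOTALL)
    return pattern.findall(html)

# Hypothetical sample markup.
sample = '<a href="https://example.com">link text</a><pre class="x">code</pre>'
print(between_tags(sample, "a"))
print(between_tags(sample, "pre"))
```

Like the original pattern, this does not handle an element nested inside another element of the same name.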


Fickle 薄情

Reply #5 · 2019-01-04 07:33

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as anything between the opening tag and its closing tag is not that tag itself, this will work:

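// Single quotes keep \1 as a regex backreference; in a double-quoted
// PHP string "\1" would be read as an octal character escape.
preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
// $matches = array([0] => full matched string, [1] => tag name, [2] => tag content)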
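For comparison, the same backreference idea can be sketched in Python. The sample string is hypothetical, and `re.DOTALL` is added here so that content spanning several lines is also matched, which goes slightly beyond the PHP line above.

```python
import re

# Hypothetical sample input with two different, non-nested elements.
subject = '<div class="box">hello</div><span>world</span>'

# (\w+) captures the tag name; \1 requires the same name in the closing tag.
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(subject)]
print(pairs)
```

Each tuple holds the tag name and the tag content, mirroring entries [1] and [2] of `$matches`.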

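A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and read its content, which might look something like this:

// loadHTML() parses a string; load() would expect a file path
$doc = new DOMDocument();
$doc->loadHTML($html);
// getElementsByTagName() returns a list of matching nodes
$nodes = $doc->getElementsByTagName('el');
// nodeValue is a property holding the node's text content
$value = $nodes->item(0)->nodeValue;

And since this is a proper parser, it will be able to handle nested tags, etc.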
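The parser-based approach can also be sketched with Python's standard `html.parser` module. The class name `TagTextCollector` and the sample markup are hypothetical; the depth counter is one simple way to cope with nested tags of the same name, not the only one.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside every outermost <tag> element."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.buffer = []    # text pieces of the element being read
        self.results = []   # finished text, one entry per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

collector = TagTextCollector("div")
collector.feed("<div>outer <div>inner</div> tail</div><p>skip</p>")
collector.close()
print(collector.results)
```

The nested div is folded into the text of its parent, which is the case a single regular expression cannot handle reliably.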


Melony?

Reply #6 · 2019-01-04 07:33

For multiple lines:

<htmltag>(.+)((\s)+(.+))+</htmltag>


男人必须洒脱

Reply #7 · 2019-01-04 07:37

The closing tag can be on another line, which is why \n needs to be added.

<PRE>(.|\n)*?<\/PRE>
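A quick check of this pattern in Python, on a made-up two-line input. A capturing group is added around the alternation so the content can be read out, and a second variant with the `re.DOTALL` flag is shown as an alternative that should give the same result.

```python
import re

# Hypothetical input where the closing tag sits on a later line.
text = "<PRE>first line\nsecond line</PRE>"

# The answer's pattern: '.' or '\n' together cover every character.
alternation = re.search(r"<PRE>((?:.|\n)*?)</PRE>", text)

# Alternative: let '.' match newlines directly via the DOTALL flag.
dotall = re.search(r"<PRE>(.*?)</PRE>", text, re.DOTALL)

print(alternation.group(1) == dotall.group(1))
```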
