regex to match html tags with specific attributes

Posted 2019-03-17 07:20

I am trying to match all HTML tags that do not have the attribute "term" or "range".

Here is a sample of the HTML:

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

My regex is: <(.*?)((?!\bterm\b).)>

Unfortunately, this matches all the tags. Ideally the inner text would not be matched either, since I need to filter out every tag except the ones with that specific attribute.

5 answers
神经病院院长
#2 · 2019-03-17 07:21

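This should do what you want. It is written for Perl, and the syntax may differ depending on the language you are using.

/(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /igx

The negative lookahead rejects any tag that contains a term= or range= attribute before its closing >. The code below demonstrates the pattern in a Perl program.

use strict;
use warnings;

# Skip any opening tag whose text contains a term= or range= attribute
my $pattern = qr/ (?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /ix;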

my $str = <<'END';

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

END

print "$_\n" foreach $str =~ /$pattern/g;

OUTPUT

<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
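
Since the syntax differs between languages, here is a rough Python port of the same pattern, offered as a sketch rather than a definitive version. It assumes Python's standard re module; the names PATTERN, sample and found are illustrative and not part of the original answer. The /x whitespace is removed and /i becomes re.IGNORECASE.

```python
import re

# Negative lookahead: reject a tag if term= or range= appears before its '>'.
PATTERN = re.compile(r'(?![^>]+\b(?:term|range)=)(<[a-z]+.*?>)', re.IGNORECASE)

# Shortened version of the sample HTML from the question.
sample = '''
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.
'''

# findall returns the contents of the single capture group for each match.
found = PATTERN.findall(sample)
for tag in found:
    print(tag)
```

As in the Perl version, closing tags are skipped because the pattern requires a letter right after the <.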
forever°为你锁心
#3 · 2019-03-17 07:35

I think you should use an HTML parser to solve this problem. Writing your own regular expression is possible, but it is error-prone. Imagine that your markup contains something like this:

<span      class = "a"              >b</span         >

It is also valid, but accounting for every possible space and tab character in your regular expression would not be easy, and you would need thorough testing before you could be sure it works as expected.
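
A minimal sketch of the parser approach, assuming Python and its standard html.parser module. The thread names neither a language nor a parser, so both are assumptions, and the names TagFilter and tags_without are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Collect start tags that carry none of the excluded attributes."""

    def __init__(self, excluded=("term", "range")):
        super().__init__()
        self.excluded = set(excluded)
        self.kept = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if not names & self.excluded:
            # get_starttag_text() returns the raw text of the current start tag.
            self.kept.append(self.get_starttag_text())


def tags_without(html, excluded=("term", "range")):
    """Return the raw start tags in html that have no excluded attribute."""
    parser = TagFilter(excluded)
    parser.feed(html)
    parser.close()
    return parser.kept


sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)
print(tags_without(sample))
```

The parser takes care of whitespace and quoting inside the tag, so the filter only has to compare attribute names.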

Fickle 薄情
#4 · 2019-03-17 07:40

If regex is your thing for this, this works for me. (Note: filtering out comments, doctype, and other entities is not included.
Other warnings: tags could be embedded in scripts, comments, and other things.)

span tag (w/ attr) no term|range attrs

'<span
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ attr) no term|range attrs

'<[A-Za-z_:][\w:.-]*
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/o attr) no term|range attrs

'<
  (?:
    [A-Za-z_:][\w:.-]*
    (?=\s)
    (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
    \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
  |
    /?[A-Za-z_:][\w:.-]*\s*/?
  )
>'

Update

An alternative that avoids the (?>) atomic-group construct.
The regexes below match tags that have no term or range attribute.
Flags: (g) global and (s) dotall

span tag w/attr
link: http://regexr.com?2vrjr
regex: <span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/attr
link: http://regexr.com?2vrju
regex: <[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/attr or wo/attr
link: http://regexr.com?2vrk1
regex: <(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>

To match every tag except the ones that have term="occasionally":

link: http://regexr.com?2vrka
<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
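
For anyone who wants to try the update outside regexr, here is a short sketch that runs the "any tag w/attr" pattern from the update above. It assumes Python's standard re module, where re.DOTALL plays the role of the (s) flag and findall the role of (g). The names ANY_TAG_WITH_ATTRS, sample and matches are illustrative only.

```python
import re

# Pattern copied verbatim from the "any tag w/attr" line of the update.
# Every single quote in it is backslash-escaped, which a raw string tolerates.
ANY_TAG_WITH_ATTRS = re.compile(
    r'''<[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>''',
    re.DOTALL,
)

# The MEDICATIONS line from the question's sample HTML.
sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# The pattern has no capture groups, so findall returns the whole matches.
matches = ANY_TAG_WITH_ATTRS.findall(sample)
print(matches)
```

Only the first span is returned: the second one carries term and range, and the closing tags have no attributes, so this variant skips them.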

女痞
#5 · 2019-03-17 07:43

I think this regex will work.

It matches any HTML tag whose first attribute is style.

<\s*\w*\s*style.*?>

You can test it at https://regex101.com

啃猪蹄的小仙女
#6 · 2019-03-17 07:47
<\w+\s+(?!term).*?>(.*?)</.*?>