regex to match html tags with specific attributes

I am trying to match all HTML tags that do not have the attribute "term" or "range".

Here is a sample of the HTML:

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

My regex is: <(.*?)((?!\bterm\b).)>

Unfortunately this matches all the tags. Ideally the inner text would not be matched either, since I need to filter out every tag except the ones that have those specific attributes.

Tags: regex pattern-matching string-matching

5 answers

神经病院院长

#2 · 2019-03-17 07:21

This will do what you want. It is written for Perl, and the syntax may differ depending on the language you are using.

/(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /igx

The code below demonstrates this pattern in a Perl program.

use strict;
use warnings;

# The negative lookahead rejects any position from which term= or range= appears before the next '>'
my $pattern = qr/ (?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /ix;

my $str = <<'END';

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

END

print "$_\n" foreach $str =~ /$pattern/g;

OUTPUT

<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
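For readers not working in Perl, here is a rough Python port of the same pattern, offered as a sketch rather than a drop-in replacement. The function name tags_without_term_or_range and the shortened sample string are illustrative assumptions and not part of the original answer.

```python
import re

# Python port of the Perl pattern above:
# re.VERBOSE plays the role of /x and re.IGNORECASE the role of /i.
PATTERN = re.compile(
    r"""
    (?! [^>]+ \b(?:term|range)= )   # fail if term= or range= occurs before the next '>'
    ( <[a-z]+ .*? > )               # capture an opening tag
    """,
    re.IGNORECASE | re.VERBOSE,
)


def tags_without_term_or_range(html):
    """Return the opening tags that carry neither a term nor a range attribute."""
    # With a single capture group, findall returns the captured tag strings.
    return PATTERN.findall(html)


sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)
print(tags_without_term_or_range(sample))
```

Like the Perl version, this only reports opening tags, because <[a-z]+ requires a letter directly after the angle bracket.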

forever°为你锁心

#3 · 2019-03-17 07:35

I think you should use an HTML parser to solve this problem. Writing your own regular expression is possible, but it is very likely to be error-prone. Imagine that your markup contains something like this:

<span      class = "a"              >b</span         >

This is also valid (whitespace is allowed around = and before >, though not directly after < or </). Accounting for every possible space and TAB character in a regular expression is not easy, and it would need thorough testing before you could be sure it works as expected.
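As an illustration of the parser route, here is a minimal sketch using Python's standard html.parser module. The class name TagFilter and the helper tags_without are made up for this example; only HTMLParser, handle_starttag, and get_starttag_text come from the standard library.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Collect start tags that carry none of the excluded attributes."""

    def __init__(self, excluded=("term", "range")):
        super().__init__()
        self.excluded = set(excluded)
        self.kept = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if not names & self.excluded:
            # get_starttag_text() returns the raw text of the tag just parsed.
            self.kept.append(self.get_starttag_text())


def tags_without(html, excluded=("term", "range")):
    """Return the raw start tags in html that have no excluded attribute."""
    parser = TagFilter(excluded)
    parser.feed(html)
    parser.close()
    return parser.kept


sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)
print(tags_without(sample))
```

The parser takes care of quoting and whitespace inside the tag, so the filter itself reduces to a set intersection on attribute names.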

Fickle 薄情

#4 · 2019-03-17 07:40

If regex is your thing for this, this works for me. (Note: filtering out comments, doctype and other entities is not included.
Other warnings: tags could be embedded in scripts, comments and other things.)

span tag (w/ attr) no term|range attrs

'<span
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ attr) no term|range attrs

'<[A-Za-z_:][\w:.-]*
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ or w/o attr) no term|range attrs

'<
  (?:
    [A-Za-z_:][\w:.-]*
    (?=\s)
    (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
    \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
  |
    /?[A-Za-z_:][\w:.-]*\s*/?
  )
>'

Update

Alternative to using the (?>) construct.
The regexes below are for no-'term|range'-attributes.
Flags = (g) global and (s) dotall

span tag w/attr
link: http://regexr.com?2vrjr
regex: <span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/attr
link: http://regexr.com?2vrju
regex: <[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/attr or wo/attr
link: http://regexr.com?2vrk1
regex: <(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>

'to match every tag except the ones that have term="occasionally"'

link: http://regexr.com?2vrka
<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
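To try the "any tag w/attr or wo/attr" variant outside regexr, the sketch below runs it through Python's re module, which accepts this form because it avoids the (?>) construct. The pattern is the one from the Update, only spread over several lines with re.VERBOSE; the name find_tags and the shortened sample string are assumptions made for this example.

```python
import re

# The "any tag w/attr or wo/attr" regex from the Update, laid out with re.VERBOSE.
# re.DOTALL corresponds to the (s) flag; findall supplies the (g) behaviour.
ANY_TAG_NO_TERM_RANGE = re.compile(
    r"""
    <
    (?:
        [A-Za-z_:][\w:.-]*
        (?=\s)
        (?! (?:[^>"\']|"[^"]*"|\'[^\']*\')*? (?<=\s) (?:term|range) \s*= )
        (?!\s*/?>)
        \s+ (?:".*?"|\'.*?\'|[^>]*?)+
      |
        /?[A-Za-z_:][\w:.-]*\s*/?
    )
    >
    """,
    re.VERBOSE | re.DOTALL,
)


def find_tags(html):
    """Return every tag matched by the pattern, in document order."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ANY_TAG_NO_TERM_RANGE.findall(html)


sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)
print(find_tags(sample))
```

Note that closing tags are matched through the second alternative, so the closing tag of the term/range span is still reported even though its opening tag is skipped.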

女痞

#5 · 2019-03-17 07:43

I think this regex will work properly.

This regex matches any HTML tag in which style directly follows the tag name (the whole tag is matched, not just the attribute).

<\s*\w*\s*style.*?>

You can check this on https://regex101.com

啃猪蹄的小仙女

#6 · 2019-03-17 07:47

<\w+\s+(?!term).*?>(.*?)</.*?>
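A quick way to see what this pattern does and does not exclude is to probe it with a few inputs. The sketch below uses Python's re module; the helper name matches and the test strings are made up for illustration.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r'<\w+\s+(?!term).*?>(.*?)</.*?>')


def matches(html):
    """Return True if the pattern finds a tag pair anywhere in html."""
    return PATTERN.search(html) is not None


print(matches('<span class="x">DATE:</span>'))                # no term attribute
print(matches('<span term="Advil">Advil </span>'))            # term is the first attribute
print(matches('<span class="x" term="Advil">Advil </span>'))  # term is a later attribute
print(matches('<span  term="Advil">Advil </span>'))           # two spaces before term
```

The lookahead only inspects the text right after the whitespace that follows the tag name, so a term attribute in a later position, or one preceded by extra whitespace, still gets through. The pattern also does not check for range at all.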
