I am trying to match all HTML tags that do not have the attribute "term" or "range".
Here is a sample of the HTML:
<span class="inline prewrap strong">DATE:</span> 12/01/10
<span class="inline prewrap strong">MR:</span> 1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span> Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
<span class="inline prewrap strong">MEDICATIONS:</span> <span term="Advil" range="true">Advil </span>and Ibuprofen.
My regex is: <(.*?)((?!\bterm\b).)>
Unfortunately this matches all the tags. It would be nice if the inner text were not matched, as I need to filter out all the tags except the ones with that specific attribute.
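The behaviour described above can be reproduced with a short script. This is a minimal sketch in Python (the question does not name a language, so that choice is an assumption); the pattern is copied unchanged from the question and the sample is the MEDICATIONS line. It suggests why nothing is filtered: the negative lookahead is only tested at the single character immediately before `>`, and the word `term` can never start there, so the lookahead never rejects a tag.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'<(.*?)((?!\bterm\b).)>')

sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span> '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# group(0) is the whole match; findall() would return only the two capture groups.
matched_tags = [m.group(0) for m in pattern.finditer(sample)]

for tag in matched_tags:
    print(tag)
```

All four tags in the sample are printed, including the opening tag that carries the term attribute.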
This will do what you want. It is written for a Perl program, and the syntax may differ depending on which language you are using.
The code below demonstrates this pattern in a Perl program
OUTPUT
I think you should use an HTML parser to solve this problem. Writing your own regular expression is possible, but it is very error-prone. Imagine that your code contains an expression like this:
It is also valid, but accounting for all possible spaces and tab characters in your regular expression would not be easy, and it would require testing before you could be sure that it works as expected.
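To illustrate the parser approach, here is a minimal sketch using Python's standard-library `html.parser` module. The class name `TagCollector`, the helper `tags_without_excluded_attrs`, and the irregularly spaced tag in the sample are all hypothetical, chosen only for this example; they are not taken from the answer above. The parser splits out attribute names itself, so extra spaces, tabs, newlines, or single quotes inside a tag need no special handling.

```python
from html.parser import HTMLParser

# Attribute names that disqualify a tag (from the question).
EXCLUDED_ATTRS = {"term", "range"}


class TagCollector(HTMLParser):
    """Collects the source text of start tags that lack the excluded attributes."""

    def __init__(self):
        super().__init__()
        self.kept = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower-cased by the parser.
        names = {name for name, _ in attrs}
        if not names & EXCLUDED_ATTRS:
            # get_starttag_text() returns the raw text of the tag just opened.
            self.kept.append(self.get_starttag_text())


def tags_without_excluded_attrs(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.kept


# Hypothetical sample: the second tag uses a newline, a tab, extra spaces
# and single quotes, which would complicate a hand-written regex.
sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span> '
    '<span\n\tterm=\'Advil\'   range="true" >Advil </span>and Ibuprofen.'
)

print(tags_without_excluded_attrs(sample))
```

Only the first opening tag is kept here. End tags are not collected in this sketch; overriding `handle_endtag` would be one way to include them if needed.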
If regex is your thing for this, this works for me. (Note: filtering out comments, doctype, and other entities is not included.
Another warning: tags could be embedded in scripts, comments, and other things.)
span tag (w/ attr) no term|range attrs
any tag (w/ attr) no term|range attrs
any tag (w/o attr) no term|range attrs
Update
Alternative to using (?>) construct
The regexes below match tags that have no 'term' or 'range' attribute.
Flags: (g) global and (s) dotall
span tag w/attr
link: http://regexr.com?2vrjr
regex:
<span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>
any tag w/attr
link: http://regexr.com?2vrju
regex:
<[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>
any tag w/attr or wo/attr
link: http://regexr.com?2vrk1
regex:
<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
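As a rough check, the three patterns above can be run against the MEDICATIONS line from the question. This is a sketch in Python, which is an assumption since the answer targets an online tester; `re.DOTALL` stands in for the (s) flag and `finditer` for the (g) flag. The constant names and the `find_tags` helper are made up for this example.

```python
import re

# Patterns copied verbatim from the answer above.
SPAN_WITH_ATTRS = re.compile(
    r'''<span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>''',
    re.DOTALL,
)
ANY_TAG_WITH_ATTRS = re.compile(
    r'''<[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>''',
    re.DOTALL,
)
ANY_TAG = re.compile(
    r'''<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>''',
    re.DOTALL,
)


def find_tags(pattern, text):
    """Return the full text of every match, in document order."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span> '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

for name, pattern in [
    ("span tag w/attr", SPAN_WITH_ATTRS),
    ("any tag w/attr", ANY_TAG_WITH_ATTRS),
    ("any tag w/attr or wo/attr", ANY_TAG),
]:
    print(name, find_tags(pattern, sample))
```

On this sample the first two patterns return only the class span, while the third also returns both closing tags, including the `</span>` that closes the excluded term span, because a closing tag carries no attributes to test.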
'to match every tag except the ones that have term="occasionally"'
link: http://regexr.com?2vrka
regex:
<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
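The value-specific variant can be checked the same way. Again a sketch in Python with a made-up sample string and hypothetical names (`NOT_OCCASIONALLY`, `find_tags`); the pattern is copied verbatim from the line above. The point it shows is that a term attribute with any other value no longer excludes the tag.

```python
import re

# Pattern copied verbatim from the answer; re.DOTALL corresponds to the (s) flag.
NOT_OCCASIONALLY = re.compile(
    r'''<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>''',
    re.DOTALL,
)


def find_tags(pattern, text):
    """Return the full text of every match, in document order."""
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical sample with one excluded tag and one tag whose term value differs.
sample = (
    '<p><span term="occasionally">sometimes</span> '
    '<span term="Advil">Advil</span></p>'
)

print(find_tags(NOT_OCCASIONALLY, sample))
```

Every tag except `<span term="occasionally">` is returned, so `<span term="Advil">` is kept along with the closing tags.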
I think this regex will work properly.
This regex will select the style attribute of any HTML tag.
You can check this on https://regex101.com