I have this regular expression that searches for a phone number pattern:
[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}
This matches phone numbers in this format:
123 456 7890
(123)456 7890
(123) 456 7890
(123)456-7890
(123) 456-7890
123.456.7890
123-456-7890
I want to scan an entire page (with JavaScript) for this pattern, excluding any match that is already inside an anchor. After a match is found, I want to convert the phone number into a click-to-call link for mobile devices:
(123) 456-7890 --> <a href="tel:1234567890">(123) 456-7890</a>
I'm pretty sure I need a negative lookahead. I've tried this, but it doesn't seem to be the right idea:
(?!.*(\<a href.*?\>))[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}
Don't use regular expressions to parse HTML. Use HTML/DOM parsers to get the text nodes (the browser can filter them down for you, for instance by excluding text inside anchor tags and any text too short to contain a phone number), and then check the text directly.
For example, with XPath (which is a bit ugly, but has support for dealing with text nodes directly in a way most other DOM methods do not):
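The approach described here can be sketched roughly as follows. This is a minimal illustration and not necessarily the original code: the helper names `splitPhoneText` and `linkifyPhoneNumbers` are hypothetical, the capture groups around the digit runs are added to the question's pattern for building the `tel:` link, and the XPath length check is one assumed way to express "at least twelve non-whitespace-trimmed characters".

```javascript
// Pattern from the question, with capture groups added around the digit runs (assumption).
const PHONE_SOURCE = String.raw`[(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})`;

// Pure helper (hypothetical name): split a string into plain pieces and phone pieces.
function splitPhoneText(text) {
  const re = new RegExp(PHONE_SOURCE, 'g');
  const parts = [];
  let last = 0;
  for (const m of text.matchAll(re)) {
    if (m.index > last) parts.push({ text: text.slice(last, m.index) });
    parts.push({ text: m[0], tel: m[1] + m[2] + m[3] });
    last = m.index + m[0].length;
  }
  if (last < text.length) parts.push({ text: text.slice(last) });
  return parts;
}

// Browser-only (hypothetical name): select candidate text nodes with XPath,
// then replace each phone number with a tel: link.
function linkifyPhoneNumbers(root = document.body) {
  // Skip text whose parent is an anchor and text too short to hold a phone number.
  const query = 'descendant-or-self::text()[not(parent::a) and ' +
    'string-length(normalize-space(self::text())) >= 12]';
  const snapshot = document.evaluate(query, root, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    const node = snapshot.snapshotItem(i);
    const parts = splitPhoneText(node.data);
    if (!parts.some((p) => p.tel)) continue; // no phone number in this node
    const frag = document.createDocumentFragment();
    for (const p of parts) {
      if (p.tel) {
        const a = document.createElement('a');
        a.href = 'tel:' + p.tel;
        a.textContent = p.text;
        frag.appendChild(a);
      } else {
        frag.appendChild(document.createTextNode(p.text));
      }
    }
    node.parentNode.replaceChild(frag, node);
  }
}
```

Calling `linkifyPhoneNumbers()` once the page has loaded would process the whole body. Note that `not(parent::a)` only excludes text directly inside an anchor; `not(ancestor::a)` would also cover text nested deeper within a link.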
For the record, the basic filters help a lot on most pages. For example, on this page, right now, as I see it (will vary by user, browser, browser extensions and scripts, etc.) without the filters, the snapshot for the query
'descendant-or-self::text()'
would have 1794 items. Omitting text parented by anchor tags,
'descendant-or-self::text()[not(parent::A)]'
gets it down to 1538, and the full query, which also verifies that the non-whitespace content is at least twelve characters long, gets it down to 87 items. Applying the regex to 87 items is chump change, performance-wise, and you've removed the need to parse HTML with an unsuitable tool.
Use this as your regex:
Use this as your replace string:
This finds all phone numbers, both inside and outside of anchor tags; in every case it captures the phone number itself in specific regex groups. You can therefore wrap each phone number found in a new anchor tag because, where anchor tags already exist, they are part of the match and get replaced along with it.
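The exact regex and replacement string are not reproduced above, so the following is only a hypothetical reconstruction of the idea, not the answer's original pattern. It assumes an existing link contains nothing but the phone number, and the function name `linkifyHtmlString` is made up for illustration. An optional opening `<a ...>` and closing `</a>` are matched around the number so that they are consumed and rewritten together with it.

```javascript
// Group 1: optional opening anchor tag, group 2: the phone number as written,
// groups 3-5: its digit runs, group 6: optional closing anchor tag.
const PHONE_WITH_OPTIONAL_ANCHOR =
  /(<a\b[^>]*>\s*)?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))(\s*<\/a>)?/g;

// Replacement: a fresh tel: link built from the digit groups,
// showing the number exactly as it was written.
const REPLACEMENT = '<a href="tel:$3$4$5">$2</a>';

function linkifyHtmlString(html) {
  return html.replace(PHONE_WITH_OPTIONAL_ANCHOR, REPLACEMENT);
}

const sample = 'Call (123) 456-7890 or <a href="tel:5551234567">555-123-4567</a>.';
console.log(linkifyHtmlString(sample));
```

In this sketch the bare number gains a link and the already-linked number is rewritten into an equivalent link, so no nested anchors result. A link that wraps extra text around the number would not be handled cleanly, which is one reason the DOM-based approach is more robust.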
A regex group, or "capture group", captures a specific part of what the overall regex matched. Groups are created by enclosing part of the regex in parentheses. They are numbered from left to right by the order of their opening parentheses, and the part of the input a group matched can be referenced by placing a $ in front of that number in a JavaScript replacement string. Inside the regex itself, and in the replacement strings of some other implementations, \ is used for this purpose. This is called a back reference. Back references can appear later in your regex, or in your replacement string (as done earlier in this answer). More information: http://www.regular-expressions.info/backref.html

To use a simpler example, suppose you had a document containing account numbers and other information. Each account number is preceded by the word "account", which you want to change to "acct", but "account" appears elsewhere in the document, so you cannot simply do a find and replace on it alone. You could use a regex of account ([0-9]+). In this regex, ([0-9]+) forms a group that matches the actual account number, which we can back reference as $1 in our replacement string, which becomes acct $1.

You can test this out here: http://regexr.com/
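As a quick illustration of the account example in JavaScript (the function names and sample sentence are made up for this sketch):

```javascript
// Replace "account <number>" with "acct <number>", leaving other uses of "account" alone.
function abbreviateAccounts(text) {
  // ([0-9]+) is capture group 1; $1 in the replacement string refers back to it.
  return text.replace(/account ([0-9]+)/g, 'acct $1');
}

// A back reference inside the regex itself uses \1: this detects a doubled word.
function hasRepeatedWord(text) {
  return /\b(\w+) \1\b/.test(text);
}

console.log(abbreviateAccounts('The account holder opened account 12345 and account 67890.'));
// prints "The account holder opened acct 12345 and acct 67890."
console.log(hasRepeatedWord('the the')); // prints true
```

The word "account" in "account holder" is left untouched because it is not followed by digits, so the pattern does not match there.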