Use Regex to find a phone number on a page not in

Posted 2019-01-29 02:15

Question:

I have this regular expression that searches for a phone number pattern:

[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

This matches phone numbers in these formats:

123 456 7890
(123)456 7890
(123) 456 7890
(123)456-7890
(123) 456-7890
123.456.7890
123-456-7890
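
As a quick sanity check of the formats listed above, the pattern can be run against each sample. This is only a minimal sketch: the names `phonePattern` and `samples` are illustrative, and the pattern is anchored with `^` and `$` here (an addition for the check) so that each sample has to match in full.

```javascript
// The pattern from the question, anchored so each sample must match completely.
const phonePattern = /^[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}$/;

const samples = [
  '123 456 7890',
  '(123)456 7890',
  '(123) 456 7890',
  '(123)456-7890',
  '(123) 456-7890',
  '123.456.7890',
  '123-456-7890',
];

// Collect any sample the pattern fails to match.
const unmatched = samples.filter((s) => !phonePattern.test(s));
console.log(unmatched.length === 0 ? 'all samples match' : unmatched);
```

Note that the first separator class, `[(\s)?.-]`, also accepts a literal `(`, `)` or `?`, so a string such as `123?456-7890` matches as well, while `1234567890` (no separators) does not.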

I want to scan an entire page (with JavaScript) looking for this match, but excluding matches that already exist inside an anchor. After a match is found, I want to convert the phone number into a click-to-call link for mobile devices:

(123) 456-7890 --> <a href="tel:1234567890">(123) 456-7890</a>

I'm pretty sure I need a negative lookahead. I've tried this, but it doesn't seem to be the right idea:

(?!.*(\<a href.*?\>))[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

Answer 1:

Don't use regular expressions to parse HTML. Use HTML/DOM parsers to get the text nodes (the browser can filter them for you, for instance to skip text inside anchor tags and any text too short to contain a phone number), and then check the text directly.

For example, with XPath (which is a bit ugly, but has support for dealing with text nodes directly in a way most other DOM methods do not):

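// This query finds all text nodes that are at least 12 characters long after
// whitespace normalization and that are not direct children of an anchor tag.
// Note: text nested deeper inside an anchor (e.g. inside <a><b>) is not excluded;
// not(ancestor::a) would cover that case.
// Letting XPath apply basic filters dramatically reduces the number of nodes
// you need to process (there are tons of short and/or pure whitespace text nodes
// in most DOMs)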
var xpr = document.evaluate('descendant-or-self::text()[not(parent::A) and string-length(normalize-space(self::text())) >= 12]',
                            document.body, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i=0, len=xpr.snapshotLength; i < len; ++i) {
    var txt = xpr.snapshotItem(i);
    // Split with a capturing group so the matched numbers are kept in the result
    var numbers = txt.data.split(/([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/);
    // split returns at least three items on a hit: prefix, matched number, and suffix
    if (numbers.length >= 3) {
        var parent = txt.parentNode; // Save parent before replacing child
        // Build the replacement in a fragment and swap it in for this text node only;
        // assigning parent.textContent would also wipe out the node's siblings
        var frag = document.createDocumentFragment();
        frag.appendChild(document.createTextNode(numbers[0]));

        // Now explicitly create pairs of anchors and following text nodes
        for (var j = 1; j < numbers.length; j += 2) {
            // Operate in pairs; odd index is phone number, even is
            // text following that phone number
            var anc = document.createElement('a');
            anc.href = 'tel:' + numbers[j].replace(/\D+/g, '');
            anc.textContent = numbers[j];
            frag.appendChild(anc);
            frag.appendChild(document.createTextNode(numbers[j+1]));
        }
        parent.replaceChild(frag, txt);
        parent.normalize(); // Merge adjacent text nodes and drop empty ones
    }
}
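
The split-in-pairs step is the least obvious part of the loop above, so here it is in isolation, without any DOM calls. This is only a sketch: `toSegments` is a hypothetical helper name, not part of the answer, and it simply labels each piece that `split` returns.

```javascript
// Same pattern as above, wrapped in a capturing group so split() keeps the matches.
const phoneSplitter = /([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/;

// Hypothetical helper: turn a string into plain-text and phone-number segments.
function toSegments(text) {
  const parts = text.split(phoneSplitter);
  return parts.map((part, index) => {
    if (index % 2 === 1) {
      // Odd indexes hold the captured phone numbers.
      return { type: 'phone', text: part, href: 'tel:' + part.replace(/\D+/g, '') };
    }
    // Even indexes hold the text around them (possibly an empty string).
    return { type: 'text', text: part };
  });
}

const segments = toSegments('Call (123) 456-7890 or 123.456.7890 today');
console.log(segments);
```

A string without any phone number comes back as a single text segment, which is why the loop above only rebuilds the node when `split` returns three or more items.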

For the record, the basic filters help a lot on most pages. For example, on this page, right now, as I see it (will vary by user, browser, browser extensions and scripts, etc.), the snapshot for the unfiltered query 'descendant-or-self::text()' would have 1794 items. Omitting text parented by anchor tags, 'descendant-or-self::text()[not(parent::A)]', gets it down to 1538, and the full query, which also checks that the whitespace-normalized content is at least twelve characters long, gets it down to 87 items. Applying the regex to 87 items is chump change, performance-wise, and you've removed the need to parse HTML with an unsuitable tool.



Answer 2:

Use this as your regex:

(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))

Use this as your replace string:

<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>

This finds all phone numbers, both outside and inside anchor tags; in every case it captures the parts of the phone number in specific regex groups. You can therefore wrap each phone number found in a new anchor tag because, where an anchor already exists, you are replacing the original one.
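
As a sketch of how the pattern and replacement string above might be applied in JavaScript: the `g` flag, the variable names, and the sample `html` string are assumptions added for the example, not part of the answer.

```javascript
// Pattern from this answer; the g flag (an assumption) makes replace() process every match.
const linkOrNumber = /(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))/g;

// Groups 3-5 are filled when the number sits inside an anchor, groups 7-9 otherwise;
// a group that did not participate in the match is substituted as an empty string.
const replacement = '<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>';

// Made-up sample markup: one bare number and one that is already linked.
const html = '<p>Call 123.456.7890 or <a href="tel:1234567890">(123) 456-7890</a></p>';

const linked = html.replace(linkOrNumber, replacement);
console.log(linked);
```

Two things are worth keeping in mind with this approach. The replacement rewrites every number into the `(123) 456-7890` display form, and an existing anchor that contains a number is replaced wholesale, so its original `href` and other attributes are not preserved. Also, because `.*?` can run past a closing `</a>`, an unrelated link that appears earlier on the same line as a bare number can be swallowed into the match.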

A regex group, or "capture group", captures a specific part of what the overall regular expression matched. Groups are created by enclosing part of the regex in parentheses. They are numbered from left to right by the order of their opening parentheses, and in a JavaScript replacement string the part of the input a group matched can be referenced by placing a $ in front of its number. Other implementations use \ for this purpose. This is called a back reference. Back references can appear later in the regular expression itself (where JavaScript also uses \), or in your replacement string (as done earlier in this answer). More information: http://www.regular-expressions.info/backref.html

To use a simpler example, suppose you had a document containing account numbers and other information. Each account number is preceded by the word "account", which you want to change to "acct", but "account" appears elsewhere in the document, so you cannot simply do a find and replace on it alone. You could use a regex of account ([0-9]+). In this regex, ([0-9]+) forms a group that matches the actual account number, which we can back reference as $1 in our replacement string, which becomes acct $1.
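
In JavaScript, that account example might look like the following. The sample text and variable names are made up for illustration.

```javascript
// Made-up sample text: "account" appears both before a number and on its own.
const statement = 'Your account 12345 is linked to account 67890. Check your account settings.';

// ([0-9]+) captures the digits; $1 re-inserts them after the new "acct" prefix.
const shortened = statement.replace(/account ([0-9]+)/g, 'acct $1');

console.log(shortened);
// prints "Your acct 12345 is linked to acct 67890. Check your account settings."
```

The last "account" is left alone because it is not followed by digits, so the pattern never matches there.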

You can test this out here: http://regexr.com/