Use Regex to find a phone number on a page not in

I have this regular expression that searches for a phone number pattern:

[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

This matches phone numbers in this format:

123 456 7890
(123)456 7890
(123) 456 7890
(123)456-7890
(123) 456-7890
123.456.7890
123-456-7890
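
As a quick sanity check, the pattern can be run against the formats listed above. This is only a sketch: `phonePattern` and `samples` are names made up for illustration, and the last line shows an assumption worth knowing about, namely that a number with no separators at all is not matched.

```javascript
// Pattern copied verbatim from the question; the variable names are illustrative.
const phonePattern = /[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}/;

const samples = [
  '123 456 7890',
  '(123)456 7890',
  '(123) 456 7890',
  '(123)456-7890',
  '(123) 456-7890',
  '123.456.7890',
  '123-456-7890',
];

// Every listed format should match.
const allMatch = samples.every((s) => phonePattern.test(s));
console.log(allMatch);

// A run of ten digits has no separator, so the pattern does not match it.
const bareDigitsMatch = phonePattern.test('1234567890');
console.log(bareDigitsMatch);
```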

I want to scan an entire page (with JavaScript) for this pattern, excluding any match that is already inside an anchor. Once a match is found, I want to convert the phone number into a click-to-call link for mobile devices:

(123) 456-7890 --> <a href="tel:1234567890">(123) 456-7890</a>

I'm pretty sure I need a negative lookahead. I've tried this, but it doesn't seem to be the right idea:

(?!.*(\<a href.*?\>))[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

Tags: javascript regex negative-lookahead

2 Answers

倾城　Initia

#2 · 2019-01-29 02:39

Don't use regular expressions to parse HTML. Use HTML/DOM parsers to get the text nodes (the browser can filter them down for you, for instance by skipping text inside anchor tags and text too short to contain a phone number) and then check the text directly.

For example, with XPath (which is a bit ugly, but has support for dealing with text nodes directly in a way most other DOM methods do not):

// This query finds all text nodes that are at least 12 characters long once
// whitespace is trimmed and collapsed, and that are not direct children of an
// anchor tag (not(ancestor::a) would also skip text nested deeper in an anchor).
// Letting XPath apply basic filters dramatically reduces the number of nodes
// you need to process (there are tons of short and/or pure whitespace text nodes
// in most DOMs)
var xpr = document.evaluate('descendant-or-self::text()[not(parent::A) and string-length(normalize-space(self::text())) >= 12]',
                            document.body, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i=0, len=xpr.snapshotLength; i < len; ++i) {
    var txt = xpr.snapshotItem(i);
    // Splits with grouping to preserve the text split on
    var numbers = txt.data.split(/([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/);
    // split will return at least three items on a hit, prefix, split match, and suffix
    if (numbers.length >= 3) {
        // Build the replacement in a fragment so the text node's siblings are left untouched
        // (assigning parent.textContent would wipe out every other child of the parent)
        var frag = document.createDocumentFragment();
        frag.appendChild(document.createTextNode(numbers[0])); // Text before first number

        // Now explicitly create pairs of anchors and following text nodes
        for (var j = 1; j < numbers.length; j += 2) {
            // Operate in pairs; odd index is phone number, even is
            // text following that phone number
            var anc = document.createElement('a');
            anc.href = 'tel:' + numbers[j].replace(/\D+/g, '');
            anc.textContent = numbers[j];
            frag.appendChild(anc);
            frag.appendChild(document.createTextNode(numbers[j+1]));
        }
        var parent = txt.parentNode; // Save parent before replacing child
        parent.replaceChild(frag, txt);
        parent.normalize(); // Merge adjacent text nodes and drop empty ones
    }
}
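
To see what the split step produces without a DOM, here is a minimal sketch that can be run on a plain string. `toSegments` is a hypothetical helper, not part of the code above; it only mirrors the "odd index is a phone number" logic and builds the same `tel:` value.

```javascript
// Same pattern as above, wrapped in a capture group so split() keeps the matches.
const PHONE = /([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/;

// Hypothetical helper: returns alternating text and phone segments.
function toSegments(text) {
  return text.split(PHONE).map((part, index) =>
    index % 2 === 1
      ? { type: 'phone', text: part, href: 'tel:' + part.replace(/\D+/g, '') }
      : { type: 'text', text: part }
  );
}

const segments = toSegments('Call (123) 456-7890 or 123.456.7890 today');
console.log(segments.map((s) => s.type + ': ' + JSON.stringify(s.text)));
```

A string with no phone number comes back as a single text segment, which is why the loop above only rebuilds nodes when the split yields at least three items.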

For the record, the basic filters help a lot on most pages. For example, on this page, right now, as I see it (will vary by user, browser, browser extensions and scripts, etc.) without the filters, the snapshot for the query 'descendant-or-self::text()' would have 1794 items. Omitting text parented by anchor tags, 'descendant-or-self::text()[not(parent::A)]' gets it down to 1538, and the full query, verifying that the non-whitespace content is at least twelve characters long gets it down to 87 items. Applying the regex to 87 items is chump change, performance-wise, and you've removed the need to parse HTML with an unsuitable tool.


Luminary・发光体

#3 · 2019-01-29 02:55

Use this as your regex:

(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))

Use this as your replace string:

<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>

This finds all phone numbers, both outside and inside anchor tags; in every case it captures the parts of the phone number in specific regex groups. You can therefore wrap each phone number in a new anchor tag, because where an anchor already exists, the match covers the whole original anchor and the replacement takes its place.
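
A minimal sketch of how this could be applied in JavaScript, assuming the markup is available as a string (the `html` sample and variable names are made up for illustration, and the `g` flag is added so every match is replaced). Groups that did not take part in a match expand to an empty string, which is why `$3$7` yields the area code in both cases.

```javascript
// Regex and replacement string copied from this answer, with the global flag added.
const linkify = /(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))/g;
const replacement = '<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>';

// Illustrative input: one bare number and one number already inside an anchor.
const html = '<p>Call 123.456.7890 or <a href="tel:5556667777">555-666-7777</a></p>';

const out = html.replace(linkify, replacement);
console.log(out);
```

Note that an existing anchor is rebuilt from the captured digits alone, so its original href, any other attributes, and any other text inside it are discarded, and the displayed number is normalized to the "(123) 456-7890" form.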

A regex group or "capture group" captures a specific part of what matched the overall regular expression. Groups are created by enclosing part of the regex in parentheses. They are numbered from left to right by the order of their opening parenthesis, and the part of the input they match can be referenced by placing a $ in front of that number in JavaScript. Other implementations use \ for this purpose. This is called a back reference. Back references can appear later in your regular expression, or in your replacement string (as done earlier in this answer). More information: http://www.regular-expressions.info/backref.html

To use a simpler example, suppose you had a document containing account numbers and other information. Each account number is preceded by the word "account", which you want to change to "acct", but "account" appears elsewhere in the document, so you cannot simply do a find and replace on it alone. You could use a regex of account ([0-9]+). In this regex, ([0-9]+) forms a group that matches the actual account number, which we can back reference as $1 in our replacement string, which becomes acct $1.
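
In JavaScript, that example might look like the following sketch. The sample text in `doc` is invented; only "account" followed by a space and digits is rewritten, and other uses of the word are left alone.

```javascript
// Invented sample text with "account" used both ways.
const doc = 'Take the account details into account: account 12345 and account 678.';

// $1 in the replacement string refers to the digits captured by ([0-9]+).
const updated = doc.replace(/account ([0-9]+)/g, 'acct $1');
console.log(updated);
```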

You can test this out here: http://regexr.com/

