Regular Expression for finding phone numbers [dupl

Posted 2019-01-26 08:45

Question:

Possible Duplicates:
A comprehensive regex for phone number validation
grep with regex for phone number

Hello Everyone,

I am new to Stack Overflow and I have a quick question. Let's assume we are given a large number of HTML files (large as in theoretically infinite). How can I use regular expressions to extract the list of phone numbers from all those files?

An explanation or an expression would be really appreciated. The phone numbers can be in any of the following formats:

  • (123) 456 7899
  • (123).456.7899
  • (123)-456-7899
  • 123-456-7899
  • 123 456 7899
  • 1234567899

Thanks a lot for all your help and have a good one!

Answer 1:

/^[-.() ]*([0-9]{3})[-.() ]*([0-9]{3})[-.() ]*([0-9]{4})$/

Should accomplish what you are trying to do.

The leading ^ means "start of the line" and the trailing $ means "end of the line"; together they force the pattern to account for the whole string.

The [-.() ]* parts mean "any period, hyphen, parenthesis, or space, appearing 0 or more times". The hyphen comes first inside the brackets so that it is taken literally rather than as a range.

The ([0-9]{3}) clusters each match a group of 3 digits (the last one is set to match 4).
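
To try the pattern out, here is a minimal Python sketch. It assumes the separator class is written as [-.() ] (hyphen first, so it cannot be read as a range), and the helper name split_number is only for illustration.

```python
import re

# Assumption: the separator class is written with the hyphen first so that it
# is a literal hyphen and not a range operator.
SEP = r"[-.() ]*"
PHONE = re.compile(
    "^" + SEP + r"([0-9]{3})" + SEP + r"([0-9]{3})" + SEP + r"([0-9]{4})$"
)

def split_number(text):
    """Return (area, prefix, line) if the whole string is a phone number, else None."""
    m = PHONE.match(text)
    return m.groups() if m else None

samples = [
    "(123) 456 7899", "(123).456.7899", "(123)-456-7899",
    "123-456-7899", "123 456 7899", "1234567899",
]
for s in samples:
    print(s, "->", split_number(s))
```

Because of the ^ and $ anchors this only accepts strings that are nothing but a phone number, so something like "call 123-456-7899 now" is rejected.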

Hope that helps!



Answer 2:

Without knowing which language you're using, I can't be sure whether the syntax below is correct for it.

This should match all of your groups with very few false positives:

/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/

The groups you will be interested in after the match are groups 1, 3, and 4. Group 2 exists only to make sure the first and second separator characters (space, ., or -) are the same.
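
As a rough illustration of the backreference, here is the same expression in Python (my choice of language, the answer does not name one). The function name parts is made up for the example.

```python
import re

PHONE = re.compile(r"\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})")

def parts(text):
    """Return groups 1, 3 and 4 of the first match, or None if nothing matches."""
    m = PHONE.search(text)
    return (m.group(1), m.group(3), m.group(4)) if m else None

print(parts("call 123-456-7899 today"))  # separators agree, so this matches
print(parts("123-456.7899"))             # '-' then '.', so \2 rejects it
```

The second call returns None because \2 must repeat whatever group 2 captured, and a hyphen followed by a period does not.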

For example, a sed command that strips the separators and leaves phone numbers in the form 1234567899:

sed "s/(\{0,1\}\([0-9]\{3\}\))\{0,1\}\([ .-]\{0,1\}\)\([0-9]\{3\}\)\2\([0-9]\{4\}\)/\1\3\4/"
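
If sed is not available, roughly the same substitution can be sketched with Python's re.sub. The function name normalise is hypothetical. One difference to keep in mind: re.sub replaces every match in the string, while the sed command as written replaces only the first match on each line.

```python
import re

PHONE = re.compile(r"\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})")

def normalise(text):
    # Keep the three digit groups (1, 3, 4) and drop the separator (group 2).
    return PHONE.sub(r"\1\3\4", text)

print(normalise("Office: (123) 456 7899, mobile: 123.456.7899"))
```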

Here are the false positives of my expression:

  • (123)4567899
  • (1234567899
  • (123 456 7899
  • (123.456.7899
  • (123-456-7899
  • 123)4567899
  • 123) 456 7899
  • 123).456.7899
  • 123)-456-7899

Breaking the expression up into two parts, one that matches with parentheses and one that matches without them, will eliminate all of these false positives except for the first one:

/\(([0-9]{3})\)([ .-]?)([0-9]{3})\2([0-9]{4})|([0-9]{3})([ .-]?)([0-9]{3})\5([0-9]{4})/

Groups 1, 3, and 4 (when the first alternative matches) or groups 5, 7, and 8 (when the second one does) would matter in this case.



Answer 3:

This will help you catch the ones with an area code in parentheses:

([0-9]\{3\})[ .-][0-9]\{3\}[ .-][0-9]\{4\}

The others are:

[0-9]\{3\}[ -][0-9]\{3\}[ -][0-9]\{4\}
[0-9]\{10\}

I kept the first and the second pattern separate because merging them into one pattern with optional parentheses could get you into accepting (123 456 7890 or 123) 456 7890

Note also that on my terminal using grep, I had to escape the { } for the repetition. You may not have to, or you may have to escape other characters depending on where you intend to use this.
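
To show how the escaping changes between tools, here are the same three patterns rewritten for Python's re module, where it is the other way round: braces are left unescaped and literal parentheses need a backslash. The function name find_numbers is just for illustration.

```python
import re

# Same three patterns as above, in Python syntax: {n} is repetition as-is,
# while \( and \) are literal parentheses.
PATTERNS = [
    re.compile(r"\([0-9]{3}\)[ .-][0-9]{3}[ .-][0-9]{4}"),  # (123) 456 7899
    re.compile(r"[0-9]{3}[ -][0-9]{3}[ -][0-9]{4}"),        # 123-456-7899
    re.compile(r"[0-9]{10}"),                               # 1234567899
]

def find_numbers(text):
    """Collect the matches of each pattern in turn."""
    found = []
    for pattern in PATTERNS:
        found.extend(pattern.findall(text))
    return found

print(find_numbers("a (123) 456 7899 b 123-456-7899 c 1234567899"))
```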



Answer 4:

^(\(?\d{3}\)?)([ .-])(\d{3})([ .-])(\d{4})$

This should match all except the last format. For the last one you could use a separate pattern: ^\d{10}$

There is one flaw, though: it will also match (123 456 7899

  1. ^(\(?\d{3}\)?): breaking this part down, the first character (^) matches the beginning of the text. \(? and \)? each make one parenthesis optional, and that is where the problem lies: to do it properly you would have to check whether there was an opening parenthesis and, if there was, require the closing one. I don't know if that is possible using a regex only. \d{3} matches three digits.

  2. ([ .-]) will match any of those, but only one and only once.

  3. (\d{3}) will match three numbers

  4. Same as 2

  5. (\d{4})$ four numbers followed by the end of the text ($)
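
On the parenthesis problem in step 1: some regex flavours can do this check. For instance, Python's re module supports a conditional of the form (?(1)...), which applies a sub-pattern only if group 1 took part in the match. A minimal sketch, assuming Python and with the group numbering changed from the expression above:

```python
import re

# (\()? captures an optional opening parenthesis as group 1;
# (?(1)\)) then requires a closing parenthesis only if group 1 matched.
PHONE = re.compile(r"^(\()?\d{3}(?(1)\))[ .-]\d{3}[ .-]\d{4}$")

def is_phone(text):
    return PHONE.match(text) is not None

for s in ["(123) 456 7899", "123-456-7899", "(123 456 7899", "123) 456 7899"]:
    print(s, "->", is_phone(s))
```

Not every engine has this conditional syntax (JavaScript does not, as far as I know), in which case two alternatives joined with | do the same job.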

Since you want to extract from an HTML page, you would have to drop ^ and $ so the pattern can match anywhere in the text, and set the global flag so that every occurrence is found; in JavaScript that is /exp/g
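
For comparison, a rough Python equivalent of the unanchored, global search: re.finditer walks over all non-overlapping matches, which plays the role of the g flag. The HTML snippet and the function name extract are invented for the example.

```python
import re

# The pattern from this answer with ^ and $ removed, so it can match
# anywhere inside a larger text.
PHONE = re.compile(r"\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}")

html = "<p>Sales: (123) 456 7899</p><p>Support: 123-456-7899</p>"

def extract(text):
    """Return every phone-number-like substring found in text."""
    return [m.group(0) for m in PHONE.finditer(text)]

print(extract(html))
```

For a large set of files, the same function could be called on one file's contents at a time so that only one page is held in memory.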
