How to extract expressions matching an email address

I have a text file that contains email addresses and some information.

I would like to know how I can extract those email addresses using R or the terminal.

I've read that I can use a regular expression that matches an email address, such as

"^[_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4})$"

But what command or function should I use to extract those emails?

There is no pattern in the text file. The command or function should just search the document and extract the email addresses.

Tags: regex r terminal command

3 Answers

成全新的幸福

Reply #2 · 2020-02-11 08:06

Let's take an unstructured example file:

this is a test

fred is fred@foo.com and joe is joe@example.com - but
 @this is a twitter handle for twit@here.com

Then if you do:

> myText <- readLines("testmail.txt")
> emails = unlist(regmatches(myText, gregexpr("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", myText)))
> emails
[1] "fred@foo.com"    "joe@example.com" "twit@here.com"

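It extracts a vector of all the emails, including when there's more than one on a line. I don't think it will find email addresses broken over line breaks, but if you paste the read lines together it might do that too (note that collapse=" " puts a space at each join, so an address split in the middle would need collapse="" instead):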

> myText = paste(readLines("testmail.txt"),collapse=" ")
> emails = regmatches(myText, gregexpr("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", myText))
> emails
[[1]]
[1] "fred@foo.com"    "joe@example.com" "twit@here.com"

In this case there's only one string in myText because we pasted all the lines together, so the returned list, emails, has only one element.

Note that this regex string isn't a strict definition of a valid email address. For example, it only accepts addresses whose final part, after the last dot, is 2 to 4 letters long, so it doesn't match fred@foo.fnord. There are top-level domains longer than four characters, so you may need to modify the regex.

Also, it only matches lower-case letters, digits, underscores, hyphens and dots in the name part, so valid addresses such as foo+bar@google.com won't match.

A regex that fixes these two issues might be:

 "([_+a-z0-9-]+(\\.[_+a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,14}))"

but it probably has other issues and you'd be better off searching for a better email address regex online. I say better, because a perfect one doesn't exist...
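
As a quick check of the relaxed pattern, here is a minimal sketch run on a made-up character vector rather than a file (the sample strings and the variable names are hypothetical). It also passes ignore.case = TRUE to gregexpr, which is my own addition: the character classes only list lower-case letters, so an address like Joe@Example.COM would otherwise be cut short or missed.

```r
# Relaxed pattern from above: allows "+" in the name and TLDs up to 14 letters.
pattern <- "([_+a-z0-9-]+(\\.[_+a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,14}))"

# Hypothetical sample input, one element per line of text.
sample_text <- c("fred@foo.fnord and foo+bar@google.com",
                 "no address on this line",
                 "Contact Joe@Example.COM today")

# gregexpr locates every match in each element;
# regmatches pulls out the matched text, and unlist flattens the per-line list.
found <- unlist(regmatches(sample_text,
                           gregexpr(pattern, sample_text, ignore.case = TRUE)))
print(found)
```

Lines without an address simply contribute nothing to the result, so the vector only holds the addresses that were found.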

家丑人穷心不美

Reply #3 · 2020-02-11 08:28

This can also work, using str_extract_all from the stringr package:

 library(stringr)
 aa <- paste(readLines("C:\\MY_FOLDER\\NOI\\file1sample.txt"),collapse = " ")
 temp <- sapply(str_extract_all(aa,"[a-z_+0-9]+\\@\\w+\\.[a-z]{2,4}"), function(x){ paste(x,collapse = " ")})

Summer. ? 凉城

Reply #4 · 2020-02-11 08:29

Read your file into R and use grep.

myText <- readLines("your.file")
Emails <- grep("^[_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4})$", myText, value = TRUE)

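This returns every line that matches, as a whole line. Because the pattern is anchored with ^ and $, a line only matches when it contains nothing but an email address; if there is other information on the line, you will need to split it up first using something like strsplit.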
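
One way to do that splitting, as a rough sketch: break each line on whitespace with strsplit and then grep the individual tokens, so the anchored pattern is tested against one word at a time. The sample lines are made up, and splitting on whitespace is my assumption about the file; a token with trailing punctuation, such as "fred@foo.com," with a comma attached, would not match the anchored pattern.

```r
# Hypothetical lines, standing in for the result of readLines("your.file").
myText <- c("fred is fred@foo.com and joe is joe@example.com - but",
            "this is a test")

# Split every line on runs of whitespace and flatten into one vector of tokens.
tokens <- unlist(strsplit(myText, "[[:space:]]+"))

# Same anchored pattern as above: a token must be an address and nothing else.
pattern <- "^[_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4})$"

# value = TRUE returns the matching tokens instead of their positions.
Emails <- grep(pattern, tokens, value = TRUE)
print(Emails)
```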
