I want to use Lucene (in particular, Lucene.NET) to search for email address domains.
E.g. I want to search for "@gmail.com" to find all emails sent to a gmail address.
Running a Lucene query for "*@gmail.com" results in an error, because asterisks cannot appear at the start of a query. Running a query for "@gmail.com" doesn't return any matches, because "foo@gmail.com" is indexed as a whole word and you cannot search for just part of a word.
How can I do this?
I see you have your solution, but mine would have avoided the custom tokenizer: add a field called email_domain to the documents you're indexing, containing the parsed-out domain of the email address. It might sound silly, but the amount of storage associated with this is pretty minimal. If you feel like getting fancier, say because some domains have many subdomains, you could instead make a field holding the reversed domain, so you'd store com.gmail, com.company.department, or ae.eim. That way you could find all the United Arab Emirates related addresses with a prefix query of 'ae.'
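Here's a sketch of what that indexing could look like, assuming the classic Lucene.Net 2.x/3.x Field API; the field names (email, email_domain, email_domain_reversed) are just illustrative:

```csharp
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

string address = "foo@gmail.com";
string domain = address.Substring(address.IndexOf('@') + 1); // "gmail.com"

// Reverse the dot-separated labels: "gmail.com" -> "com.gmail"
string[] labels = domain.Split('.');
Array.Reverse(labels);
string reversedDomain = string.Join(".", labels);

var doc = new Document();
doc.Add(new Field("email", address, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("email_domain", domain, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("email_domain_reversed", reversedDomain, Field.Store.YES, Field.Index.NOT_ANALYZED));

// All United Arab Emirates related addresses, via a prefix query on the reversed field:
Query uaeQuery = new PrefixQuery(new Term("email_domain_reversed", "ae."));
```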
You could add a separate field that indexes the email address reversed: index 'foo@gmail.com' as 'moc.liamg@oof'. That enables you to run the query "moc.liamg@*".
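For example (a sketch assuming the classic Lucene.Net API; email_reversed is an illustrative field name, and the leading-wildcard search becomes a plain prefix query on the reversed field):

```csharp
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// At index time: store the address reversed.
string address = "foo@gmail.com";
char[] chars = address.ToCharArray();
Array.Reverse(chars);

var doc = new Document();
doc.Add(new Field("email_reversed", new string(chars), // "moc.liamg@oof"
                  Field.Store.NO, Field.Index.NOT_ANALYZED));

// At search time: "moc.liamg@*" is simply a prefix query.
Query query = new PrefixQuery(new Term("email_reversed", "moc.liamg@"));
```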
No one gave a satisfactory answer, so we started poking around the Lucene documentation and discovered we can accomplish this using custom Analyzers and Tokenizers.
The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. Once you do this, a search for "@gmail.com" will return all gmail addresses, because the Tokenizer we just created treats '@' as a word boundary, so the domain is indexed as a separate word.
Here's the source code; it's actually very simple:
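(A sketch reconstructed against the classic Lucene.Net 2.x/3.x API, where CharTokenizer exposes IsTokenChar and Analyzer exposes TokenStream:)

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Splits tokens on whitespace AND on the @ symbol.
class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Whitespace and '@' both mark a word boundary.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}

class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}
```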
That's it! Now you just need to rebuild your index and do all searches using this new Analyzer. For example, to write documents to your index:
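Something along these lines, assuming the classic IndexWriter API (indexDirectory and myDocument are placeholders for your own index location and document):

```csharp
using Lucene.Net.Index;

var writer = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer(),
                             true, IndexWriter.MaxFieldLength.UNLIMITED); // true = create a new index
writer.AddDocument(myDocument);
writer.Close();
```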
Searches should use the analyzer as well:
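A sketch, again against the classic API ("emailBody" is a placeholder field name; newer QueryParser constructors also take a Version argument first). The important part is parsing the query with the same analyzer used at index time:

```csharp
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

var searcher = new IndexSearcher(indexDirectory);
var parser = new QueryParser("emailBody", new WhitespaceAndAtSymbolAnalyzer());
Query query = parser.Parse("@gmail.com"); // tokenized to "gmail.com" by our analyzer
TopDocs hits = searcher.Search(query, 100);
```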
There is also QueryParser's setAllowLeadingWildcard option.
But be careful: it can get very expensive in terms of performance (that's why it is disabled by default). Maybe in some cases this would be an easy solution, but I would prefer a custom Tokenizer as stated by Judah Himango, too.
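For reference, a sketch of enabling it (in Lucene.Net the Java setter maps to SetAllowLeadingWildcard, or the AllowLeadingWildcard property in later versions; "email" is a placeholder field name):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

var parser = new QueryParser("email", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true); // disabled by default: a leading wildcard scans the whole term index
Query query = parser.Parse("*@gmail.com");
```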