I want to use Lucene (in particular, Lucene.NET) to search for email address domains.
E.g. I want to search for "@gmail.com" to find all emails sent to a gmail address.
Running a Lucene query for "*@gmail.com" results in an error, because asterisks cannot appear at the start of a query. Running a query for "@gmail.com" doesn't return any matches, because "foo@gmail.com" is indexed as a whole word and you cannot search for just part of a word.
How can I do this?
I see you have your solution, but mine would have avoided the custom tokenizer: add a field called email_domain to the documents you're indexing, containing the parsed-out domain of the email address. It might sound silly, but the amount of storage associated with this is pretty minimal. If you feel like getting fancier, say because some domains have many subdomains, you could instead make a field holding the reversed domain, so you'd store com.gmail, com.company.department, or ae.eim. That way you could find all the United Arab Emirates related addresses with a prefix query of 'ae.'
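Here's a sketch of what that indexing could look like, assuming the classic Lucene.Net 2.x/3.x Field API; the field names (email, email_domain, email_domain_reversed) are just illustrative:

```csharp
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

string address = "foo@gmail.com";
string domain = address.Substring(address.IndexOf('@') + 1); // "gmail.com"

// Reverse the dot-separated labels: "gmail.com" -> "com.gmail"
string[] labels = domain.Split('.');
Array.Reverse(labels);
string reversedDomain = string.Join(".", labels);

var doc = new Document();
doc.Add(new Field("email", address, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("email_domain", domain, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("email_domain_reversed", reversedDomain, Field.Store.YES, Field.Index.NOT_ANALYZED));

// All United Arab Emirates related addresses, via a prefix query on the reversed field:
Query uaeQuery = new PrefixQuery(new Term("email_domain_reversed", "ae."));
```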
You could add a separate field that indexes the email address reversed: index 'foo@gmail.com' as 'moc.liamg@oof'. That enables you to run the query "moc.liamg@*".
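For example (a sketch assuming the classic Lucene.Net API; email_reversed is an illustrative field name, and the leading-wildcard search becomes a plain prefix query on the reversed field):

```csharp
using System;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// At index time: store the address reversed.
string address = "foo@gmail.com";
char[] chars = address.ToCharArray();
Array.Reverse(chars);

var doc = new Document();
doc.Add(new Field("email_reversed", new string(chars), // "moc.liamg@oof"
                  Field.Store.NO, Field.Index.NOT_ANALYZED));

// At search time: "moc.liamg@*" is simply a prefix query.
Query query = new PrefixQuery(new Term("email_reversed", "moc.liamg@"));
```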
No one gave a satisfactory answer, so we started poking around the Lucene documentation and discovered we can accomplish this using custom Analyzers and Tokenizers.
The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. Once you do this, a search for "@gmail.com" will return all gmail addresses, because the Tokenizer we just created treats '@' as a word boundary, so the domain is indexed as a separate word.
Here's the source code; it's actually very simple:
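(A sketch reconstructed against the classic Lucene.Net 2.x/3.x API, where CharTokenizer exposes IsTokenChar and Analyzer exposes TokenStream:)

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Splits tokens on whitespace AND on the @ symbol.
class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Whitespace and '@' both mark a word boundary.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}

class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}
```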
That's it! Now you just need to rebuild your index and do all searches using this new Analyzer. For example, to write documents to your index:
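Something along these lines, assuming the classic IndexWriter API (indexDirectory and myDocument are placeholders for your own index location and document):

```csharp
using Lucene.Net.Index;

var writer = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer(),
                             true, IndexWriter.MaxFieldLength.UNLIMITED); // true = create a new index
writer.AddDocument(myDocument);
writer.Close();
```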
Searches should use the analyzer as well:
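A sketch, again against the classic API ("emailBody" is a placeholder field name; newer QueryParser constructors also take a Version argument first). The important part is parsing the query with the same analyzer used at index time:

```csharp
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

var searcher = new IndexSearcher(indexDirectory);
var parser = new QueryParser("emailBody", new WhitespaceAndAtSymbolAnalyzer());
Query query = parser.Parse("@gmail.com"); // tokenized to "gmail.com" by our analyzer
TopDocs hits = searcher.Search(query, 100);
```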
There is also QueryParser's setAllowLeadingWildcard option.
But be careful: it can get very expensive in terms of performance (that's why it is disabled by default). Maybe in some cases this would be an easy solution, but I would prefer a custom Tokenizer as stated by Judah Himango, too.
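For reference, a sketch of enabling it (in Lucene.Net the Java setter maps to SetAllowLeadingWildcard, or the AllowLeadingWildcard property in later versions; "email" is a placeholder field name):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

var parser = new QueryParser("email", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true); // disabled by default: a leading wildcard scans the whole term index
Query query = parser.Parse("*@gmail.com");
```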