Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:
ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...
Any suggestions for accomplishing this efficiently and effectively?
Edit: I'd like to write this in PHP.
If you have a list of valid words, you can loop over your domain string and try to cut off a valid word each time with a backtracking algorithm. If you manage to consume the whole string, you are finished. Be aware that the time complexity of this is not optimal :)
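A minimal sketch of that backtracking idea in PHP, assuming you have already loaded a word list into `$dictionary` (the word list and example domains here are just placeholders):

```php
<?php
// Backtracking word-break sketch: the word list is stored as array keys
// so lookups are O(1).
function segment(string $s, array $dictionary, array $words = []): ?array
{
    if ($s === '') {
        return $words;                               // whole string consumed: success
    }
    // Try the longest candidate prefix first so "francisco" is preferred
    // over shorter fragments like "fran".
    for ($len = strlen($s); $len >= 1; $len--) {
        $prefix = substr($s, 0, $len);
        if (isset($dictionary[$prefix])) {
            $rest = segment(substr($s, $len), $dictionary, array_merge($words, [$prefix]));
            if ($rest !== null) {
                return $rest;                        // the remainder segmented too
            }
            // otherwise backtrack and try a shorter prefix
        }
    }
    return null;                                     // dead end: no valid split
}

// Example with a tiny hand-made word list:
$dictionary = array_flip(['i', 'like', 'cheese', 'san', 'francisco', 'hotels']);
print_r(segment('ilikecheese', $dictionary));        // i, like, cheese
print_r(segment('sanfranciscohotels', $dictionary)); // san, francisco, hotels
```

With a real dictionary you will also hit ambiguous splits, so you generally want to rank the candidates rather than stop at the first one.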
As a simple start, try pspell. You might want to compare the results and check whether you got the stem of a word without the "s" at the end, and merge those cases.
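A quick sketch of what that could look like, assuming the pspell extension and an aspell English dictionary are installed:

```php
<?php
// Validate candidate substrings against the system's English dictionary.
$speller = pspell_new('en');

$candidates = ['cheese', 'cheeses', 'hotel', 'hotels', 'xqzv'];
foreach ($candidates as $word) {
    if (pspell_check($speller, $word)) {
        echo "$word looks like a valid word\n";
    }
}
// To merge singular/plural duplicates, strip a trailing "s" and keep
// whichever form pspell still accepts.
```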
OK, I ran the script I wrote for this SO question, with a couple of minor changes: using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.
For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, just grabbed all English-language files from etext00, etext01, and etext02.
Below are the results; I saved the top three for each combination.
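For readers who don't follow the link, here is a minimal sketch of the log-probability idea in PHP (not the answerer's actual script; `$counts` and `$total` are assumed to come from counting words in your corpus):

```php
<?php
// Log probability of a single word under a unigram model.
function logProb(string $word, array $counts, int $total): float
{
    $count = $counts[$word] ?? 0.5;   // crude smoothing so unseen words don't kill a split
    return log($count / $total);
}

// Summing log probabilities instead of multiplying raw probabilities is
// what avoids floating-point underflow on long segmentations.
function scoreSegmentation(array $words, array $counts, int $total): float
{
    $score = 0.0;
    foreach ($words as $word) {
        $score += logProb($word, $counts, $total);
    }
    return $score;
}
```

Candidate splits (e.g. from a backtracking routine like the one above) can then be ranked by this score, keeping the top few for each domain.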
Might want to check out this SO question.
You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to confirm it consists of valid words.
choosespain.com
kidsexpress.com
childrenswear.com
dicksonweb.com
Have fun (and a good lawyer) if you are going to try to parse the URL with a dictionary.
You might do better if you can find the same words, separated by whitespace, on the site itself.
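A rough sketch of that idea, with a hypothetical example domain and no real error handling: fetch the homepage, pull out its visible words, and look for a run of consecutive words whose concatenation equals the domain label.

```php
<?php
$label = 'sanfranciscohotels';                           // example label (assumption)
$html  = @file_get_contents('http://' . $label . '.com/');

if (is_string($html)) {
    $text = strtolower(strip_tags($html));
    preg_match_all('/[a-z]+/', $text, $m);
    $words = $m[0];

    $n = count($words);
    for ($i = 0; $i < $n; $i++) {
        $joined = '';
        for ($j = $i; $j < $n && strlen($joined) < strlen($label); $j++) {
            $joined .= $words[$j];
            if ($joined === $label) {
                // Found the label written as separate words, e.g. "san francisco hotels".
                echo implode(' ', array_slice($words, $i, $j - $i + 1)), "\n";
                break 2;
            }
        }
    }
}
```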
Other possibilities: extract data from the SSL certificate; query the top-level domain (TLD) name servers; or use one of the "whois" tools or services (just Google "whois").
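A quick sketch of the whois route, assuming the `whois` command-line tool is installed on the server; registrant or organization fields sometimes spell the name out with spaces:

```php
<?php
$domain = 'sanfranciscohotels.com';                      // example domain (assumption)
$output = shell_exec('whois ' . escapeshellarg($domain));

if (is_string($output)) {
    foreach (explode("\n", $output) as $line) {
        // Field names vary by registrar, so this is only a heuristic.
        if (stripos($line, 'Registrant') !== false || stripos($line, 'Organization') !== false) {
            echo trim($line), "\n";
        }
    }
}
```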