Programmatically extract keywords from domain name

Posted 2019-03-09 14:30

Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:

ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...

Any suggestions for accomplishing this efficiently and effectively?

Edit: I'd like to write this in PHP.

7 Answers
Answer 2 · 2019-03-09 14:34

If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time with a backtracking algorithm. If you manage to use up the whole string, you are finished. Be aware that the time complexity of this is not optimal :)
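
Roughly, a backtracking sketch of that idea might look like this (my own illustration, not part of the answer). The $dictionary lookup table is a placeholder for whatever word list you actually load, e.g. from /usr/share/dict/words:

function segment( $s, array $dictionary, array $found = array() ) {
    if ( $s === '' ) {
        return $found;                       // consumed the whole string
    }
    // try the longest prefix first, back off on failure
    for ( $len = strlen( $s ); $len >= 1; $len-- ) {
        $prefix = substr( $s, 0, $len );
        if ( isset( $dictionary[$prefix] ) ) {
            $rest = segment( substr( $s, $len ), $dictionary, array_merge( $found, array( $prefix ) ) );
            if ( $rest !== false ) {
                return $rest;                // this branch consumed everything
            }
        }
    }
    return false;                            // backtrack: no valid word starts here
}

// hypothetical word list just for the example; use your own
$dictionary = array_fill_keys( array( 'san', 'francisco', 'hotels', 'i', 'like', 'cheese' ), true );
print_r( segment( 'sanfranciscohotels', $dictionary ) );
// prints an array: san, francisco, hotels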

Answer 3 · ら.Afraid · 2019-03-09 14:39
function getwords( $string ) {
    // skip punycode (IDN) labels such as "xn--..."
    if( strpos( $string, 'xn--' ) !== false ) {
        return false;
    }
    // drop hyphens, then test every substring of length 4+ against pspell
    $string = trim( str_replace( '-', '', $string ) );
    $pspell = pspell_new( 'en' );
    $check = array();
    $words = array();
    for( $j = 0; $j < ( strlen( $string ) - 5 ); $j++ ) {
        for( $i = 4; $i < strlen( $string ); $i++ ) {
            if( pspell_check( $pspell, substr( $string, $j, $i ) ) ) {
                // count hits per start offset (avoids an undefined-index notice)
                $check[$j] = isset( $check[$j] ) ? $check[$j] + 1 : 1;
                $words[] = substr( $string, $j, $i );
            }
        }
    }
    $words = array_unique( $words );
    if( count( $check ) > 0 ) {
        return $words;
    }
    return false;
}

print_r( getwords( 'ilikecheesehotels' ) );

Array
(
    [0] => like
    [1] => cheese
    [2] => hotel
    [3] => hotels
)

That's a simple start with pspell. You might want to compare the results to see whether you got both the stem of a word and the form with the "s" at the end, and merge them.
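
One way to do that merge, as a rough sketch of my own (the rule here simply drops a word when its plural is also present; adjust to taste):

function merge_plurals( array $words ) {
    $result = array();
    foreach ( $words as $word ) {
        // keep the plural, drop the bare stem when both are present
        if ( ! in_array( $word . 's', $words, true ) ) {
            $result[] = $word;
        }
    }
    return $result;
}

print_r( merge_plurals( array( 'like', 'cheese', 'hotel', 'hotels' ) ) );
// prints an array: like, cheese, hotels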

Answer 4 · 三岁会撩人 · 2019-03-09 14:41

Ok, I ran the script I wrote for this SO question, with a couple of minor changes: using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

For my corpus I downloaded a bunch of files from Project Gutenberg. No real method to this; I just grabbed all the English-language files from etext00, etext01, and etext02.

Below are the results; I saved the top three for each combination.

expertsexchange: 97 possibilities
 -  experts exchange -23.71
 -  expert sex change -31.46
 -  experts ex change -33.86

penisland: 11 possibilities
 -  pen island -20.54
 -  penis land -22.64
 -  pen is land -25.06

choosespain: 28 possibilities
 -  choose spain -21.17
 -  chooses pain -23.06
 -  choose spa in -29.41

kidsexpress: 15 possibilities
 -  kids express -23.56
 -  kid sex press -32.65
 -  kids ex press -34.98

childrenswear: 34 possibilities
 -  children swear -19.85
 -  childrens wear -25.26
 -  child ren swear -32.70

dicksonweb: 8 possibilities
 -  dickson web -27.09
 -  dick son web -30.51
 -  dicks on web -33.63
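
Since the question asks for PHP, here is a rough sketch of the same log-probability idea rather than the original script (my own approximation). $freq stands in for a word-to-count map you would build from your own corpus, and unseen words just get a fixed penalty instead of a smoothed estimate:

function best_split( $s, array $freq, $total, array &$memo = array() ) {
    if ( $s === '' ) {
        return array( 0.0, array() );
    }
    if ( isset( $memo[$s] ) ) {
        return $memo[$s];
    }
    $best = array( -INF, array( $s ) );
    for ( $len = 1; $len <= strlen( $s ); $len++ ) {
        $word = substr( $s, 0, $len );
        // unseen words get a harsh fixed penalty so names still split somehow
        $logp = isset( $freq[$word] ) ? log( $freq[$word] / $total ) : -25.0;
        list( $restScore, $restWords ) = best_split( substr( $s, $len ), $freq, $total, $memo );
        if ( $logp + $restScore > $best[0] ) {
            $best = array( $logp + $restScore, array_merge( array( $word ), $restWords ) );
        }
    }
    return $memo[$s] = $best;
}

// toy counts just for the example; real counts come from your corpus
$freq  = array( 'choose' => 500, 'spain' => 120, 'chooses' => 40, 'pain' => 300, 'spa' => 80, 'in' => 9000 );
$total = array_sum( $freq );
list( $score, $words ) = best_split( 'choosespain', $freq, $total );
echo implode( ' ', $words ), ' ', round( $score, 2 ), "\n";   // choose spain -7.43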
Answer 5 · 劳资没心，怎么记你 · 2019-03-09 14:42

Might want to check out this SO question.

Answer 6 · 唯我独甜 · 2019-03-09 14:50

You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to ensure the result consists of valid words.
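
As a sketch of that second pass (my own illustration), once you have a candidate split you can re-check every piece with the same dictionary engine before accepting it:

// requires the pspell extension and an English dictionary, as in the answer above
function all_valid_words( array $candidate, $pspell ) {
    foreach ( $candidate as $word ) {
        if ( ! pspell_check( $pspell, $word ) ) {
            return false;   // one bad token invalidates the whole split
        }
    }
    return true;
}

$pspell = pspell_new( 'en' );
var_dump( all_valid_words( array( 'like', 'cheese' ), $pspell ) );   // bool(true) if both words are in the dictionary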

Answer 7 · The star · 2019-03-09 14:53

choosespain.com kidsexpress.com childrenswear.com dicksonweb.com

Have fun (and a good lawyer) if you are going to try to parse the URL with a dictionary.

You might do better if you can find the same characters but separated by white space on their web site.

Other possibilities: extract data from the SSL certificate; query the top-level domain (TLD) name server; access the domain's own name server; or use one of the "whois" tools or services (just Google "whois").
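
A rough sketch of the "same characters separated by white space on their web site" idea, assuming allow_url_fopen is enabled and using a naive TLD strip (an illustration only, not a robust crawler):

function words_from_site( $domain ) {
    // naive TLD strip: "sanfranciscohotels.com" -> "sanfranciscohotels"
    $label = strtolower( preg_replace( '/\.[a-z]+$/i', '', $domain ) );
    $html  = @file_get_contents( 'http://' . $domain . '/' );   // needs allow_url_fopen
    if ( $html === false ) {
        return false;
    }
    $text = strtolower( strip_tags( $html ) );
    preg_match_all( '/[a-z]+/', $text, $m );
    $tokens = $m[0];
    // look for a run of words that concatenates back to the domain label
    $n = count( $tokens );
    for ( $i = 0; $i < $n; $i++ ) {
        $joined = '';
        for ( $j = $i; $j < $n && strlen( $joined ) < strlen( $label ); $j++ ) {
            $joined .= $tokens[$j];
            if ( $joined === $label ) {
                return array_slice( $tokens, $i, $j - $i + 1 );
            }
        }
    }
    return false;
}

// e.g. words_from_site( 'sanfranciscohotels.com' ) might return an array
// of san, francisco, hotels if the site's copy contains "San Francisco Hotels"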
