PHP: word proximity script?

Posted 2020-05-09 09:37

Question:

Okay, so I've spent ages searching Google, and even ran a few specific searches at hotscripts etc., several PHP forums and this place ... nothing (nothing of use, anyway).

I want to be able to take a block of text (page/file/doc) and pull it apart to find the "distance" between specific terms (the proximity, relational distance, etc.).

I would have thought there'd be at least a few such things around, but I'm not finding them. So it may be harder than I thought. I understand it may be a somewhat "hungry" endeavour, as it's likely to be fairly intensive on large documents, but surely it is possible?

In fact, whilst looking around, the majority of references I find (apart from lame, repetitive SEO sites) seem to suggest advanced linguistic studies, strange/advanced packages to install on a server, etc.

Am I to assume that "proximity" is in fact a highly complex issue that will require serious resources and an awful lot of development? Honestly, in my mind it seems somewhat moderate, so I'm wondering exactly what it is I'm missing. (Note: moderate in a relative sense ... I would place it between easy (density/count) and difficult (word stemming/base/thesaurusing).)

So - references/suggestions/ideas/thoughts???

Answer 1:

I also thought of Hamming distance, as Felix Kling commented. Maybe you can make some variant where you encode your words as specific codewords and then check their distances in an array that holds those codewords.

So if you have array[11, 02, 85, 37, 11], you can easily find that 11 has a maximum distance of 4 in this array (its two occurrences sit at indexes 0 and 4).

Don't know if this would work for you, but I think I would do it in a similar manner.
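
The array idea above can be sketched in PHP as follows. This is only a minimal illustration, not the answerer's actual code: it uses the words themselves as the "codewords", and the helper names wordPositions() and minWordDistance() are made up for this example. Lowercasing uses strtolower(), so it assumes mostly ASCII text.

```php
<?php
// Hypothetical helper: map each lowercased word to the list of
// positions (0-based word indexes) at which it occurs in $text.
function wordPositions(string $text): array
{
    // Split on anything that is not a letter, a digit or an apostrophe.
    $tokens = preg_split('/[^\p{L}\p{N}\']+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $positions = [];
    foreach ($tokens as $index => $token) {
        $positions[$token][] = $index;
    }
    return $positions;
}

// Hypothetical helper: smallest gap, in words, between any occurrence
// of $a and any occurrence of $b. Returns null if either is missing.
function minWordDistance(array $positions, string $a, string $b): ?int
{
    $a = strtolower($a);
    $b = strtolower($b);
    if (!isset($positions[$a], $positions[$b])) {
        return null;
    }
    $best = null;
    foreach ($positions[$a] as $i) {
        foreach ($positions[$b] as $j) {
            $gap = abs($i - $j);
            if ($best === null || $gap < $best) {
                $best = $gap;
            }
        }
    }
    return $best;
}
```

With the text 'the quick brown fox jumps over the lazy dog', minWordDistance(wordPositions($text), 'fox', 'dog') gives 5. For the array in the answer, wordPositions('11 02 85 37 11')['11'] holds the positions 0 and 4, so the maximum distance for 11 is 4. The nested loop is quadratic in the number of occurrences, which may matter on large documents.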



Answer 2:

If you are talking about comparing specific words, you will want to look at the SOUNDEX function of MySQL (I will assume you may be using MySQL). It gives you a key representing how a word sounds, so you can select the words that sound like a given one:

SELECT `word` FROM `list_of_words` WHERE SOUNDEX(`word`) = SOUNDEX('{TEST_WORD}');

Then, when you get your list of words (most likely you will get quite a few), you can check the distance between those words and yours to find the one that is CLOSEST (or the group of closest words, depending on how you write your code).

$word = '{WORD TO CHECK}';
$threshold = 4; // the smaller the distance, the closer the word
$similar_word = null; // stays null if no word is closer than the threshold
foreach ($word_results as $comparison_word) {
   $distance = levenshtein($comparison_word, $word);
   if ($distance < $threshold) {
      $threshold = $distance;
      $similar_word = $comparison_word;
   }
}
echo $similar_word;
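
If the candidate words are already in a PHP array rather than a MySQL table, the same two steps might be done with PHP's built-in soundex() and levenshtein() functions. The following is a rough sketch under that assumption; the function name closestSoundalike() and its $maxDistance parameter are invented for this example.

```php
<?php
// Hypothetical helper: among $candidates that share the soundex() key
// of $word, return the one with the smallest Levenshtein distance.
// Returns null if no candidate is within $maxDistance edits.
function closestSoundalike(string $word, array $candidates, int $maxDistance = 4): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    $key = soundex($word);
    foreach ($candidates as $candidate) {
        // Step 1: keep only words that sound like $word.
        if (soundex($candidate) !== $key) {
            continue;
        }
        // Step 2: among those, keep the one with the fewest edits.
        $distance = levenshtein($candidate, $word);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    return $best;
}
```

For example, closestSoundalike('proximity', ['proximate', 'proxy', 'distance']) picks 'proximate'. Note that this measures how similar two words are, not how far apart they sit in a text.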

Hope that helps you find the direction you are looking for.

Happy coding!



Answer 3:

Your example searched for Word1 ... Word2; should Word2 ... Word1 also be matched? A simple solution is to use a regex:

i.e.:

  1. use regex: \bWord1\b(.*)\bWord2\b
  2. in the first match group, use space (or whatever boundary) to split it into an array, and count

This is the most straightforward method, but definitely not the best one (performance-wise). I think you need to clarify your needs if you want a more specific answer.
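
The two regex steps could look roughly like this in PHP. It is a sketch, not a tuned implementation: the function name wordsBetween() is hypothetical, and it makes a few choices of its own, namely a lazy (.*?) so the match stops at the nearest Word2, case-insensitive matching, and whitespace as the word boundary for counting. It only handles the Word1 ... Word2 order.

```php
<?php
// Hypothetical helper: count the words between the first occurrence of
// $first and the nearest following occurrence of $second.
// Returns null when the pair does not occur in that order.
function wordsBetween(string $text, string $first, string $second): ?int
{
    // preg_quote() keeps special characters in the search terms literal.
    $pattern = '/\b' . preg_quote($first, '/') . '\b(.*?)\b'
             . preg_quote($second, '/') . '\b/is';
    if (preg_match($pattern, $text, $match) !== 1) {
        return null;
    }
    // Split the captured group on whitespace and count the pieces.
    $between = preg_split('/\s+/', trim($match[1]), -1, PREG_SPLIT_NO_EMPTY);
    return count($between);
}
```

With the text 'the quick brown fox jumps over the lazy dog', wordsBetween($text, 'quick', 'jumps') gives 2 and wordsBetween($text, 'jumps', 'quick') gives null. To match either order, one option is to call it twice with the arguments swapped.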

Update:

After the two questions were merged, I see other answers mentioning Soundex, Levenshtein and Hamming distance, etc. I would suggest that theclueless1 CLARIFY the requirements so that people can give useful help. If this is an application related to searching or document clustering, I also suggest you take a look at mature full-text indexing/searching solutions such as Sphinx or Lucene. I think either of them can be used with PHP.