Speed up text comparisons (feature vectors) with s

2019-05-31 23:35发布

问题:

I have a function which takes two arrays containing the tokens/words of two texts and gives out the cosine similarity value which shows the relationship between both texts.

The function takes an array $tokensA (0=>house, 1=>bike, 2=>man) and an array $tokensB (0=>bike, 1=>house, 2=>car) and calculates the similarity which is given back as a floating point value.

function cosineSimilarity($tokensA, $tokensB) {
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.

Is it possible to use MySQL's spatial columns to reduce the number of comparisons?

I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.

MySQL's spatial columns

  • You create spatial columns with: CREATE TABLE abc (clmnName TYPE)
  • possible types are listed here
  • here is how I select the data later [e.g. MultiPointFromText() or AsText()]
  • You insert values like this: INSERT INTO clmnName VALUES (GeomFromText('POINT(1 1)'))

But how do you use this for my problem?

PS: I'm looking for ways to reduce the number of comparisons with algorithms in this question. Vinko Vrsalovic told me that I should open another question for the spatial features.

回答1:

While R-Trees in general can index data with arbitrary number of dimensions, MySQL spatial abilities are only limited to Geometry types (2 dimensions).

If your vectors are 2-dimensional and you can normalize them, then do the following:

  • Split the circle into twice the number of angles which fit your differences
  • Find the MBR of vectors with given cosine difference from the center of each sector
  • Find all vectors within the MBR
  • Do the fine filtering for exact difference.

In this case, however, it will be better just to precaculate the angle of the value and index it with a plain B-Tree index.



回答2:

In fact you have only 75 * 74 / 2 = 2775 comparisons. You compare every word with 74 others, but you don't need to compare word1 with word2 and again word2 with word1. So it gives half of comparisons less.