Match strings from one numpy array to another

Published 2020-06-21 06:03

Question:

Hi, I am working with Python 3 and I've been facing this issue for a while now; I can't seem to figure it out.

I have two numpy arrays containing strings:

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

As you may notice, array_one actually contains the 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams of the sentence alice in a wonder land.

I have purposefully split wonderland into two words, wonder and land.
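For reference, here is one way such an array could be built (my own sketch, assuming nltk is available for its ngrams helper):

import numpy as np
from nltk import ngrams

# build every 1-gram through 5-gram of the sentence, in order
sentence = 'alice in a wonder land'.split()
array_one = np.array([' '.join(g)
                      for n in range(1, len(sentence) + 1)
                      for g in ngrams(sentence, n)])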

Now I have another numpy array that contains some locations and names.

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

Now what I want to do is get all the elements in array_one that exist in array_two.

If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it is a single word.
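A minimal check reproduces this:

import numpy as np

# exact string intersection finds nothing, since no element matches verbatim
print(np.intersect1d(array_one, array_two))  # -> empty array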

Is there any way to do this? I've tried solutions from Stack Overflow (this) but they don't seem to work with Python 3.

array_one would have at most 60-100 items, while array_two would have at most roughly 1 million items, averaging 250,000-500,000 items.


Edit

Since I wasn't able to find a solution until now, I've used a very naive approach: I removed the whitespace from the strings in both arrays, computed a boolean mask (e.g. [True, False, True]) with np.in1d, and used it to filter the original array. Below is the code:

import numpy.core.defchararray as np_f
import numpy as np

# strip all spaces so 'wonder land' and 'wonderland' compare equal
array_two_wr = np_f.replace(array_two, ' ', '')
array_one_wr = np_f.replace(array_one, ' ', '')
# keep the array_two entries whose space-free form appears
# among the space-free forms of array_one
intersections = array_two[np.in1d(array_two_wr, array_one_wr)]

But I am not sure this is the way to go, considering the number of elements in array_two.
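As a rough sanity check, the same approach can be timed on synthetic arrays at the sizes mentioned above (stand-in data of my own, not the real arrays):

import time
import numpy as np
import numpy.core.defchararray as np_f

# synthetic stand-ins at roughly the scale described above
array_two_big = np.array(['item %d' % i for i in range(500000)])
array_one_small = np.array(['item %d' % i for i in range(100)])

start = time.time()
hits = array_two_big[np.in1d(np_f.replace(array_two_big, ' ', ''),
                             np_f.replace(array_one_small, ' ', ''))]
print(len(hits), 'matches in %.2f seconds' % (time.time() - start))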

Answer 1:

Sorry to post two answers, but after writing up the locality-sensitive-hashing technique in my other answer below, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.

A bloom filter is a beautiful object that lets you pass in some objects, then query to see whether a given object has been added to the bloom filter. Here's an awesome visual demo of a bloom filter.

In your case we can add each member of array_two to the bloom filter, then query each member of array_one to see whether it's in the bloom filter. Using pip install bloom-filter:

from bloom_filter import BloomFilter  # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Remove all whitespace from a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize a bloom filter sized for the expected number of entries
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add the whitespace-free form of each member of array_two
for i in array_two:
  bloom.add(clean(i))
# keep the members of array_one whose cleaned form is in the filter
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements (keep in mind that a bloom filter can return false positives at roughly the configured error_rate), this could be a very efficient and highly scalable solution.
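For the stated scale of array_two, the filter just needs to be sized up front; the parameters below are illustrative values, not tuned ones:

# size the filter for ~1 million entries; a lower error_rate means
# fewer false positives at the cost of more memory
bloom = BloomFilter(max_elements=1000000, error_rate=0.001)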



Answer 2:

Minhashing could definitely be used here. Here's the very general idea: for each object in a list, hash the object multiple times, keeping track of the hashes computed for each list member. Then examine the set of resulting hashes and, for each one, find all objects for which that hash was computed (we stored this data along the way). Objects that share a hash will be very similar, provided the hashing function is chosen carefully.

For a more detailed explanation of minhashing, see Chapter 3 of Mining Massive Datasets.
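To see the core idea in isolation: two strings that share many character 3-grams yield minhashes with a high estimated Jaccard similarity. A small sketch using datasketch, with the grams written out by hand:

from datasketch import MinHash

m1, m2 = MinHash(num_perm=256), MinHash(num_perm=256)
# character 3-grams of 'wonder'
for g in ['won', 'ond', 'nde', 'der']:
  m1.update(g.encode('utf8'))
# character 3-grams of 'wonderland'
for g in ['won', 'ond', 'nde', 'der', 'erl', 'rla', 'lan', 'and']:
  m2.update(g.encode('utf8'))
# estimates the true Jaccard similarity: 4 shared grams / 8 total = 0.5
print(m1.jaccard(m2))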

Here's a sample Python 3 implementation using your data and datasketch (pip install datasketch), which computes the hashes:

import numpy as np
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

def build_minhash(s):
  '''Given a string `s` build and return a minhash for that string'''
  new_minhash = MinHash(num_perm=256)
  # hash each 3-character gram in `s`
  for chargram in ngrams(s, 3):
    new_minhash.update(''.join(chargram).encode('utf8'))
  return new_minhash

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# create a structure that lets us query for similar minhashes
lsh = MinHashLSH(threshold=0.3, num_perm=256)

# loop over the index and value of each member in array two
for idx, i in enumerate(array_two):
  # add the minhash to the lsh index
  lsh.insert(idx, build_minhash(i))

# find the items in array_one with 1+ matches in array_two
for i in array_one:
  result = lsh.query(build_minhash(i))
  if result:
    matches = ', '.join([array_two[j] for j in result])
    print(' *', i, '--', matches)

Results (array_one member on left, array_two match on right):

 * wonder -- wonderland
 * a wonder -- wonderland
 * wonder land -- wonderland
 * a wonder land -- wonderland
 * in a wonder land -- wonderland
 * alice in a wonder land -- wonderland

The easiest way to tune for precision/recall here is to change the threshold argument to MinHashLSH. You can also try modifying the hashing technique itself. Here I used character 3-grams when building each minhash, a technique Yale's Digital Humanities Lab found tremendously powerful at capturing text similarity: https://github.com/YaleDHLab/intertext
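For example, rebuilding the index with a stricter threshold (0.7 here is an illustrative value, not a tuned one) returns only the closest matches:

# a stricter index: only pairs with estimated Jaccard >= ~0.7 are returned
strict_lsh = MinHashLSH(threshold=0.7, num_perm=256)
for idx, i in enumerate(array_two):
  strict_lsh.insert(idx, build_minhash(i))

for i in array_one:
  if strict_lsh.query(build_minhash(i)):
    print(i)  # fewer, closer matches than with threshold=0.3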



Tags: python numpy