Scala: Find edit-distance for all elements of a li

2019-09-15 07:16发布

问题:

In my previous question I was asking for advice on an algorithm to compare all elements in huge list: Scala: Compare all elements in a huge list

A more general problem I am facing and would be grateful to get some advice is to do approximate comparison of list elements for a list not fitting into memory all at once. I am building this list from SQL request that returns a cursor to iterate a single string field of about 70 000 000 records. I need to find edit-distance (http://en.wikipedia.org/wiki/Edit_distance) between every two string elements in this list.

My idea here is to use sliding window of N records to compare all 70 000 000 records:

  1. Read N elements into a list that nicely fit into memory (N ~ 10 000)
  2. Calculate edit-distance for all elements in this list using algorithm described here: Scala: Compare all elements in a huge list
  3. Read next N elements (from N to 2N-1) into a new list. Compare all these as in 2.
  4. Rewind SQL query cursor to the first record
  5. Compare every string starting from index 0 to N with all strings in this new list using the same algorithm as in 2.
  6. Slide window to read strings form 2N to 3N-1 records into a new list
  7. Compare every string starting from index 0 to 2N with all strings in this new list using the same algorithm as in 2.
  8. etc.

All comparison results I need to write into DB as (String, String, Distance) records where first two elements are strings to match and third is a result.

Questions:

  1. How to force Scala to garbage collect unneeded lists from the previous steps of this algorithm?
  2. This algorithm is awful in terms of number of calculations required to do the job. Any other algorithms, ideas on how to reduce complexity?

Thanks!