In my previous question I was asking for advice on an algorithm to compare all elements in huge list: Scala: Compare all elements in a huge list
A more general problem I am facing and would be grateful to get some advice is to do approximate comparison of list elements for a list not fitting into memory all at once. I am building this list from SQL request that returns a cursor to iterate a single string field of about 70 000 000 records. I need to find edit-distance (http://en.wikipedia.org/wiki/Edit_distance) between every two string elements in this list.
My idea here is to use sliding window of N records to compare all 70 000 000 records:
- Read N elements into a list that nicely fit into memory (N ~ 10 000)
- Calculate edit-distance for all elements in this list using algorithm described here: Scala: Compare all elements in a huge list
- Read next N elements (from N to 2N-1) into a new list. Compare all these as in 2.
- Rewind SQL query cursor to the first record
- Compare every string starting from index 0 to N with all strings in this new list using the same algorithm as in 2.
- Slide window to read strings form 2N to 3N-1 records into a new list
- Compare every string starting from index 0 to 2N with all strings in this new list using the same algorithm as in 2.
- etc.
All comparison results I need to write into DB as (String, String, Distance) records where first two elements are strings to match and third is a result.
Questions:
- How to force Scala to garbage collect unneeded lists from the previous steps of this algorithm?
- This algorithm is awful in terms of number of calculations required to do the job. Any other algorithms, ideas on how to reduce complexity?
Thanks!