longest common substring between 2 HUGE files - ou

2019-06-14 06:03发布

问题:

I'm completely brain fried after this, I need to find the longest common substring between 2 files, a small one and a HUGE one. I don't even know where to start to begin the search, heres what I have so far

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MyString
{
    public static void main (String[] args) throws IOException
    {
        BufferedReader br = new BufferedReader(new FileReader("MobyDick.txt"));
        BufferedReader br2 = new BufferedReader(new FileReader("WarAndPeace.txt"));
        String md, wp;
        StringBuilder s = new StringBuilder();
        while ((md = br.readLine()) != null)
        {
            s.append(md).append(" ");
        }
        md = s + "";
        s.setLength(0);
        while ((wp = br2.readLine()) != null)
        {
            s.append(wp).append(" ");
        }
        wp = s + "";
        s.setLength(0);

        md = md.replaceAll("\\s+", " "); //rids of double spaces
        wp = wp.replaceAll("\\s+", " "); //rids of double spaces
    }
}

what i did so far was to put each file into a string builder, and then into a string to rid of double spaces (it came up a lot on MobyDick.txt). I found this code

public static String longestSubstring(String str1, String str2) {

StringBuilder sb = new StringBuilder();
if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
  return "";

// ignore case
str1 = str1.toLowerCase();
str2 = str2.toLowerCase();

// java initializes them already with 0
int[][] num = new int[str1.length()][str2.length()];
int maxlen = 0;
int lastSubsBegin = 0;

for (int i = 0; i < str1.length(); i++) {
for (int j = 0; j < str2.length(); j++) {
if (str1.charAt(i) == str2.charAt(j)) {
if ((i == 0) || (j == 0))
   num[i][j] = 1;
else
   num[i][j] = 1 + num[i - 1][j - 1];

if (num[i][j] > maxlen) {
  maxlen = num[i][j];
  // generate substring from str1 => i
  int thisSubsBegin = i - num[i][j] + 1;
  if (lastSubsBegin == thisSubsBegin) {
     //if the current LCS is the same as the last time this block ran
     sb.append(str1.charAt(i));
  } else {
     //this block resets the string builder if a different LCS is found
     lastSubsBegin = thisSubsBegin;
     sb = new StringBuilder();
     sb.append(str1.substring(lastSubsBegin, i + 1));
  }
  }
  }
  }}

  return sb.toString();
  }

this code helps but only on small files, every time I run it with the big files I get a "out of memory: java heap space" error. I need the right algorithm to get away from the heap space issue, and no i cant increase java memory, can anyone help or point me in the right direction?

回答1:

First you need to identify exactly why this is such a memory hog, and then you can start to work around it.

This declaration jumps out as a potential problem:

int[][] num = new int[str1.length()][str2.length()];

War and Peace is over 3 million characters long, and Moby Dick is about half the length of it so we will conservatively say it's a million characters long.

You are trying to allocate space for 3,000,000,000,000 integers, each of which is 4 bytes, which works out to be 12,000,000,000,000 bytes or a bit under 11 TB.

Hopefully it's clear why the algorithm is not suited for strings of this length.

Thankfully, one of the principle theories in computer science is that you can always trade time for memory and vice versa.

Instead you want to try a generalized suffix tree. This has a memory cost of \Theta(n + m) and can be constructed in \Theta(n + m) which is much more manageable.

Here is an excellent guide to a O(n) algorithm to generate such trees.

Once you have the suffix tree in place, finding the LCS can be done in constant time by finding the deepest node in the tree whose subtree contains a substring of both input strings. A typical strategy is to mark all nodes 'v' with a flag 'i' if they satisfy the property:

The subtree with root v contains a substring of the string S_i

and then find the deepest node v where v is marked as i for all i in the range (in this case, just 0 and 1).