longest common subsequence: why is this wrong?

2019-07-20 02:25发布

问题:

int lcs(char * A, char * B)
{
  int m = strlen(A);
  int n = strlen(B);
  int *X = malloc(m * sizeof(int));
  int *Y = malloc(n * sizeof(int));
  int i;
  int j;
  for (i = m; i >= 0; i--)
    {
      for (j = n; j >= 0; j--)
        {
          if (A[i] == '\0' || B[j] == '\0') 
              X[j] = 0;
          else if (A[i] == B[j]) 
              X[j] = 1 + Y[j+1];
          else 
              X[j] = max(Y[j], X[j+1]);
        }
      Y = X;
    }
  return X[0];
}

This works, but valgrind complains loudly about invalid reads. How was I messing up the memory? Sorry, I always fail at C memory allocation.

回答1:

The issue here is with the size of your table. Note that you're allocating space as

int *X = malloc(m * sizeof(int));
int *Y = malloc(n * sizeof(int));

However, you are using indices 0 ... m and 0 ... n, which means that there are m + 1 slots necessary in X and n + 1 slots necessary in Y.

Try changing this to read

int *X = malloc((m + 1) * sizeof(int));
int *Y = malloc((n + 1) * sizeof(int));

Hope this helps!



回答2:

Series of issues. First, as templatetypedef says, you're under-allocated.

Then, as paddy says, you're not freeing up your malloc'd memory. If you need the Y=X line, you'll need to store the original malloc'd space addresses in another set of variables so you can call free on them.

...mallocs...
int * original_y = Y;
int * original_x = X;
...body of code...
free(original_y);
free(original_x);
return X[0];

But this doesn't address your new question, which is why doesn't the code actually work?

I admit I can't follow your code (without a lot more study), but I can propose an algorithm that will work and be far more understandable. This may be somewhat pseudocode and not particularly efficient, but getting it correct is the first step. I've listed some optimizations later.

int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);


  // these hold the position in A of the longest common substring
  int longest_found_length = 0;

  // go through each substring of one of the strings (doesn't matter which, you could pick the shorter one if you want)
  char * candidate_substring = malloc(sizeof(char) * length_a + 1);
  for (int start_position = 0; start_position < length_a; start_position++) {
    for (int end_position = start_position; end_position < length_a; end_position++) {

       int substring_length = end_position - start_position + 1;

       // make a null-terminated copy of the substring to look for in the other string
       strncpy(candidate_substring, &(A[start_position]), substring_length);
       if (strstr(B, candidate_substring) != NULL) {
         longest_found_length = substring_length;
       }
    }

  }
  free(candidate_substring);
  return longest_found_length;
}

Some different optimizations you could do:

       // if this can't be longer, then don't bother checking it.  You can play games with the for loop to not have this happen, but it's more complicated.
       if (substring_length <= longest_found_index) {
         continue;
       }

and

       // there are more optimizations you could do to this, but don't check
       //   the substring if it's longer than b, since b can't contain it.
       if (substring_length > length_b) {
         continue;
       } 

and

   if (strstr(B, candidate_substring) != NULL) {
     longest_found_length = end_position - start_position + 1;
   } else {
     // if nothing contains the shorter string, then nothing can contain the longer one, so skip checking longer strings with the same starting character
     break; // skip out of inner loop to next iteration of start_position
   }

Instead of copying each candidate substring to a new string, you could do a character swap with the end_position + 1 and a NUL character. Then, after looking for that substring in b, swap the original character at end_position+1 back in. This would be much faster, but complicates the implementation a little.



回答3:

NOTE: I don't normally write two answers and if you feel that it is tacky, feel free to comment on this one and note vote it up. This answer is a more optimized solution, but I wanted to give the most straightforward one I could think of first and then put this in another answer to not confuse the two. Basically they are for different audiences.

The key to solving this problem efficiently is to not throw away information you have about shorter common substrings when looking for longer ones. Naively, you check each substring against the other one, but if you know that "AB" matches in "ABC", and your next character is C, don't check to see if "ABC" is in "ABC", just check that the spot after "AB" is a "C".

For each character in A, you have to check up to all the letters in B, but because we stop looking through B once a longer substring is no longer possible, it greatly limits the number of checks. Each time you get a longer match up front, you eliminate checks on the back-end, because it will no longer be a longer substring.

For example, if A and B are both long, but contain no common letters, each letter in A will be compared against each letter in B for a runtime of A*B.

For a sequence where there are a lot of matches, but the match length isn't a large fraction of the length of the shorter string, you have A * B combinations to check against the shorter of the two strings (A or B) leading to either A*B*A or A*B*B, which is basically O(n^3) time for similar length strings. I really thought the optimizations in this solution would be better than n^3 even though there are triple-nested for loops, but it appears to not be as best as I can tell.

I'm thinking about this some more, though. Either the substrings being found are NOT a significant fraction of the length of the strings, in which case the optimizations don't do much, but the comparisons for each combination of A*B don't scale with A or B and drop out to be constants -- OR -- they are a significant fraction of A and B and it directly divides against the A*B combinations that have to be compared.

I just may ask this in a question.

int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);

  // these hold the position in A of the longest common substring
  int longest_length_found = 0;

  // for each character in one string (doesn't matter which), look for incrementally larger strings in the other
  for (int a_index = 0; a_index < length_a - longest_length_found; a_index++) {
    for (int b_index = 0; b_index < length_b - longest_length_found; b_index++) {

       // offset into each string until end of string or non-matching character is found
      for (int offset = 0; A[a_index+offset] != '\0' && B[b_index+offset] != '\0' && A[a_index+offset] == B[b_index+offset]; offset++) {          
        longest_length_found = longest_length_found > offset ? longest_length_found : offset;
      }
    }
  }
  return longest_found_length;
}


回答4:

In addition to what templatetypedef said, some things to think about:

  • Why aren't X and Y the same size?
  • Why are you doing Y = X? That's an assignment of pointers. Did you perhaps mean memcpy(Y, X, (n+1)*sizeof(int))?