Dynamic Programming: Longest Common Subsequence

2019-08-15 06:25发布

问题:

I'm going over notes that discuss dynamic programming in the context of finding the longest common subsequence of two equal-length strings. The algorithm in question outputs the length (not the substring).

So I have two strings, say:

S = ABAZDC, T = BACBAD

Longest common subsequence is ABAD (substrings don't have to be adjacent letters)

Algorithm is as follows, where LCS[i, j] denotes longest common subsequence of S[1..i] and T[1..j]:

if S[i] != T[j] then
    LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1])
else
    LCS[i, j] = 1 + LCS[i - 1, j - 1]

My notes claim you can fill out a table where each string is written along an axis. Something like:

B A C B A D

A 0 1 1 1 1 1

B 1 1 1 2 2 2

A ...

Z

D

C

Two questions:

1) How do we actually start filling this table out. Algorithm is recursive but doesn't seem to provide a base case (otherwise I'd just call LCS[5, 5])? Notes claim you can do two simple loops with i and j and fill out each spot in constant time...

2) How could we modify the algorithm so the longest common subsequence would be of adjacent letters? My thought is that I'd have to reset the length of the current subsequence to 0 once I find that the next letter in S doesn't match the next letter in T. But it's tricky because I want to keep track of the longest seen thus far (It's possible the first subsequence I see is the longest one). So maybe I'd have an extra argument, longestThusFar, that is 0 when we call our algorithm initially and changes in subsequent calls.

Can someone make this a bit more rigorous?

Thanks!

回答1:

Firstly,the algorithm is recursive,but the implementation is always iterative.In other words,we do not explicitly,call the same function from the function itself(Recursion).

We use the table entries already filled to compensate for the recursion.

Say,you have two strings of length M.

Then a table is defined of dimensions (M+1)X(M+1).

for(i = 0 to M)
{
  LCS[0][i]=0;
}
for(i = 1 to M)
{
  LCS[i][0]=0;
}

And you get a table like

    B,A,C,B,A,D
  0,0,0,0,0,0,0
A 0
B 0
A 0
Z 0
D 0
C 0

Each zero in 0th col means that if no character of string BACBAD is considered,then length of LCS = 0.

Each zero in 0th row means that if no character of string ABAZDC is considered,then length of LCS = 0.

The rest of entries are filled using the rules as you mentioned.

for(i = 1 to M)
{
 for(j = 1 to M)
 {
  if S[i-1] != T[j-1] then
    LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1])
  else
    LCS[i, j] = 1 + LCS[i - 1, j - 1]
 }
}

Notice that its S[i-1] != T[j-1] and not S[i] != T[j] because when you fill LCS[i,j], you are always comparing i-1 th char of S and j-1 th char of T.

The length of LCS is given by LCS[M,M].

The best way to understand this is try it by hand.

In answer to your second question,You do not need to modify the algo much.

the solution lies in the table that is used to retrieve the LCS.

In order to retrieve the LCS, we make an extra table T of characters of dimensions MXM. and we modify the algo as follows.

for(i = 1 to M)
{
 for(j = 1 to M)
 {
  if S[i-1] != T[j-1] then
  {
     LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1])
     if(LCS[i - 1, j]>=LCS[i, j - 1])          
      T[i-1][j-1]='u'//meaning up
     else T[i-1][j-1]='l'//meaning left
  }
  else
  {
    LCS[i, j] = 1 + LCS[i - 1, j - 1]
    T[i-1][j-1]='d'//meaning diagonally up
  }
 }
}

Now,in order to know longest substring(OF ADJACENT LETTERS) common to both,traverse T diagonally.

The length = largest number of consecutive d's found in a diagonal.

The diagonal traversal of any square matrix NXN is done by.

Lower Triangle including the main diagonal

j=N-1
while(j>=0)
{
 i=j;k=0;
 while(i <= N-1)
 {
  entry T[i][k];
  ++i;++k
 }
 --j;
}

Upper triangle

j=1;
while(j<=N-1)
{
 i=j;k=0;
 while(i<=N-1)
 {
  entry T[k][i];
  ++k;++i;
 }
 --j;
}