All Longest Common Subsequences

2019-07-26 14:23发布

问题:

[NOTE: I searched beforehand and couldn't find advice on solving the LCS problem for all subsequences.]

I'm having trouble coding up a solution to the "longest common subsequence" problem where I return all the longest common subsequences for two input Strings. I looked at the Wikipedia page and tried to implement the pseudo-code on there, but ran into a problem with my "backtrackAll" method. I believe I'm computing the LCS matrix correctly below, but my "backtrackAll" method returns an empty set. Any tips on what I'm doing wrong?

public static void main (String[] args) {
    String s1 = "AGCAT";
    String s2 = "GAC";
    int[][] matrix = computeMatrix(s1,s2);
    HashSet<String> set = backtrackAll(matrix,s1,s2,s1.length(),s2.length());
    //more stuff would go here...
}

private static int[][] computeMatrix(String s1, String s2) {
    int[][] C = new int[s1.length()+1][s2.length()+1];
    for (int i = 1; i < s1.length()+1; i++) {
        for (int j = 1; j < s2.length()+1; j++) {
            if (s1.charAt(i-1) == s2.charAt(j-1)) {
                C[i][j] = C[i-1][j-1] + 1;
            } else {
                C[i][j] = Math.max(C[i][j-1], C[i-1][j]);
            }
        }
    }
    return C;
}

private static HashSet<String> backtrackAll(int[][] C, String s1, String s2, int i, int j) {
    if (i == 0 || j == 0) {
        return new HashSet<String>();
    } else if (s1.charAt(i-1) == s2.charAt(j-1)) {
        HashSet<String> R = backtrackAll(C,s1,s2,i-1,j-1);
        HashSet<String> new_set = new HashSet<String>();
        for (String Z: R) {
            new_set.add(Z + s1.charAt(i-1));
        }
        return new_set;
    } else {
        HashSet<String> R = new HashSet<String>();
        if (C[i][j-1] >= C[i-1][j]) {
            R = backtrackAll(C, s1, s2, i, j-1);
        } 
        if (C[i-1][j] >= C[i][j-1]) {
            R.addAll(backtrackAll(C,s1,s2,i-1,j));
        }
        return R;
    }
}

回答1:

I modified it a bit. It now works. You should also consider when a null HashSet is returned in whose case the last matched character has to be added.

private static HashSet<String> backtrackAll(int[][] C, String s1, String s2, int i, int j) {
    // System.out.println(i+" " + j);
    if (i == 0 || j == 0) {
        // System.out.println("0t");
        return new HashSet<String>();
    } else if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
        // System.out.println("2t");
        HashSet<String> R = backtrackAll(C, s1, s2, i - 1, j - 1);
        HashSet<String> new_set = new HashSet<String>();

        for (String Z : R) {
            // System.out.println("1t");
            new_set.add(Z + s1.charAt(i - 1));
        }
        new_set.add("" + s1.charAt(i - 1));
        return new_set;
    } else {
        // System.out.println("3t");
        HashSet<String> R = new HashSet<String>();
        if (C[i][j - 1] >= C[i - 1][j]) {
            R = backtrackAll(C, s1, s2, i, j - 1);
        }
        if (C[i - 1][j] >= C[i][j - 1]) {
            R.addAll(backtrackAll(C, s1, s2, i - 1, j));
        }
        return R;
    }
}


回答2:

Since this is "homework", here are a couple of tips.

  1. Make sure that you understand the algorithm that you have coded. This is probably the single most important step in figuring out what is wrong with your implementation.

  2. Try using a debugger to figure out what is going on. Compare what you think should be happening with what is actually happening.

  3. Try adding some assert statements to the code to check preconditions, postconditions and invariants that you believe should hold true. (Run with java -ea ...)

  4. Stick to the normal Java naming conventions. Variable names start with a lowercase letter. No underscores in variable names.



回答3:

The second answer prints everything but not only the longest, mine is correct.

private static HashSet<String> backtrackAll(int[][] C, String s1, String s2, int i, int j) {
    if (i == 0 || j == 0) {
        HashSet<String> set = new HashSet<String>();
        set.add("");
        return set;

    } else if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
        HashSet<String> R = backtrackAll(C, s1, s2, i - 1, j - 1);
        HashSet<String> new_set = new HashSet<String>();

        for (String Z : R) {
            new_set.add(Z + s1.charAt(i - 1));
        }
        return new_set;
    } else {
        HashSet<String> R = new HashSet<String>();
        if (C[i][j - 1] >= C[i - 1][j]) {
            R = backtrackAll(C, s1, s2, i, j - 1);
        }
        if (C[i - 1][j] >= C[i][j - 1]) {
            R.addAll(backtrackAll(C, s1, s2, i - 1, j));
        }
        return R;
    }
}


回答4:

Here are two versions in C# to get to the longest common subsequences (you may refer to: http://codingworkout.blogspot.com/2014/07/longest-common-subsequence.html)

  1. Backtracking based on cached table which contains the length of longest common subsequences

  2. While caching, instead of caching, legths, capturing the lcs itself.

Version 1 (backtracking based on lcs length of prefixes):

string[] GetLongestCommonSubsequences(string A, string B, int aIndex, int bIndex, 
        int[][] DP_LCS_AllPrefixes_Cache)
    {
        if(DP_LCS_AllPrefixes_Cache[aIndex][bIndex] == 0)
        {
            return null;
        }
        if(A[aIndex-1] == B[bIndex -1])
        {
            var r = this.GetLongestCommonSubsequences(A, B, aIndex - 1, bIndex - 1,
                DP_LCS_AllPrefixes_Cache);
            if(r == null)
            {
                return new string[] { A[aIndex - 1].ToString() };
            }
            return r.Select(s => s + A[aIndex - 1].ToString()).ToArray();
        }
        int lcs_up_direction = DP_LCS_AllPrefixes_Cache[aIndex - 1][bIndex];
        int lcs_left_direction = DP_LCS_AllPrefixes_Cache[aIndex][bIndex-1];
        if(lcs_up_direction == lcs_left_direction)
        {
            string[] lcs_up = this.GetLongestCommonSubsequences(A, B, aIndex - 1, bIndex,
                DP_LCS_AllPrefixes_Cache);
            string[] lcs_left = this.GetLongestCommonSubsequences(A, B, aIndex, bIndex-1,
                DP_LCS_AllPrefixes_Cache);
            return lcs_up.Union(lcs_left).ToArray();
        }
        if(lcs_up_direction > lcs_left_direction)
        {
            return this.GetLongestCommonSubsequences(A, B, aIndex - 1, bIndex, 
                DP_LCS_AllPrefixes_Cache);
        }
        return this.GetLongestCommonSubsequences(A, B, aIndex, bIndex - 1, DP_LCS_AllPrefixes_Cache);
    }

**where the caller of the recursive function is **

string[] GetLongestCommonSubsequences(string A, string B, int[][] DP_LCS_AllPrefixes_Cache)
        {
            var r = this.GetLongestCommonSubsequences(A, B, A.Length, B.Length,
                DP_LCS_AllPrefixes_Cache);
            return r;
        }

version 2 - where the cache captures the lcs of all prefixes

class LCS_Prefix
        {
            public int Length = 0;
            public string[] Subsequences = null;
        }
        LCS_Prefix[][] LCS_OfAllPrefixes_Subsequences(string A, string B)
        {
            A.ThrowIfNullOrWhiteSpace("a");
            B.ThrowIfNullOrWhiteSpace("b");
            LCS_Prefix[][] LCS_DP_OfAllPrefixes_Subsequences_Cache = new LCS_Prefix[A.Length + 1][];
            for (int i = 0; i < LCS_DP_OfAllPrefixes_Subsequences_Cache.Length; i++)
            {
                LCS_DP_OfAllPrefixes_Subsequences_Cache[i] = new LCS_Prefix[B.Length + 1];
                for(int j = 0; j< LCS_DP_OfAllPrefixes_Subsequences_Cache[i].Length; j++)
                {
                    LCS_DP_OfAllPrefixes_Subsequences_Cache[i][j] = new LCS_Prefix();
                }
            }
            for (int rowIndexOfCache = 1; rowIndexOfCache <= A.Length; rowIndexOfCache++)
            {
                for (int columnIndexOfCache = 1; columnIndexOfCache <= B.Length; columnIndexOfCache++)
                {
                    //LCS(Ai, Bj) = 0 if i <=0, or j <= 0
                    //              LCS(Ai, Bj) + 1 if Ai == Bj
                    //              Max(LCS(Ai-1, Bj), LCS(Ai, Bj-1))
                    LCS_Prefix lcsPrefix = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache][columnIndexOfCache];
                    if (A[rowIndexOfCache - 1] == B[columnIndexOfCache - 1])
                    {
                        var lcs_Prefix_Diagnoal = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache - 1]
                            [columnIndexOfCache - 1];
                        lcsPrefix.Length = lcs_Prefix_Diagnoal.Length + 1;
                        if (lcs_Prefix_Diagnoal.Subsequences == null)
                        {
                            lcsPrefix.Subsequences = new string[] { A[rowIndexOfCache - 1].ToString() };
                        }
                        else
                        {
                            lcsPrefix.Subsequences = lcs_Prefix_Diagnoal.Subsequences
                                .Select(s => s + A[rowIndexOfCache - 1]).ToArray();
                        }
                    }
                    else
                    {
                        LCS_Prefix prefix1_Upward = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache - 1][columnIndexOfCache];
                        var prefix2_Leftward = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache][columnIndexOfCache-1];
                        if(prefix1_Upward.Length == prefix2_Leftward.Length)
                        {
                            Assert.IsTrue(prefix1_Upward.Length == prefix2_Leftward.Length);
                            Assert.IsTrue((prefix1_Upward.Subsequences == null &&
                                            prefix2_Leftward.Subsequences == null)
                                        || (prefix1_Upward.Subsequences != null
                                            && prefix2_Leftward.Subsequences != null));
                            if (prefix1_Upward.Subsequences != null)
                            {
                                Assert.IsTrue(prefix1_Upward.Subsequences.All(s1 => prefix2_Leftward.Subsequences.Any(s2 => (s2.Length == s1.Length))));
                            }

                            lcsPrefix.Length = prefix1_Upward.Length;
                            if (prefix1_Upward.Subsequences != null)
                            {
                                lcsPrefix.Subsequences = prefix1_Upward.Subsequences
                                    .Union(prefix2_Leftward.Subsequences).ToArray();
                            }
                            else
                            {
                                Assert.IsNull(prefix2_Leftward.Subsequences);
                            }
                        }
                        else if(prefix1_Upward.Length > prefix2_Leftward.Length)
                        {
                            lcsPrefix.Length = prefix1_Upward.Length;
                            lcsPrefix.Subsequences = prefix1_Upward.Subsequences;
                        }
                        else
                        {
                            lcsPrefix.Length = prefix2_Leftward.Length;
                            lcsPrefix.Subsequences = prefix2_Leftward.Subsequences;
                        }
                    }
                }
            }
            return LCS_DP_OfAllPrefixes_Subsequences_Cache;
        }

Unit Tests

[TestMethod]
        public void LCS_Tests()
        {
            string A = "AGCAT", B = "GAC";
            var DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
            Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 2);
            var lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
            Assert.IsNotNull(lcs_sequences);
            Assert.IsTrue(lcs_sequences.Any(s => "AC".Equals(s)));
            Assert.IsTrue(lcs_sequences.Any(s => "GC".Equals(s)));
            Assert.IsTrue(lcs_sequences.Any(s => "GA".Equals(s)));
            var DP_LCS_AllPrefixes_Subsequences_Cache = this.LCS_OfAllPrefixes_Subsequences(A, B);
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Length == 2);
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
               .Any(s => "AC".Equals(s)));
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
                .Any(s => "GC".Equals(s)));
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
                .Any(s => "GA".Equals(s)));
            A = "ABCDGH"; B = "AEDFHR";
            DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
            Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 3);
            lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
            Assert.IsNotNull(lcs_sequences);
            Assert.IsTrue(lcs_sequences.Any(s => "ADH".Equals(s)));
            DP_LCS_AllPrefixes_Subsequences_Cache = this.LCS_OfAllPrefixes_Subsequences(A, B);
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Length == 3);
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
                .Any(s => "ADH".Equals(s)));
            A = "AGGTAB"; B = "GXTXAYB";
            DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
            Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 4);
            lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
            Assert.IsNotNull(lcs_sequences);
            Assert.IsTrue(lcs_sequences.Any(s => "GTAB".Equals(s)));
            DP_LCS_AllPrefixes_Subsequences_Cache = this.LCS_OfAllPrefixes_Subsequences(A, B);
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Length == 4);
            Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
                .Any(s => "GTAB".Equals(s)));
        }


标签: java lcs