Finding a supersequence of DNA strings in Java

Posted 2019-04-01 00:24

I am struggling with a "find supersequence" algorithm.

The input is a set of strings:

String A = "caagccacctacatca";
String B = "cgagccatccgtaaagttg";
String C = "agaacctgctaaatgctaga";

The result should be a properly aligned set of strings (the next step would be to merge them):

String E = "ca ag cca  cc ta    cat  c a";
String F = "c gag ccat ccgtaaa g  tt  g";
String G = " aga acc tgc  taaatgc t a ga";

Thank you for any advice (I have been stuck on this task for more than a day).

After merging, the superstring would be:

cagagaccatgccgtaaatgcattacga

The definition of a supersequence in this case would be something like:

A string R is contained in a supersequence S if and only if all characters of R are present in S in the same order in which they occur in R.
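
The definition above can be turned into a direct check. This is a minimal sketch, not part of the original attempt: the class name `SupersequenceCheck` and the helpers `isSubsequence` and `alignTo` are made up for illustration. `alignTo` also prints one input under the merged string with spaces as gaps, which is one possible way to get the "aligned" rows; it matches greedily from the left, so the gaps need not fall where the hand-made rows above put them.

```java
class SupersequenceCheck {
    // True if every character of r occurs in s in the same relative order.
    static boolean isSubsequence(String r, String s) {
        int i = 0; // index of the next unmatched character of r
        for (int j = 0; j < s.length() && i < r.length(); j++) {
            if (r.charAt(i) == s.charAt(j)) i++;
        }
        return i == r.length();
    }

    // Lays r out under s: matched characters keep their column in s,
    // every other column becomes a space. Matching is greedy (leftmost).
    static String alignTo(String r, String s) {
        StringBuilder row = new StringBuilder();
        int i = 0;
        for (int j = 0; j < s.length(); j++) {
            if (i < r.length() && r.charAt(i) == s.charAt(j)) {
                row.append(r.charAt(i++));
            } else {
                row.append(' ');
            }
        }
        if (i < r.length()) {
            throw new IllegalArgumentException("not a subsequence: " + r);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        String merged = "cagagaccatgccgtaaatgcattacga";
        String[] inputs = {"caagccacctacatca", "cgagccatccgtaaagttg", "agaacctgctaaatgctaga"};
        System.out.println(merged);
        for (String r : inputs) {
            System.out.println(alignTo(r, merged));
        }
    }
}
```

Running `main` prints the merged string followed by one gapped row per input, so the columns can be compared by eye.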


The "solution" I tried (which I know is the wrong way of doing it) is:

import java.util.HashSet;
import java.util.Stack;

public class Solution4
{
    static  boolean[][] map = null;
    static int size = 0;

    public static void main(String[] args)
    {
        String A = "caagccacctacatca";
        String B = "cgagccatccgtaaagttg";
        String C = "agaacctgctaaatgctaga";

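        Stack<String> data = new Stack<>();
        data.push(A);
        data.push(B);
        data.push(C);

        // Stack.clone() returns Object, so each copy needs an (unchecked) cast.
        Stack<String> clone1 = (Stack<String>) data.clone();
        Stack<String> clone2 = (Stack<String>) data.clone();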

        int length  =  26;
        size        =  max_size(data);

        System.out.println(size+" "+length);
        map = new boolean[26][size];

        char[] result = new char[size];

        HashSet<String> chunks = new HashSet<String>();
        while(!clone1.isEmpty())
        {
            String a = clone1.pop();

            char[] residue = make_residue(a);

            System.out.println("---");
            System.out.println("OLD     : "+a);
            System.out.println("RESIDUE : "+String.valueOf(residue));


            String[] r = String.valueOf(residue).split(" ");

            for(int i=0; i<r.length; i++)
            {
                if(r[i].equals(" ")) continue;
                //chunks.add(spaces.substring(0,i)+r[i]);
                chunks.add(r[i]);
            }
        }

        for(String chunk : chunks)
        {
            System.out.println("CHUNK   : "+chunk);
        }
    }

    static char[] make_residue(String candidate)
    {
        char[] result = new char[size];
        for(int i=0; i<candidate.length(); i++)
        {
            int pos = find_position_for(candidate.charAt(i),i);
            for(int j=i; j<pos; j++) result[j]=' ';
            if(pos==-1) result[candidate.length()-1] = candidate.charAt(i);
            else        result[pos] = candidate.charAt(i);
        }
        return result;
    }

    static int find_position_for(char character, int offset)
    {
        character-=((int)'a');

        for(int i=offset; i<size; i++)
        {
        //  System.out.println("checking "+String.valueOf((char)(character+((int)'a')))+" at "+i);
            if(!map[character][i])
            {
                map[character][i]=true;
                return i;
            }
        }
        return -1;
    }

    static String move_right(String a, int from)
    {
        return a.substring(0, from)+" "+a.substring(from);  
    }


    static boolean taken(int character, int position)
    { return map[character][position]; }

    static void take(char character, int position)
    {
        //System.out.println("taking "+String.valueOf(character)+" at "+position+" (char_index-"+(character-((int)'a'))+")");
        map[character-((int)'a')][position]=true;
    }

    // Note: this empties the stack it is given.
    static int max_size(Stack<String> stack)
    {
        int max=0;
        while(!stack.isEmpty())
        {
            String s = stack.pop();
            if(s.length()>max) max=s.length();
        }

        return max;
    }

}

2 answers
Bombasti
#2 · 2019-04-01 00:41

Finding some common supersequence is not difficult.

For your example, a possible solution would be something like:

public class SuperSequenceTest {

public static void main(String[] args) {
    String A = "caagccacctacatca";
    String B = "cgagccatccgtaaagttg";
    String C = "agaacctgctaaatgctaga";

    int iA = 0;
    int iB = 0;
    int iC = 0;

    char[] a = A.toCharArray();
    char[] b = B.toCharArray();
    char[] c = C.toCharArray();


    StringBuilder sb = new StringBuilder();

    while (iA < a.length || iB < b.length || iC < c.length) {
        if (iA < a.length && iB < b.length && iC < c.length && (a[iA] == b[iB]) && (a[iA] == c[iC])) {
            sb.append(a[iA]);
            iA++;
            iB++;
            iC++;
        }
        else if (iA < a.length && iB < b.length && a[iA] == b[iB]) {
            sb.append(a[iA]);
            iA++;
            iB++;
        }
        else if (iA < a.length && iC < c.length && a[iA] == c[iC]) {
            sb.append(a[iA]);
            iA++;
            iC++;
        }
        else if (iB < b.length && iC < c.length && b[iB] == c[iC]) {
            sb.append(b[iB]);
            iB++;
            iC++;
        } else {
            if (iC < c.length) {
                sb.append(c[iC]);
                iC++;
            }
            else if (iB < b.length) {
                sb.append(b[iB]);
                iB++;
            } else if (iA < a.length) {
                sb.append(a[iA]);
                iA++;
            }
        }
    }
    System.out.println("SUPERSEQUENCE " + sb.toString());
}

}
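
The three-index loop above is tied to exactly three strings. A hedged sketch of how it could be generalized to any number of inputs, using the majority-merge heuristic (a swapped-in technique, not the code above): at each step, append the character that is currently at the front of the most inputs and advance those inputs. The class name `MajorityMerge` and its `merge` method are made up for illustration, and the result is a valid common supersequence but not necessarily the shortest one.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MajorityMerge {
    static String merge(List<String> inputs) {
        int[] pos = new int[inputs.size()]; // next unread index per input
        StringBuilder sb = new StringBuilder();
        while (true) {
            // Count how many unfinished inputs currently start with each character.
            // A TreeMap keeps ties deterministic: the smallest character wins.
            Map<Character, Integer> votes = new TreeMap<>();
            for (int i = 0; i < pos.length; i++) {
                String s = inputs.get(i);
                if (pos[i] < s.length()) {
                    votes.merge(s.charAt(pos[i]), 1, Integer::sum);
                }
            }
            if (votes.isEmpty()) break; // every input is fully consumed

            char best = 0;
            int bestVotes = 0;
            for (Map.Entry<Character, Integer> e : votes.entrySet()) {
                if (e.getValue() > bestVotes) {
                    best = e.getKey();
                    bestVotes = e.getValue();
                }
            }

            // Emit the winner once and advance every input that was waiting for it.
            sb.append(best);
            for (int i = 0; i < pos.length; i++) {
                String s = inputs.get(i);
                if (pos[i] < s.length() && s.charAt(pos[i]) == best) pos[i]++;
            }
        }
        return sb.toString();
    }
}
```

Calling `MajorityMerge.merge(Arrays.asList(A, B, C))` with the strings from the question returns one common supersequence; like the loop above, it makes no attempt to prove that the result is minimal.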

However, the real problem to solve is the well-known Shortest Common Supersequence problem (http://en.wikipedia.org/wiki/Shortest_common_supersequence), which is much harder: it is NP-hard for an arbitrary number of input strings.

There is a lot of research on the topic.

See for instance:

http://www.csd.uwo.ca/~lila/pdfs/Towards%20a%20DNA%20solution%20to%20the%20Shortest%20Common%20Superstring%20Problem.pdf

http://www.ncbi.nlm.nih.gov/pubmed/14534185
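
For exactly two strings, a shortest common supersequence can be computed in polynomial time from the classic longest-common-subsequence table. The sketch below illustrates that special case only; it does not solve the three-string problem from the question, and the class name `TwoStringScs` is made up for illustration.

```java
class TwoStringScs {
    static String scs(String x, String y) {
        int n = x.length(), m = y.length();
        // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = x.charAt(i) == y.charAt(j)
                        ? 1 + lcs[i + 1][j + 1]
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both strings from the front: shared characters are emitted once,
        // otherwise emit from the side whose removal keeps the LCS intact.
        StringBuilder sb = new StringBuilder();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (x.charAt(i) == y.charAt(j)) {
                sb.append(x.charAt(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                sb.append(x.charAt(i++));
            } else {
                sb.append(y.charAt(j++));
            }
        }
        // One of the two strings may still have a tail left.
        sb.append(x, i, n).append(y, j, m);
        return sb.toString();
    }
}
```

The result has length `x.length() + y.length() - LCS(x, y)`. Folding three strings pairwise through `scs` gives a valid supersequence but is not guaranteed to be the shortest one.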

爷的心禁止访问
#3 · 2019-04-01 00:47

You can try finding the shortest combination like this:

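// Needs: import java.util.ArrayList; import java.util.List;
// The members below belong inside an enclosing class.
static final char[] CHARS = "acgt".toCharArray();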

public static void main(String[] ignored) {
    String A = "caagccacctacatca";
    String B = "cgagccatccgtaaagttg";
    String C = "agaacctgctaaatgctaga";
    String expected = "cagagaccatgccgtaaatgcattacga";

    List<String> ABC = new Combination(A, B, C).findShortest();
    System.out.println("expected: " + expected.length());
    System.out.println("Merged: " + ABC.get(0).length() + " " + ABC);
}

static class Combination {
    int shortest = Integer.MAX_VALUE;
    List<String> shortestStr = new ArrayList<>();
    char[][] chars;
    int[] pos;
    int count = 0;

    Combination(String... strs) {
        chars = new char[strs.length][];
        pos = new int[strs.length];
        for (int i = 0; i < strs.length; i++) {
            chars[i] = strs[i].toCharArray();
        }
    }

    public List<String> findShortest() {
        findShortest0(new StringBuilder(), pos);
        return shortestStr;
    }

    private void findShortest0(StringBuilder sb, int[] pos) {
        if (allDone(pos)) {
            if (sb.length() < shortest) {
                shortestStr.clear();
                shortest = sb.length();
            }
            if (sb.length() <= shortest)
                shortestStr.add(sb.toString());
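            // Increment once per complete candidate; incrementing twice would
            // keep count even, so the progress line would never print.
            count++;
            if (count % 100 == 1)
                System.out.println("Searched " + count + " shortest " + shortest);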
            return;
        }
        if (sb.length() + maxLeft(pos) > shortest)
            return;
        int[] pos2 = new int[pos.length];
        int i = sb.length();
        sb.append(' ');
        for (char c : CHARS) {
            if (!tryChar(pos, pos2, c)) continue;
            sb.setCharAt(i, c);
            findShortest0(sb, pos2);
        }
        sb.setLength(i);
    }

    private int maxLeft(int[] pos) {
        int maxLeft = 0;
        for (int i = 0; i < pos.length; i++) {
            int left = chars[i].length - pos[i];
            if (left > maxLeft)
                maxLeft = left;
        }
        return maxLeft;
    }

    private boolean allDone(int[] pos) {
        for (int i = 0; i < chars.length; i++)
            if (pos[i] < chars[i].length)
                return false;
        return true;
    }

    private boolean tryChar(int[] pos, int[] pos2, char c) {
        boolean matched = false;
        for (int i = 0; i < chars.length; i++) {
            pos2[i] = pos[i];
            if (pos[i] >= chars[i].length) continue;
            if (chars[i][pos[i]] == c) {
                pos2[i]++;
                matched = true;
            }

        }
        return matched;
    }
}

This finds many solutions that are shorter than the one suggested in the question. The final output is:

expected: 28
Merged: 27 [acgaagccatccgctaaatgctatcga, acgaagccatccgctaaatgctatgca, acgaagccatccgctaacagtgctaga, acgaagccatccgctaacatgctatga, acgaagccatccgctaacatgcttaga, acgaagccatccgctaacatgtctaga, acgaagccatccgctacaagtgctaga, acgaagccatccgctacaatgctatga, acgaagccatccgctacaatgcttaga, acgaagccatccgctacaatgtctaga, acgaagccatcgcgtaaatgctatcga, acgaagccatcgcgtaaatgctatgca, acgaagccatcgcgtaacagtgctaga, acgaagccatcgcgtaacatgctatga, acgaagccatcgcgtaacatgcttaga, acgaagccatcgcgtaacatgtctaga, acgaagccatcgcgtacaagtgctaga, acgaagccatcgcgtacaatgctatga, acgaagccatcgcgtacaatgcttaga, acgaagccatcgcgtacaatgtctaga, acgaagccatgccgtaaatgctatcga, acgaagccatgccgtaaatgctatgca, acgaagccatgccgtaacagtgctaga, acgaagccatgccgtaacatgctatga, acgaagccatgccgtaacatgcttaga, acgaagccatgccgtaacatgtctaga, acgaagccatgccgtacaagtgctaga, acgaagccatgccgtacaatgctatga, acgaagccatgccgtacaatgcttaga, acgaagccatgccgtacaatgtctaga, cagaagccatccgctaaatgctatcga, cagaagccatccgctaaatgctatgca, cagaagccatccgctaacagtgctaga, cagaagccatccgctaacatgctatga, cagaagccatccgctaacatgcttaga, cagaagccatccgctaacatgtctaga, cagaagccatccgctacaagtgctaga, cagaagccatccgctacaatgctatga, cagaagccatccgctacaatgcttaga, cagaagccatccgctacaatgtctaga, cagaagccatcgcgtaaatgctatcga, cagaagccatcgcgtaaatgctatgca, cagaagccatcgcgtaacagtgctaga, cagaagccatcgcgtaacatgctatga, cagaagccatcgcgtaacatgcttaga, cagaagccatcgcgtaacatgtctaga, cagaagccatcgcgtacaagtgctaga, cagaagccatcgcgtacaatgctatga, cagaagccatcgcgtacaatgcttaga, cagaagccatcgcgtacaatgtctaga, cagaagccatgccgtaaatgctatcga, cagaagccatgccgtaaatgctatgca, cagaagccatgccgtaacagtgctaga, cagaagccatgccgtaacatgctatga, cagaagccatgccgtaacatgcttaga, cagaagccatgccgtaacatgtctaga, cagaagccatgccgtacaagtgctaga, cagaagccatgccgtacaatgctatga, cagaagccatgccgtacaatgcttaga, cagaagccatgccgtacaatgtctaga, cagagaccatccgctaaatgctatcga, cagagaccatccgctaaatgctatgca, cagagaccatccgctaacagtgctaga, cagagaccatccgctaacatgctatga, cagagaccatccgctaacatgcttaga, cagagaccatccgctaacatgtctaga, cagagaccatccgctacaagtgctaga, cagagaccatccgctacaatgctatga, 
cagagaccatccgctacaatgcttaga, cagagaccatccgctacaatgtctaga, cagagaccatcgcgtaaatgctatcga, cagagaccatcgcgtaaatgctatgca, cagagaccatcgcgtaacagtgctaga, cagagaccatcgcgtaacatgctatga, cagagaccatcgcgtaacatgcttaga, cagagaccatcgcgtaacatgtctaga, cagagaccatcgcgtacaagtgctaga, cagagaccatcgcgtacaatgctatga, cagagaccatcgcgtacaatgcttaga, cagagaccatcgcgtacaatgtctaga, cagagaccatgccgtaaatgctatcga, cagagaccatgccgtaaatgctatgca, cagagaccatgccgtaacagtgctaga, cagagaccatgccgtaacatgctatga, cagagaccatgccgtaacatgcttaga, cagagaccatgccgtaacatgtctaga, cagagaccatgccgtacaagtgctaga, cagagaccatgccgtacaatgctatga, cagagaccatgccgtacaatgcttaga, cagagaccatgccgtacaatgtctaga, cagagccatcctagctaaagtgctaga, cagagccatcctagctaaatgctatga, cagagccatcctagctaaatgcttaga, cagagccatcctagctaaatgtctaga, cagagccatcctgactaaagtgctaga, cagagccatcctgactaaatgctatga, cagagccatcctgactaaatgcttaga, cagagccatcctgactaaatgtctaga, cagagccatcctgctaaatgctatcga, cagagccatcctgctaaatgctatgca, cagagccatcctgctaacagtgctaga, cagagccatcctgctaacatgctatga, cagagccatcctgctaacatgcttaga, cagagccatcctgctaacatgtctaga, cagagccatcctgctacaagtgctaga, cagagccatcctgctacaatgctatga, cagagccatcctgctacaatgcttaga, cagagccatcctgctacaatgtctaga]
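
The recursive search above can reach the same combination of positions through many different prefixes. A hedged sketch of an alternative that memoizes on the position tuple and returns only the optimal length (not the strings themselves) is shown below. The class name `ScsLength` is made up for illustration, and the code assumes the supplied alphabet contains every character of the inputs. For the three strings in the question the result can be compared against the length 27 reported above.

```java
import java.util.HashMap;
import java.util.Map;

class ScsLength {
    private final String[] strs;
    private final char[] alphabet;
    private final Map<Long, Integer> memo = new HashMap<>();

    ScsLength(String alphabet, String... strs) {
        this.alphabet = alphabet.toCharArray();
        this.strs = strs;
    }

    // Length of a shortest common supersequence of all inputs.
    int solve() {
        return solve(new int[strs.length]);
    }

    private int solve(int[] pos) {
        // Encode the position tuple as one mixed-radix number for the memo key.
        long key = 0;
        boolean done = true;
        for (int i = 0; i < strs.length; i++) {
            key = key * (strs[i].length() + 1) + pos[i];
            if (pos[i] < strs[i].length()) done = false;
        }
        if (done) return 0;
        Integer cached = memo.get(key);
        if (cached != null) return cached;

        int best = Integer.MAX_VALUE;
        for (char c : alphabet) {
            // Appending c advances every input that is waiting for c.
            int[] next = pos.clone();
            boolean matched = false;
            for (int i = 0; i < strs.length; i++) {
                if (next[i] < strs[i].length() && strs[i].charAt(next[i]) == c) {
                    next[i]++;
                    matched = true;
                }
            }
            if (matched) best = Math.min(best, 1 + solve(next));
        }
        if (best == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("alphabet does not cover the inputs");
        }
        memo.put(key, best);
        return best;
    }
}
```

Usage would look like `new ScsLength("acgt", A, B, C).solve()`. The number of distinct states is at most the product of (length + 1) over all inputs, so this stays small for three short strings but still grows exponentially with the number of strings.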
