I am struggling with a "find supersequence" algorithm.
The input is for set of strings
String A = "caagccacctacatca";
String B = "cgagccatccgtaaagttg";
String C = "agaacctgctaaatgctaga";
the result would be properly aligned set of strings (and next step should be merge)
String E = "ca ag cca cc ta cat c a";
String F = "c gag ccat ccgtaaa g tt g";
String G = " aga acc tgc taaatgc t a ga";
Thank you for any advice (I am sitting on this task for more than a day)
after merge the superstring would be
cagagaccatgccgtaaatgcattacga
The definition of supersequence in "this case" would be something like
The string R is contained in supersequence S if and only if all characters in a string R are present in supersequence S in the order in which they occur in the input sequence R.
The "solution" i tried (and again its the wrong way of doing it) is:
public class Solution4
{
static boolean[][] map = null;
static int size = 0;
public static void main(String[] args)
{
String A = "caagccacctacatca";
String B = "cgagccatccgtaaagttg";
String C = "agaacctgctaaatgctaga";
Stack data = new Stack();
data.push(A);
data.push(B);
data.push(C);
Stack clone1 = data.clone();
Stack clone2 = data.clone();
int length = 26;
size = max_size(data);
System.out.println(size+" "+length);
map = new boolean[26][size];
char[] result = new char[size];
HashSet<String> chunks = new HashSet<String>();
while(!clone1.isEmpty())
{
String a = clone1.pop();
char[] residue = make_residue(a);
System.out.println("---");
System.out.println("OLD : "+a);
System.out.println("RESIDUE : "+String.valueOf(residue));
String[] r = String.valueOf(residue).split(" ");
for(int i=0; i<r.length; i++)
{
if(r[i].equals(" ")) continue;
//chunks.add(spaces.substring(0,i)+r[i]);
chunks.add(r[i]);
}
}
for(String chunk : chunks)
{
System.out.println("CHUNK : "+chunk);
}
}
static char[] make_residue(String candidate)
{
char[] result = new char[size];
for(int i=0; i<candidate.length(); i++)
{
int pos = find_position_for(candidate.charAt(i),i);
for(int j=i; j<pos; j++) result[j]=' ';
if(pos==-1) result[candidate.length()-1] = candidate.charAt(i);
else result[pos] = candidate.charAt(i);
}
return result;
}
static int find_position_for(char character, int offset)
{
character-=((int)'a');
for(int i=offset; i<size; i++)
{
// System.out.println("checking "+String.valueOf((char)(character+((int)'a')))+" at "+i);
if(!map[character][i])
{
map[character][i]=true;
return i;
}
}
return -1;
}
static String move_right(String a, int from)
{
return a.substring(0, from)+" "+a.substring(from);
}
static boolean taken(int character, int position)
{ return map[character][position]; }
static void take(char character, int position)
{
//System.out.println("taking "+String.valueOf(character)+" at "+position+" (char_index-"+(character-((int)'a'))+")");
map[character-((int)'a')][position]=true;
}
static int max_size(Stack stack)
{
int max=0;
while(!stack.isEmpty())
{
String s = stack.pop();
if(s.length()>max) max=s.length();
}
return max;
}
}
Finding any common supersequence is not a difficult task:
In your example possible solution would be something like:
public class SuperSequenceTest {
}
However the real problem to solve is to find the solution for the known problem of Shortest Common Supersequence http://en.wikipedia.org/wiki/Shortest_common_supersequence, which is not that easy.
There is a lot of researches which concern the topic.
See for instance:
http://www.csd.uwo.ca/~lila/pdfs/Towards%20a%20DNA%20solution%20to%20the%20Shortest%20Common%20Superstring%20Problem.pdf
http://www.ncbi.nlm.nih.gov/pubmed/14534185
You can try finding the shortest combination like this
prints many solutions which are shorter than the one suggested.
expected: 28
Merged: 27 [acgaagccatccgctaaatgctatcga, acgaagccatccgctaaatgctatgca, acgaagccatccgctaacagtgctaga, acgaagccatccgctaacatgctatga, acgaagccatccgctaacatgcttaga, acgaagccatccgctaacatgtctaga, acgaagccatccgctacaagtgctaga, acgaagccatccgctacaatgctatga, acgaagccatccgctacaatgcttaga, acgaagccatccgctacaatgtctaga, acgaagccatcgcgtaaatgctatcga, acgaagccatcgcgtaaatgctatgca, acgaagccatcgcgtaacagtgctaga, acgaagccatcgcgtaacatgctatga, acgaagccatcgcgtaacatgcttaga, acgaagccatcgcgtaacatgtctaga, acgaagccatcgcgtacaagtgctaga, acgaagccatcgcgtacaatgctatga, acgaagccatcgcgtacaatgcttaga, acgaagccatcgcgtacaatgtctaga, acgaagccatgccgtaaatgctatcga, acgaagccatgccgtaaatgctatgca, acgaagccatgccgtaacagtgctaga, acgaagccatgccgtaacatgctatga, acgaagccatgccgtaacatgcttaga, acgaagccatgccgtaacatgtctaga, acgaagccatgccgtacaagtgctaga, acgaagccatgccgtacaatgctatga, acgaagccatgccgtacaatgcttaga, acgaagccatgccgtacaatgtctaga, cagaagccatccgctaaatgctatcga, cagaagccatccgctaaatgctatgca, cagaagccatccgctaacagtgctaga, cagaagccatccgctaacatgctatga, cagaagccatccgctaacatgcttaga, cagaagccatccgctaacatgtctaga, cagaagccatccgctacaagtgctaga, cagaagccatccgctacaatgctatga, cagaagccatccgctacaatgcttaga, cagaagccatccgctacaatgtctaga, cagaagccatcgcgtaaatgctatcga, cagaagccatcgcgtaaatgctatgca, cagaagccatcgcgtaacagtgctaga, cagaagccatcgcgtaacatgctatga, cagaagccatcgcgtaacatgcttaga, cagaagccatcgcgtaacatgtctaga, cagaagccatcgcgtacaagtgctaga, cagaagccatcgcgtacaatgctatga, cagaagccatcgcgtacaatgcttaga, cagaagccatcgcgtacaatgtctaga, cagaagccatgccgtaaatgctatcga, cagaagccatgccgtaaatgctatgca, cagaagccatgccgtaacagtgctaga, cagaagccatgccgtaacatgctatga, cagaagccatgccgtaacatgcttaga, cagaagccatgccgtaacatgtctaga, cagaagccatgccgtacaagtgctaga, cagaagccatgccgtacaatgctatga, cagaagccatgccgtacaatgcttaga, cagaagccatgccgtacaatgtctaga, cagagaccatccgctaaatgctatcga, cagagaccatccgctaaatgctatgca, cagagaccatccgctaacagtgctaga, cagagaccatccgctaacatgctatga, cagagaccatccgctaacatgcttaga, cagagaccatccgctaacatgtctaga, cagagaccatccgctacaagtgctaga, cagagaccatccgctacaatgctatga, cagagaccatccgctacaatgcttaga, cagagaccatccgctacaatgtctaga, cagagaccatcgcgtaaatgctatcga, cagagaccatcgcgtaaatgctatgca, cagagaccatcgcgtaacagtgctaga, cagagaccatcgcgtaacatgctatga, cagagaccatcgcgtaacatgcttaga, cagagaccatcgcgtaacatgtctaga, cagagaccatcgcgtacaagtgctaga, cagagaccatcgcgtacaatgctatga, cagagaccatcgcgtacaatgcttaga, cagagaccatcgcgtacaatgtctaga, cagagaccatgccgtaaatgctatcga, cagagaccatgccgtaaatgctatgca, cagagaccatgccgtaacagtgctaga, cagagaccatgccgtaacatgctatga, cagagaccatgccgtaacatgcttaga, cagagaccatgccgtaacatgtctaga, cagagaccatgccgtacaagtgctaga, cagagaccatgccgtacaatgctatga, cagagaccatgccgtacaatgcttaga, cagagaccatgccgtacaatgtctaga, cagagccatcctagctaaagtgctaga, cagagccatcctagctaaatgctatga, cagagccatcctagctaaatgcttaga, cagagccatcctagctaaatgtctaga, cagagccatcctgactaaagtgctaga, cagagccatcctgactaaatgctatga, cagagccatcctgactaaatgcttaga, cagagccatcctgactaaatgtctaga, cagagccatcctgctaaatgctatcga, cagagccatcctgctaaatgctatgca, cagagccatcctgctaacagtgctaga, cagagccatcctgctaacatgctatga, cagagccatcctgctaacatgcttaga, cagagccatcctgctaacatgtctaga, cagagccatcctgctacaagtgctaga, cagagccatcctgctacaatgctatga, cagagccatcctgctacaatgcttaga, cagagccatcctgctacaatgtctaga]