可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am looking for a fast algorithm for search purpose in a huge string (it's a organism genome sequence composed of hundreds of millions to billions of chars).

There are only 4 chars {A,C,G,T} present in this string, and "A" can only pair with "T" while "C" pairs with "G".

Now I am searching for two substrings (with length constraint of both substring between {minLen, maxLen}, and interval length between {intervalMinLen, intervalMaxLen}) that can pair with one another antiparallely.

For example, The string is: ATCAG GACCA TACGC CTGAT

Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10

The result should be

"ATCAG" pair with "CTGAT"
"TCAG" pair with "CTGA"

Thanks in advance.

Update: I already have the method to determine whether two string can pair with one another. The only concern is doing exhaustive search is very time consuming.

回答1:

I thought this was an interesting problem, so I put together a program based on considering 'foldings', which scans outward for possible symmetrical matches from different 'fold points'. If N is the number of nucleotides and M is 'maxInterval-minInterval', you should have running time O(N*M). I may have missed some boundary cases, so use the code with care, but it does work for the example provided. Note that I've used a padded intermediate buffer to store the genome, as this reduces the number of comparisons for boundary cases required in the inner loops; this trades off additional memory allocation for better speed. Feel free to edit the post if you make any corrections or improvements.

class Program
{
    public sealed class Pairing
    {
        public int Index { get; private set; }

        public int Length { get; private set; }

        public int Offset { get; private set; }

        public Pairing(int index, int length, int offset)
        {
            Index = index;
            Length = length;
            Offset = offset;
        }
    }

    public static IEnumerable<Pairing> FindPairings(string genome, int minLen, int maxLen, int intervalMinLen, int intervalMaxLen)
    {
        int n = genome.Length;
        var padding = new string((char)0, maxLen);
        var padded = string.Concat(padding, genome, padding);

        int start = (intervalMinLen + minLen)/2 + maxLen;
        int end = n - (intervalMinLen + minLen)/2 + maxLen;

        //Consider 'fold locations' along the genome
        for (int i=start; i<end; i++)
        {
            //Consider 'odd' folding (centered on index) about index i
            int k = (intervalMinLen+2)/2;
            int maxK = (intervalMaxLen + 2)/2;
            while (k<=maxK)
            {
                int matchLength = 0;
                while (IsPaired(padded[i - k], padded[i + k]) && (k <= (maxK+maxLen)))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i-k - maxLen, matchLength, 2*k - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }

            //Consider 'even' folding (centered before index) about index i
            k = (intervalMinLen+1)/2;
            while (k <= maxK)
            {
                int matchLength = 0;
                while (IsPaired(padded[i - (k+1)], padded[i + k]) && (k<=maxK+maxLen))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i - (k+1) - maxLen, matchLength, 2*k + 1 - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }
        }
    }

    private const int SumAT = 'A' + 'T';
    private const int SumGC = 'G' + 'C';
    private static bool IsPaired(char a, char b)
    {
        return (a + b) == SumAT || (a + b) == SumGC;
    }


    static void Main(string[] args)
    {
        string genome = "ATCAGGACCATACGCCTGAT";
        foreach (var pairing in FindPairings(genome, 4, 5, 9, 10))
        {
            Console.WriteLine("'{0}' pair with '{1}'",
                              genome.Substring(pairing.Index, pairing.Length),
                              genome.Substring(pairing.Index + pairing.Offset, pairing.Length));
        }
        Console.ReadKey();
    }


}

回答2:

I know you aren't searching for substrings, but I think it might be worthwhile to create a reversed genome string containing the matches; the task would then be to find common substrings in the two strings.

Example:

Original string

  ATCAG GACCA TACGC CTGAT

Reversed string:

  TAGTC CGCAT ACCAG GACTA

If you then transform the string to it's pairing values (replace T<->A and C<->G, you get something useful:

  ATCAG GCGTA TGGTC CTGAT

I know that this preprocessing will be costly and consume a lot of space, but you will be able to use standard string algorithms afterwards and with the amount of comparisons you are searching, it certainly will be justfied.

When the original string and the reversed lookup string I think your problem sounds surprisingly alike to the 'longest common substring' problem which is well described. Your second preprocessing would be to construct a suffix tree to allow fast lookup of substrings.

you will end up with quadratic running times, but I doubt you can do better

回答3:

Easiest solution would be to construct a Trie from the data of maximum height maxLen. Each node should also have a list of locations where that particular string was found (it will be in ascending order, if the trie is created in order).

Then, while searching, just reverse compliment the search string and go through the trie. When you find a match check if one of the matches is located in proper interval, if 'yes' then output the strings.

Let me know if you need the detailed algorithm.

回答4:

The method is described in http://www.ncbi.nlm.nih.gov/pubmed/11381028

Genome Res. 2001 Jun;11(6):1005-17. Segmental duplications: organization and impact within the current human genome project assembly.

回答5:

I'd consider converting the strings to binary in 16 bit lengths:

A = 101
T = 010
C = 110
G = 001

This allows up to 5 characters per 16 bit unit. This should be very fast compared to string searches and comparisons.

Use the top bit to determine if it's a 4 sequence unit or 5 sequence unit.

Now you can do a quick sort and get all the 4 sequence and 5 sequence units into their own groups (unless you need to find 4 sequence units that partial match against 5 sequence units).

For the comparison you can sort again, and you'll find that all the sequences that start with G will come before sequences that start with T, then A then C. You can then take a sequence and compare it to just the parts of the sorted array that are possible - which should be a much, much shorter problem space.

Further, the reason I've chosen those three bit encodings for the sequences is that you can simply xor the two strings together to see if they match. If you get 15 1's in sequence, then the two match perfectly. If not, then they don't.

You'll have to figure out the anti-parallel bit - I have a few thoughts on that, but this should get you started, and if the anit-parallel part becomes an issue, ask another question.