How to find all the unique substrings of a very lo

I have a very long string. I want to find all the unique substrings of this string. I tried to write the code where I used a set(python) to store all the substrings to ensure uniqueness. I am getting correct result for many medium and large strings however in case of very large strings, I am getting a MemoryError. I googled a bit and found out that the set data structure in python has a large RAM footprint and maybe thats why I am getting a MemoryError.

Here is my code :

a = set()
for i in range(n):
    string = raw_input()
    j = 1
    while True:
        for i in xrange(len(string)-j+1):   
            a.add(string[i:i+j])
        if j==len(string):   break
        j+=1
print sorted(list(a))

Is there a way to avoid this error for large strings? Or can anybody suggest a better modification in my code to handle this issue?

P.S: I donot have an option of shifting between 32 bit and 64 bit versions.

回答1:

If you really need it in memory, then you can try making a suffix tree. Tries are not exotic data structures, so there are probably good implementations available for a mainstream language like Python, and they can be used to implement suffix trees. Marisa-Trie is supposed to get good memory usage.

Create an empty trie.
For each n in [0, len(s)], add the suffix of length n to the Trie.
Every path from the root of the trie is a substring in the string, there are no such paths that are not substrings in the string, and paths are unique.

回答2:

Here is some Python code based on a O(n) suffix tree construction to produced the unique substrings from a collection of input strings (the output should appear in sorted order so there is no need to sort the strings afterwards).

As there can be O(n^2) output strings it may take a long time to actually output all the strings.

from collections import defaultdict

class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T=''
        self.E={}
        self.nodes=[-1]

    def add(self,s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.

        Produces an edge graph keyed by (node,character) that gives (first,last,end)
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin,first,last = 0,len(self.T),len(self.T)-1
        self.T+=s
        nc = len(self.nodes)
        self.nodes += [-1]*(2*len(s))
        T=self.T
        E=self.E
        nodes=self.nodes

        Lm1=len(T)-1
        for last_char_index in xrange(first,len(T)):
            c=T[last_char_index]
            last_parent_node = -1                    
            while 1:
                parent_node = origin
                if first>last:
                    if (origin,c) in E:
                        break             
                else:
                    key = origin,T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first+span
                    m = T[A+1]
                    if m==c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc,m] = (A+1,edge_last,edge_end)
                    parent_node = nc
                    nc+=1  
                E[parent_node,c] = (last_char_index, Lm1, nc)
                nc+=1  
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin==0:
                    first+=1
                else:
                    origin = nodes[origin]

                if first <= last:
                    edge_first,edge_last,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first

            if last_parent_node>0:
                nodes[last_parent_node] = parent_node
            last+=1
            if first <= last:
                    edge_first,edge_last,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = self.choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node
        for (origin,c),edge in self.E.items():
            choices[origin].append(c)
        choices=[sorted(s) for s in choices] # should not have any repeats by construction
        return choices

    def find_substrings(self,A,term):
        """Recurses through the tree appending unique substrings into A.
        Strings assumed to use term as the terminating character"""
        choices = self.make_choices()
        def f(node,depth):
            t=0
            for c in choices[node]:
                if c==term: continue
                first,last,end = self.E[node,c]
                # All end points along this edge result in new unique substrings
                edge_len = last-first+1
                a = first-depth
                for b in range(first,last+1):
                    if self.T[b]!=term:
                        A.append( self.T[a:b+1] )
                f(end,depth+edge_len)
            return t
        return f(0,0)

def fast_find_all_substrings(strings):
    S = SuffixTree()
    term = '\0'
    for string in strings:
        S.add(string+term)
    A=[]
    S.find_substrings(A,term)
    return A

A="abc","abcd","bca"
print fast_find_all_substrings(A)