From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implemented it in Python using a suffix array:
example = open("iliad10.txt").read()

def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)

idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  # VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found the idx.sort step very slow. I think it's slow because Python needs to pass the substring by value instead of by pointer (as the C code above does).
The test file can be downloaded from here.
The C code needs only 0.3 seconds to finish:
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never finishes on my computer (I waited 10 minutes and then killed it).
Does anyone have ideas for how to make this code efficient? (For example, under 10 seconds.)
My solution is based on a suffix array, constructed by prefix doubling of the longest common prefix. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The code is well documented inside the functions suffix_array and longest_common_substring. The latter function is short and can easily be modified, e.g. to search for the 10 longest non-overlapping substrings. This Python code is faster than the original C code (copy here) from the question if the duplicate strings are longer than 10000 characters.

I prefer this solution over a more complicated O(n log n) one because Python has very fast list sorting (list.sort), probably faster than the necessary linear-time operations in the method from that article, which should be O(n) only under very special assumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
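A hedged sketch of the prefix-doubling construction described above (my own minimal version, not the answer's documented suffix_array): suffixes are sorted by (rank of first k characters, rank of the next k), k doubles each round, and the sort inside the loop gives the O(n (log n)^2) bound.

def suffix_array(text):
    # Sketch of prefix doubling: sort suffix indices by the pair
    # (rank of first k chars, rank of next k chars), then double k.
    n = len(text)
    sa = list(range(n))
    rank = [ord(c) for c in text]
    k = 1
    while n > 1:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank in sorted order; equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa

suffix_array("banana")  # -> [5, 3, 1, 0, 4, 2]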
The code in another answer based on grow_chains is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical of classical literature, but they are frequent, e.g., in "independent" collections of school homework. The program should not freeze on them.
I wrote an example and tests with the same code for Python 2.7 and 3.3–3.6.
The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551

You'll have to use a memoryview instead to get a reference rather than a copy. When I did this, the program hung after the idx.sort call (which was very fast). I'm sure that with a little work, you can get the rest working.
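A quick way to see the difference (a minimal sketch; any large bytestring works):

data = b"x" * (8 * 1024 * 1024)
view = memoryview(data)

copy_slice = data[1:]   # bytes/str slicing allocates a ~8 MB copy
view_slice = view[1:]   # O(1): a new view into the same buffer, no copy
assert view_slice.tobytes() == copy_slice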
Edit:
The above change will not work as a drop-in replacement, because cmp does not work the same way as strcmp. For example, try the following C code, and compare the result to the equivalent Python:
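The original C and Python snippets aren't reproduced above. A Python-only reconstruction of the comparison: the test strings here are my guesses, chosen so the first bytes differ by 3 to match the results described below, and strcmp_like mimics what a typical C strcmp actually returns.

# Guessed test strings: adjacent suffixes whose first bytes differ by 3.
test1 = "ovided by The Internet Classics Archive"   # 'o' is 111
test2 = "rovided by The Internet Classics Archive"  # 'r' is 114

def strcmp_like(a, b):
    # A typical C strcmp returns the byte difference at the first
    # mismatch, not a normalized -1/0/1.
    for x, y in zip(a, b):
        if x != y:
            return ord(x) - ord(y)
    return len(a) - len(b)

print(strcmp_like(test1, test2))           # -3, like the C code
print((test1 > test2) - (test1 < test2))   # -1, like Python 2's cmp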
The C code prints -3 on my machine, while the Python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).

To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:

The above is interesting trivia, but not actually correct in terms of affecting either chunk of code. I realized this just as I shut my laptop and left a wifi zone... I really should double-check everything before I hit Save.

FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):

I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
The translation of the algorithm into Python: buffer() allows getting a substring without copying:
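The answer's code isn't reproduced here; a rough Python 2 sketch of that translation (function names are mine): sort zero-copy buffer() suffix views, then take the longest common prefix of adjacent ones. It assumes len(data) >= 2.

from itertools import islice, izip

def common_prefix_length(a, b):
    # Count matching leading bytes of two buffer views.
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def longest_duplicate(data):
    # buffer(data, i) behaves like data[i:] but shares memory with data.
    suffixes = sorted(buffer(data, i) for i in xrange(len(data)))
    best_len, best = 0, ""
    # The longest repeated substring is the longest common prefix of
    # some pair of neighbours in the sorted suffix list.
    for a, b in izip(suffixes, islice(suffixes, 1, None)):
        k = common_prefix_length(a, b)
        if k > best_len:
            best_len, best = k, a
    return str(best[:best_len])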
It takes 5 seconds on my machine for iliad.mb.txt.

In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with an LCP array.
Note: the *_memoryview() version is deprecated in favor of the *_buffer() version.

A more memory-efficient version (compared to longest_duplicate_small()):
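That version isn't shown here either; a rough sketch of one way to save memory (the body is mine, only the name echoes the *_buffer() naming above): sort suffix indices rather than suffix objects, comparing them through zero-copy buffer() views.

from itertools import islice, izip

def longest_duplicate_buffer(data):
    # Suffix array as sorted indices; comparisons go through buffer()
    # views, so no pile of suffix string copies is ever materialized.
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i))
    best_size, best_start = 0, 0
    for a, b in izip(sa, islice(sa, 1, None)):
        size = 0
        # Longest common prefix of suffixes a and b, by direct indexing.
        while a + size < n and b + size < n and data[a + size] == data[b + size]:
            size += 1
        if size > best_size:
            best_size, best_start = size, a
    return data[best_start:best_start + best_size]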
It takes 17 seconds on my machine for iliad.mb.txt. The result is:

I had to define custom functions to compare memoryview objects, because memoryview comparison either raises an exception in Python 3 or produces a wrong result in Python 2:
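A minimal sketch of such a helper (mine, not the answer's code); it indexes byte by byte, so nothing is copied, and it returns -1/0/1 in both Python 2 (items are one-byte strings) and Python 3 (items are ints):

def cmp_memoryview(a, b):
    # Lexicographic comparison of two memoryviews without copying.
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return -1 if a[i] < b[i] else 1
    # Equal prefix: the shorter view sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))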
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:

The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...
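A rough sketch of that grow-and-filter idea (my own minimal version, not the answer's grow_chains code):

from collections import defaultdict

def longest_duplicate(text):
    # Start with single characters, keep only those that occur at least
    # twice, then repeatedly grow every survivor by one character.
    length = 1
    groups = defaultdict(list)
    for i, ch in enumerate(text):
        groups[ch].append(i)
    survivors = dict((s, p) for s, p in groups.items() if len(p) > 1)
    best = ""
    while survivors:
        best = next(iter(survivors))  # all keys share the current length
        length += 1
        grown = defaultdict(list)
        for positions in survivors.values():
            for i in positions:
                if i + length <= len(text):
                    grown[text[i:i + length]].append(i)
        # Unique strings are removed again, as described above.
        survivors = dict((s, p) for s, p in grown.items() if len(p) > 1)
    return best

Each pass discards anything that no longer repeats, so the work shrinks quickly on ordinary text; a very long repeated block (like the 8 kB case mentioned in another answer) is exactly the input that keeps every pass large.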