I'm looking for a Python library for finding the longest common substring in a set of strings. There are two common ways to solve this problem:
- using suffix trees
- using dynamic programming.
The implementation method is not important; what matters is that it works on a set of strings, not only two.
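For context, here is a minimal brute-force sketch (not a library, and not one of the two methods above): it enumerates substrings of the shortest input, longest first, and returns the first one present in every string. The function name is my own.

```python
def longest_common_substring(strings):
    """Return the longest substring shared by every string in `strings`."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    n = len(shortest)
    # Try candidate lengths from longest to shortest; the first hit wins.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""
```

A suffix-tree or dynamic-programming solution would be asymptotically faster; this is only meant to pin down the expected behavior.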
If someone is looking for a generalized version that can also take a list of sequences of arbitrary objects:
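The code for this answer was stripped out; the following is my own hedged reconstruction of such a generalized version. It accepts lists, tuples, or strings of hashable items and returns the longest common contiguous run as a tuple (the name `long_substr_any` is an assumption, not from the original).

```python
def long_substr_any(seqs):
    """Longest common contiguous run across sequences of hashable items."""
    if not seqs:
        return ()

    def contains(sub, seq):
        # Naive containment test that works for lists/tuples, not just str.
        m = len(sub)
        return any(tuple(seq[i:i + m]) == sub for i in range(len(seq) - m + 1))

    shortest = min(seqs, key=len)
    best = ()
    n = len(shortest)
    for i in range(n):
        # Only consider candidates longer than the current best.
        for j in range(i + len(best) + 1, n + 1):
            cand = tuple(shortest[i:j])
            if all(contains(cand, s) for s in seqs):
                best = cand
            else:
                break  # extending a failing candidate cannot succeed
    return best
```

Because only `==` and slicing are used, the same function works on sequences of ints, tuples, or any other hashable objects.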
Disclaimer: this adds very little to jtjacques' answer. Hopefully, though, it is more readable and faster, and it didn't fit in a comment, which is why I'm posting it as an answer. I'm not satisfied with `shortest_of`, to be honest. This can be done shorter:
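The code block here was stripped out. Judging from the discussion of sets below, it intersected the substring sets of every input; a hedged reconstruction (the helper names are mine):

```python
def substrings(s):
    # All contiguous substrings of s, collected into a set.
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def long_substr(strings):
    # Intersect the substring sets of every input; the longest survivor wins.
    # Assumes `strings` is non-empty.
    common = set.intersection(*map(substrings, strings))
    return max(common, key=len) if common else ""
```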
Python sets are implemented as hash tables, which makes this approach somewhat memory-inefficient. If you (1) implemented the set datatype as a trie and (2) stored only the suffixes in the trie, forcing each node to be an endpoint (the equivalent of adding all substrings), then in theory this would be quite memory-efficient, especially since intersections of tries are easy to compute. Nevertheless, this version is short, and premature optimization is the root of a significant amount of wasted time.
I prefer this form for `is_substr`, as I find it a bit more readable and intuitive:
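The code was stripped from this answer as well; a plausible readable form (my reconstruction, using `all` with a generator expression) would be:

```python
def is_substr(find, data):
    """Return True if `find` occurs in every string of `data`."""
    if not data or not find:
        return False
    return all(find in item for item in data)
```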
These paired functions will find the longest common string in any arbitrary array of strings:
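The original code is missing here; the following is a hedged reconstruction of what the paired functions likely looked like, checking every substring of the first string against all the others:

```python
def is_substr(find, data):
    # True when `find` occurs in every string of `data`.
    if len(data) < 1 or len(find) < 1:
        return False
    for item in data:
        if find not in item:
            return False
    return True

def long_substr(data):
    substr = ""
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0]) - i + 1):
                # Keep the candidate only if it beats the current best.
                if j > len(substr) and is_substr(data[0][i:i + j], data):
                    substr = data[0][i:i + j]
    return substr
```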
No doubt the algorithm could be improved, and I've not had much exposure to Python, so perhaps it could be more efficient syntactically as well, but it should do the job.
EDIT: in-lined the second `is_substr` function as demonstrated by J.F. Sebastian. Usage remains the same; the algorithm is unchanged.
Hope this helps,
Jason.
From http://bitbucket.org/ned/cog/src/tip/cogapp/whiteutils.py