I'm trying to compare two 1000-byte strings and would like to know exactly where the difference starts, i.e. from which byte the strings differ. Is there any function to determine it?
This means that the first matching block is 12 characters long. If either `a` or `b` isn't 0, the strings differ right from the beginning.
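The `a`, `b`, and size values being described read like difflib's `Match` tuple; a small sketch that produces such output (the example strings are invented for illustration):

```python
from difflib import SequenceMatcher

# Invented example strings: the first 12 characters agree, the rest differ.
s1 = "abcdefghijklXYZ"
s2 = "abcdefghijklPQR"

m = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
print(m)  # -> Match(a=0, b=0, size=12)
```

Here `m.a` and `m.b` are both 0, so the common prefix runs up to index `m.size`, which is where the strings first differ.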
If you want something more complex, you can have a look at `SequenceMatcher`. It is a bit hairy, but very powerful. If you simply want to answer your question, the size of the first matching block is the solution :)
But if you want all the matches, use `get_matching_blocks`. In a small example with two 15-character strings whose first 12 characters are equal, it returns [Match(a=0, b=0, size=12), Match(a=15, b=15, size=0)], which means that the longest match is of size 12 and starts at the beginning (index 0) in both strings; the second, zero-length match at s1[15] is the dummy terminator that `get_matching_blocks` always appends.
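That small example can be reproduced like this (the strings are invented to match the described output):

```python
from difflib import SequenceMatcher

# Two 15-character strings that share their first 12 characters.
s1 = "abcdefghijklXYZ"
s2 = "abcdefghijklPQR"

blocks = SequenceMatcher(None, s1, s2).get_matching_blocks()
print(blocks)
# -> [Match(a=0, b=0, size=12), Match(a=15, b=15, size=0)]

# Since the first block starts at (0, 0), its size is the index of the
# first differing character.
print(blocks[0].size)  # -> 12
```

One caveat for 1000-byte inputs: for sequences of 200+ elements, `SequenceMatcher`'s default autojunk heuristic treats very frequent characters as junk, which can skew the blocks; passing `autojunk=False` avoids that.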
For big strings like yours, this is something that could be really interesting. :)
This might be overkill, but since you seem to be concerned about speed, you could consider using numpy. There are probably improvements to be made (for some reason, inlining made a 25 µs difference for me), but this is a first step:
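A sketch of the numpy idea, assuming the inputs are equal-length `bytes` objects (the function name is my own):

```python
import numpy as np

def first_diff_np(b1, b2):
    """Index of the first differing byte, or -1 if the inputs are equal."""
    a1 = np.frombuffer(b1, dtype=np.uint8)
    a2 = np.frombuffer(b2, dtype=np.uint8)
    diffs = np.nonzero(a1 != a2)[0]   # positions of every differing byte
    return int(diffs[0]) if diffs.size else -1

print(first_diff_np(b"a" * 1000, b"a" * 600 + b"b" * 400))  # -> 600
```

Note that the vectorized comparison always scans both buffers in full, which is why it loses to a fail-fast genexp when the difference is right at the start.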
For differences at the end, this is a lot faster than a genex:
It's a lot slower for differences at the very beginning...
But on average, it wins by an order of magnitude:
If speed is not a concern, though, then I'd personally go for mgilson's answer.
I've tried to test the answers given here, and I've come up with another solution that is faster in the usual cases, even though it is less elegant.
First of all, let's see how fast the proposed solutions are:
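A minimal `timeit` harness along these lines (the function names are mine; the numbers are machine-specific, so none are shown, and `autojunk=False` keeps `SequenceMatcher` from discarding the very frequent characters in long, repetitive test strings):

```python
import timeit

setup = """
import difflib

s1 = "a" * 1000
s2 = "a" * 500 + "b" + "a" * 499

def genexp(a, b):
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)

def with_difflib(a, b):
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sm.get_matching_blocks()[0].size
"""

for stmt in ("genexp(s1, s2)", "with_difflib(s1, s2)"):
    print(stmt, timeit.timeit(stmt, setup=setup, number=200))
```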
As you can see, the genexp is a lot faster than `difflib`, but this is probably due to the fact that `SequenceMatcher` does a lot more than finding the first non-equal character.

Now, how could we speed things up? Well, we can use "binary search"! The idea is that if two strings are not equal, then either the first halves differ or the second halves differ (or both, but in that case we only care about the first half, since we want the first differing index).
So we can do something like this:
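A sketch of that bisection (the function name is my own; it assumes the strings differ and have equal length):

```python
def first_diff_index(s1, s2):
    """Return the index of the first differing character, assuming
    s1 != s2 and len(s1) == len(s2)."""
    if len(s1) == 1:          # base case: a single differing character
        return 0
    mid = len(s1) // 2
    if s1[:mid] != s2[:mid]:  # the difference is in the first half
        return first_diff_index(s1[:mid], s2[:mid])
    # otherwise it must be in the second half
    return mid + first_diff_index(s1[mid:], s2[mid:])

s1 = "a" * 1000
s2 = "a" * 500 + "b" + "a" * 499
print(first_diff_index(s1, s2))  # -> 500
```

Each recursive call halves the slice it looks at, so only O(log n) Python-level steps happen; the slice comparisons themselves run at C speed.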
And the result: more than thirty-five times faster than the genexp!
Why does this work? The comparisons obviously take linear time, so it looks like we are doing a lot more work than before... and that's indeed true, but that work is done at the "C level", and thus this method is actually faster.
Note that this is somewhat "implementation specific": implementations such as PyPy could probably optimize the genexp into a single C-level loop, and that would beat anything; on implementations like Jython or IronPython it could also be a lot slower than on CPython.
This method has the same asymptotic complexity as the other methods, i.e. O(n). The strings are split in half at most log₂(n) times, and each time an equality test is done, which takes linear time. At first sight it may seem a Θ(n·log n) algorithm, but that's not the case: the recurrence is T(n) = T(n/2) + Θ(n), which solves to T(n) = Θ(n), since n + n/2 + n/4 + ... < 2n.

Some more results:
As you can see, even with huge strings this method is still about thirty times faster.
Note: the downside of this solution is that it is Θ(n) and not just O(n), i.e. it always reads the whole strings to return the result, even when the first character already differs. In fact:
This is to be expected. However, it takes very little for this solution to become more performant than the explicit loop:
According to this simple benchmark, with a big string if the difference is farther than about 1.3% of the total length, the binary check is better.
It is also possible to introduce some heuristics. For example, if the minimum length of the two strings is greater than a certain cutoff value, you first check only whether the prefixes up to that cutoff are different; if they are, you can disregard everything after the cutoff, avoiding a comparison of the whole strings. This can be trivially implemented:
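A sketch of that cutoff heuristic layered on the same bisection (the names and the default cutoff of 512 are my own choices; it assumes equal-length strings that actually differ):

```python
def first_diff_index(s1, s2):
    # Bisection-based search, as above; assumes s1 != s2, equal lengths.
    if len(s1) == 1:
        return 0
    mid = len(s1) // 2
    if s1[:mid] != s2[:mid]:
        return first_diff_index(s1[:mid], s2[:mid])
    return mid + first_diff_index(s1[mid:], s2[mid:])

def first_diff_with_cutoff(s1, s2, cutoff=512):
    # Check a short prefix first: if it already differs, the rest of the
    # two strings never has to be compared at all.
    if s1[:cutoff] != s2[:cutoff]:
        return first_diff_index(s1[:cutoff], s2[:cutoff])
    return cutoff + first_diff_index(s1[cutoff:], s2[cutoff:])
```

For example, `first_diff_with_cutoff("b" + "a" * 999, "a" * 1000)` settles the answer after comparing only the two 512-byte prefixes.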
Depending on the application you can choose a cutoff that minimizes the average time. In any case this is an ad hoc heuristic that may or may not work well, so if you are dealing with very long strings that share only short common prefixes, you should use a "fail-fast" algorithm like the genexp approach.

Timings were performed on Python 3.4. Using bytes instead of unicode strings doesn't change the results significantly.
In Python 2.x, `zip()` returns a list of tuples, not an iterator. As gnibbler pointed out, if you're using Python 2.x, it might be worth your while to use `izip` rather than `zip` (`izip` returns a nice, memory-efficient iterator that avoids evaluating all of the values at once). As I said in the comments though, in Python 3, `izip` has been renamed `zip` and the old `zip` is gone.

Maybe use `next` plus a generator? This will give you the index where the difference starts and raise `StopIteration` if the strings are equal. It might even be slightly more elegant with `itertools.izip`. Example:
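A sketch of the approach (Python 3, where the already-lazy plain `zip` stands in for `izip`; the sample strings are invented):

```python
s1 = "hello world"
s2 = "hello wOrld"

# Lazily walk the paired characters and stop at the first mismatch;
# this raises StopIteration if the strings are equal.
idx = next(i for i, (a, b) in enumerate(zip(s1, s2)) if a != b)
print(idx)  # -> 7
```

Passing a default, e.g. `next(gen, -1)`, avoids the `StopIteration` when the strings are equal.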