Python string comparison: pointing to the first difference

Posted 2020-02-01 15:15

I'm trying to compare two 1000-byte strings and would like to know exactly where the difference starts, i.e. at which byte the strings first differ. Is there a function to determine that?

Tags: python
6 answers
Rolldiameter
#2 · 2020-02-01 15:27
>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the car'
>>> import difflib
>>> next(difflib.SequenceMatcher(a=s1, b=s2).get_matching_blocks())
Match(a=0, b=0, size=12)

This means the first matching block is 12 characters long, so the two strings first differ at index 12.

If either a or b in that first match isn't 0, the strings differ right from the beginning.

神经病院院长
#3 · 2020-02-01 15:34

If you want something more elaborate, have a look at SequenceMatcher.

It is a bit hairy, but very powerful. If you simply want to answer your question, then:

from difflib import SequenceMatcher

s1 = 'stop at the bus'
s2 = 'stop at the car'

s = SequenceMatcher(None, s1, s2)

print(s.get_matching_blocks()[0].size)

prints the solution (12 here) :)

But if you want all the matches, here is a small example:

from difflib import SequenceMatcher

s1 = 'stop at the bus'
s2 = 'stop at the car'

s = SequenceMatcher(None, s1, s2)

print(s.get_matching_blocks())

returns

[Match(a=0, b=0, size=12), Match(a=15, b=15, size=0)]

which means that the longest match in your strings is of size 12 and starts at the beginning (index 0). The zero-size match at s1[15], s2[15] is a sentinel: get_matching_blocks always appends one as the last element.

For big strings like yours, this is something that could be really interesting. :)

疯言疯语
#4 · 2020-02-01 15:36

This might be overkill, but since you seem to be concerned about speed, you could consider using numpy. There are probably improvements to be made (for some reason inlining made a 25 us difference for me), but this is a first step:

>>> import numpy
>>> import string  # Python 2 here: string.lowercase, and izip below
>>> def diff_index(s1, s2):
...     s1 = numpy.fromstring(s1, dtype=numpy.uint8)
...     s2 = numpy.fromstring(s2, dtype=numpy.uint8)
...     return (~(s1 == s2)).nonzero()[0][0]
... 
>>> base = string.lowercase * 385
>>> s1 = base + 'a'
>>> s2 = base + 'z'
>>> diff_index(s1, s2)
10010

For differences at the end, this is a lot faster than a genex:

>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
1000 loops, best of 3: 1.46 ms per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.6 us per loop

It's a lot slower for differences at the very beginning...

>>> s1 = 'a' + base
>>> s2 = 'z' + base
>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
100000 loops, best of 3: 2.12 us per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.5 us per loop

But for a difference in the middle, it wins by an order of magnitude:

>>> s1 = base[:5000] + 'a' + base[5000:]
>>> s2 = base[:5000] + 'z' + base[5000:]
>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
1000 loops, best of 3: 724 us per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.2 us per loop

If speed is not a concern, though, then I'd personally go for mgilson's answer.
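On Python 3, numpy.fromstring is deprecated for this use (and the transcript above relies on Python 2's string.lowercase and izip). A rough Python 3 equivalent of the same idea uses frombuffer on encoded bytes; the function name, the equal-length requirement, and the ASCII assumption are mine:

```python
import numpy as np

def diff_index_np(s1, s2):
    """First index where two equal-length ASCII strings differ, or -1 if
    they are equal. (My Python 3 variant of the answer above, swapping
    the deprecated fromstring for frombuffer.)"""
    a = np.frombuffer(s1.encode('ascii'), dtype=np.uint8)
    b = np.frombuffer(s2.encode('ascii'), dtype=np.uint8)
    # Indices of all mismatching positions; the first one is the answer.
    mismatches = np.nonzero(a != b)[0]
    return int(mismatches[0]) if mismatches.size else -1
```

For byte strings you can pass the buffers directly to frombuffer and skip the encode step.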

chillily
#5 · 2020-02-01 15:39

I've tried to test the answers given here and came up with another solution that is (in the usual cases) faster, even though it is less elegant.

First of all, let's see how fast the proposed solutions are:

In [15]: def check_genexp(a, b):
    ...:     return next(idx for idx, c in enumerate(a) if c != b[idx])

In [16]: %timeit check_genexp("a"*9999 + "b", "a"*9999 + "c")
1000 loops, best of 3: 1.04 ms per loop

In [17]: from difflib import SequenceMatcher

In [18]: def check_matcher(a, b):
    ...:     return next(SequenceMatcher(a=a, b=b).get_matching_blocks())
    ...: 

In [19]: %timeit check_matcher("a"*9999+"b", "a"*9999+"c")
100 loops, best of 3: 11.5 ms per loop

As you can see, the genexp is a lot faster than difflib, but that's probably because SequenceMatcher does a lot more than find the first non-equal character.

Now, how could we speed things up? We can use binary search! The idea is that if two strings are not equal, then either their first halves differ or their second halves differ (or both, but in that case we only care about the first half, since we want the first differing index).

So we can do something like this:

def binary_check(a, b):
    len_a, len_b = len(a), len(b)
    if len_a == len_b:
        return binary_check_helper(a, b)
    min_length, max_length = min(len_a, len_b), max(len_a, len_b)
    res = binary_check_helper(a[:min_length], b[:min_length])
    return res if res >= 0 else min_length

def binary_check_helper(a, b):
    if a == b:
        return -1
    length = len(a)

    if length == 1:
        # a != b and both have length 1, so they differ at index 0
        return 0
    else:
        half_length = length // 2
        r = binary_check_helper(a[:half_length], b[:half_length])
        if r >= 0:
            return r
        r = binary_check_helper(a[half_length:], b[half_length:])
        if r >= 0:
            return r + half_length
        return r

And the result:

In [34]: %timeit binary_check("a"*9999 + "b", "a"*9999 + "c")
10000 loops, best of 3: 28.4 µs per loop

That's more than thirty-five times faster than the genexp!

Why does this work? The comparisons still take linear time, so it looks like we're doing more work than before... and that's true, but that work happens at C level (the == on whole slices), so in practice this method ends up faster.

Note that this is somewhat implementation specific: implementations such as PyPy could probably optimize the genexp into a single C-level loop, and that would beat anything; on implementations like Jython or IronPython it could be a lot slower than on CPython.

This method has the same asymptotic complexity as the other methods, i.e. O(n). The strings are split in half at most log_2(n) times, and an equality test taking linear time is done at each level. At first sight it may seem a Θ(n log n) algorithm, but that's not the case; the recurrence is:

T(n) = T(n/2) + Θ(n) = Σ_{i=0}^{log n} Θ(n / 2^i)
     = Θ(n · (1 + 1/2 + 1/4 + ...)) <= Θ(2n) = Θ(n)

Some more results:

In [37]: %timeit binary_check("a"*10**6 + "b", "a"*10**6 + "c")
100 loops, best of 3: 2.84 ms per loop

In [38]: %timeit check_genexp("a"*10**6 + "b", "a"*10**6 + "c")
10 loops, best of 3: 101 ms per loop

In [39]: %timeit binary_check(15 * "a"*10**6 + "b", 15 * "a"*10**6 + "c")
10 loops, best of 3: 53.3 ms per loop

In [40]: %timeit check_genexp(15 * "a"*10**6 + "b", 15 * "a"*10**6 + "c")
1 loops, best of 3: 1.5 s per loop

As you can see, even with huge strings this method is still about thirty times faster.

Note: the downside of this solution is that it is Θ(n) and not merely O(n): it always reads both strings in full to produce the result, even when the very first character already differs. In fact:

In [49]: a = "b" + "a" * 15 * 10**6
    ...: b = "c" + "a" * 15 * 10**6
    ...: 

In [50]: %timeit binary_check(a, b)
100 loops, best of 3: 10.3 ms per loop

In [51]: %timeit check_genexp(a, b)
1000000 loops, best of 3: 1.3 µs per loop

This is to be expected. However, it doesn't take much for this solution to become faster than the explicit loop:

In [59]: a = "a" * 2 * 10**5 + "b" + "a" * 15*10**6
    ...: b = "a" * 2 * 10**5 + "c" + "a" * 15*10**6

In [60]: %timeit check_genexp(a, b)
10 loops, best of 3: 20.3 ms per loop

In [61]: %timeit binary_check(a, b)
100 loops, best of 3: 17.3 ms per loop

According to this simple benchmark, for a big string the binary check wins whenever the difference is farther in than about 1.3% of the total length.

It is also possible to introduce some heuristics. For example, if the minimum length of the two strings is greater than a certain cutoff, you can first check whether the prefixes up to that cutoff differ; if they do, the first difference must lie inside them, so you can disregard everything after the cutoff and avoid comparing the whole strings. This is trivial to implement:

def binary_check2(a, b, cutoff=1000):
    len_a, len_b = len(a), len(b)
    if min(len_a, len_b) > cutoff:
        small_a, small_b = a[:cutoff], b[:cutoff]
        if small_a != small_b:
            return binary_check_helper(a[:cutoff], b[:cutoff])
    # same as before

Depending on the application you can choose a cutoff that minimizes the average time. In any case this is an ad hoc heuristic that may or may not work well; if you are dealing with very long strings that share only short common prefixes, you should use a fail-fast algorithm such as the genexp approach.
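Another option along the same lines (my sketch, not from the answer above): scan in fixed-size chunks, relying on the C-level == for slices, and only fall back to a character-by-character loop inside the first chunk that differs. This keeps most comparisons fast while still failing fast on early differences:

```python
def chunked_check(a, b, chunk=4096):
    """First differing index between two equal-length strings, or -1 if
    they are equal. (Hypothetical alternative: chunk-sized slices are
    compared at C level; only the first mismatching chunk is scanned
    character by character.)"""
    for start in range(0, len(a), chunk):
        ca, cb = a[start:start + chunk], b[start:start + chunk]
        if ca != cb:
            for i, (x, y) in enumerate(zip(ca, cb)):
                if x != y:
                    return start + i
    return -1
```

The chunk size trades off fail-fast granularity against the number of Python-level loop iterations; 4096 is an arbitrary choice, not a tuned value.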


Timings were performed on Python 3.4. Using bytes instead of unicode strings doesn't significantly change the results.

看我几分像从前
#6 · 2020-02-01 15:42
for i, (x, y) in enumerate(zip(a, b)):
    if x != y:
        print('First index where strings are different:', i)
        break
else:
    print('Strings are identical.')

In Python 2.x, zip() returns a list of tuples, not an iterator. As gnibbler pointed out, if you're using Python 2.x it might be worth using izip rather than zip (izip returns a memory-efficient iterator that avoids building all of the pairs at once). As I said in the comments though, in Python 3 izip has been renamed zip and the old zip is gone.
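One caveat with zip: it stops at the shorter input, so if one string is a proper prefix of the other the loop finds no mismatch even though the strings differ. A sketch that handles that case (the function name first_diff is mine):

```python
def first_diff(a, b):
    """Index of the first difference between a and b, or -1 if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the common prefix: the strings differ exactly when
    # their lengths do, and then the first difference is at the end of
    # the shorter one.
    return -1 if len(a) == len(b) else min(len(a), len(b))
```

For the question's fixed-length 1000-byte strings this extra check never triggers, but it makes the helper safe for general inputs.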

We Are One
#7 · 2020-02-01 15:51

Maybe use next plus a generator?

next(idx for idx,c in enumerate(your_string1) if c != your_string2[idx])

This will give you the index where the difference starts, and raise StopIteration if no difference is found.

It might even be slightly more elegant with itertools.izip (Python 2; on Python 3 use plain zip):

next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)

Example:

>>> from itertools import izip
>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the car'
>>> next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
12
>>> s1[12]
'b'
>>> s2[12]
'c'
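If you'd rather not deal with StopIteration when the strings are equal, next accepts a default as its second argument (Python 3's zip shown here; using -1 as the "no difference" sentinel is my choice):

```python
s1 = 'stop at the bus'
s2 = 'stop at the car'

# -1 means no difference was found in the common prefix
idx = next((i for i, (c1, c2) in enumerate(zip(s1, s2)) if c1 != c2), -1)
print(idx)  # 12
```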