Runtime of python's if substring in string

2019-01-24 07:39发布

问题:

What is the big O of the following if statement?

if "pl" in "apple":
   ...

What is the overall big O of how python determines if the string "pl" is found in the string "apple"

or any other substring in string search.

Is this the most efficient way to test if a substring is in a string? Does it use the same algorithm as .find()?

回答1:

In python 3.4.2 it looks like they are resorting to the same function, but there may be difference in timing nevertheless. For example s.find first is required to lookup the find method of the string and such.

The algorithm used is a mix between Boyer-More and Horspool.



回答2:

The time complexity is O(N) on average, O(NM) worst case (N being the length of the longer string, M, the shorter string you search for).

The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of the Boyer-Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.

See the original stringlib discussion post, as well as the fastsearch.h source code; the algorithm has not changed since introduction in Python 2.5.

The post includes a Python-code outline of the algorithm:

def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found

as well as speed comparisons.



回答3:

You can use timeit and test it yourself:

maroun@DQHCPY1:~$ python -m timeit 's = "apple";s.find("pl")'
10000000 loops, best of 3: 0.125 usec per loop
maroun@DQHCPY1:~$ python -m timeit 's = "apple";"pl" in s'
10000000 loops, best of 3: 0.0371 usec per loop

Using in is indeed faster (0.0371 usec compared to 0.125 usec).

For actual implementation, you can look at the code itself.



回答4:

I think the best way to find out is to look at the source. This looks like it would implement __contains__:

static int
bytes_contains(PyObject *self, PyObject *arg)
{
    Py_ssize_t ival = PyNumber_AsSsize_t(arg, PyExc_ValueError);
    if (ival == -1 && PyErr_Occurred()) {
        Py_buffer varg;
        Py_ssize_t pos;
        PyErr_Clear();
        if (PyObject_GetBuffer(arg, &varg, PyBUF_SIMPLE) != 0)
            return -1;
        pos = stringlib_find(PyBytes_AS_STRING(self), Py_SIZE(self),
                             varg.buf, varg.len, 0);
        PyBuffer_Release(&varg);
        return pos >= 0;
    }
    if (ival < 0 || ival >= 256) {
        PyErr_SetString(PyExc_ValueError, "byte must be in range(0, 256)");
        return -1;
    }

    return memchr(PyBytes_AS_STRING(self), (int) ival, Py_SIZE(self)) != NULL;
}

in terms of stringlib_find(), which uses fastsearch().