Python: Improving sub-string search by embedding C code

Posted 2019-07-25 00:20

Question:

I am extending my previous question, python efficient substring search.

I am interested in improving the performance of the sub-string search implementation.

Some of the answers to my previous question pointed out that substring search is implemented using fastsearch, which uses an algorithm inspired by Boyer–Moore; here is the source code.

Other answers pointed me to Python implementations of the Boyer–Moore and Rabin–Karp algorithms.

Would it be efficient to embed C code providing a good implementation of substring search based on those algorithms (Boyer–Moore, Rabin–Karp)?

Answer 1:

You haven't specified what you mean by 'efficient'. What tradeoffs are you willing to make? Would you be prepared to pay a price in performance loss when initializing a new string? When starting the search? Would you trade more memory for more speed?

The Python developers set clear goals when they developed the Python string search implementation:

  • should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
  • small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
  • sublinear search behaviour in good cases (O(n/m))
  • no worse than the current algorithm in worst case (O(nm))
  • should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
  • many real-life searches should be good, very few should be worst case
  • reasonably simple implementation

So the devs set some limits on performance for the search case and the setup case, on storage requirements, and also on maintenance efficiency. Those boundaries ruled out Boyer–Moore (as it requires preprocessing on the searched-for string, a startup cost and a storage cost), and although I see no evidence that the devs considered Rabin–Karp, it can be ruled out on the same grounds (you need to create the hashes and store these).
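For illustration, here is a minimal pure-Python sketch of the Boyer–Moore–Horspool simplification (illustrative only, not CPython's actual fastsearch code); the shift table built at the start of each call is exactly the kind of per-pattern preprocessing and extra storage the goals above rule out:

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool sketch; illustrative only, not CPython's code."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Preprocessing: for every pattern character (except the last), record how
    # far the window may shift when that character sits under the pattern's
    # last position. This is the O(m) setup and extra storage per search call.
    shift = {pattern[k]: m - 1 - k for k in range(m - 1)}
    i = m - 1                       # text index aligned with the pattern's end
    while i < n:
        j = 0
        while j < m and text[i - j] == pattern[m - 1 - j]:
            j += 1
        if j == m:
            return i - m + 1        # match starts here
        i += shift.get(text[i], m)  # unseen character: skip the whole pattern
    return -1

print(horspool_search("hello world", "world"))  # -> 6, same as str.find
```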

The boundaries were set based on a lot of Python internals and usage experience. The above summary wasn't pulled out of thin air; it is merely a summary of that experience.

Now, if you have a specific case where your trade-offs can be set differently, then sure, a C implementation of a different algorithm could well beat the standard Python implementation. But it'll be more efficient according to a different set of criteria.
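If you do want to experiment with embedding C, one low-friction route is ctypes: compile a small shared library and call it from Python. The following is only a sketch under assumptions (a Unix-like system with a `cc` compiler on PATH); the C function is a naive placeholder where a Boyer–Moore or Rabin–Karp implementation would go:

```python
import ctypes
import os
import subprocess
import tempfile
import textwrap

C_SOURCE = textwrap.dedent("""
    #include <string.h>
    /* Return index of needle in haystack, or -1. Naive placeholder:
       swap in Boyer-Moore / Rabin-Karp here if you need them. */
    long search(const char *haystack, long n, const char *needle, long m) {
        for (long i = 0; i + m <= n; i++) {
            if (memcmp(haystack + i, needle, (size_t)m) == 0) {
                return i;
            }
        }
        return -1;
    }
""")

def build_library(tmpdir):
    # Write the C source and compile it into a shared library with cc.
    src = os.path.join(tmpdir, "search.c")
    lib = os.path.join(tmpdir, "libsearch.so")
    with open(src, "w") as f:
        f.write(C_SOURCE)
    subprocess.check_call(["cc", "-O2", "-shared", "-fPIC", "-o", lib, src])
    return lib

with tempfile.TemporaryDirectory() as tmpdir:
    lib = ctypes.CDLL(build_library(tmpdir))
    lib.search.restype = ctypes.c_long
    lib.search.argtypes = [ctypes.c_char_p, ctypes.c_long,
                           ctypes.c_char_p, ctypes.c_long]
    text, pat = b"hello world", b"world"
    print(lib.search(text, len(text), pat, len(pat)))  # -> 6
```

Even then, you would have to benchmark it against str.find, which already runs tuned C; the ctypes call overhead and the bytes conversion can easily eat any algorithmic gain on short strings.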

In any case, the Python search algorithm is designed for the small-strings case. If you try to apply it to a large body of text, it will not perform as well as an algorithm that makes different choices suited to large texts. And if you had to search for text across 10,000,000 documents, you'd want to use some kind of indexing solution instead of the puny little Python string search.
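To make "indexing solution" concrete, here is a toy inverted index sketch (illustrative only; real systems would reach for something like Lucene/Elasticsearch, suffix arrays, or trigram indexes):

```python
from collections import defaultdict

# Toy corpus; in practice this would be your 10,000,000 documents.
docs = {
    1: "the quick brown fox",
    2: "lazy dogs sleep all day",
    3: "quick brown dogs",
}

# Build the index once: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# A query now touches only candidate documents instead of scanning everything.
print(index["quick"] & index["brown"])  # -> {1, 3}
```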

Compare that to sorting a list of 100 items with the default sort implementation, vs. sorting 10,000,000,000 integers. In the latter case there are sorting implementations that can easily beat the default Python offering.

It should also be noted that Python has a history of algorithm innovation; the standard sort algorithm in Python is Timsort, a new algorithm invented by Tim Peters to fit the pragmatic real-life circumstances the Python interpreter has to deal with. That algorithm has since been made the default in Java and on the Android platform as well. Thus, I tend to trust the Python core devs' decisions.

As far as I know, no one has embedded a different implementation, as replacing the default is not going to work without patching the Python C code. You can easily create a specialized string type that implements a different search algorithm, of course. There may well be libraries out there that use C to implement specialized search algorithms such as Boyer–Moore, Rabin–Karp or any other algorithm, as that might well be the better choice for their specific problem domain.
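As a hedged illustration of that last point, a "specialized string type" could be as simple as a str subclass whose find() delegates to a pluggable search function (pure Python below; the plug-in could just as well be a C-backed function wrapped with ctypes or cffi). The class and function names here are made up for this sketch:

```python
class SearchableStr(str):
    """A str subclass that delegates find() to a pluggable search algorithm."""

    def __new__(cls, value, search_func):
        obj = super().__new__(cls, value)
        obj._search = search_func  # e.g. Boyer-Moore, Rabin-Karp, C-backed...
        return obj

    def find(self, sub):
        # Note: the real str.find also takes start/end slice arguments;
        # this sketch keeps only the simple form.
        return self._search(str(self), str(sub))


def naive_search(text, pattern):
    # Placeholder algorithm; swap in whatever implementation you want to test.
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            return i
    return -1


s = SearchableStr("hello world", naive_search)
print(s.find("world"))  # -> 6
```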