Are there any guarantees about the splitting order?

Posted 2020-03-07 03:42

Question:

According to the Python 2.7 docs, using str.split() with maxsplit specified will split a string up to maxsplit times.

However, it never explicitly specifies that these splits will be executed left to right. There is a related function str.rsplit() that guarantees right to left split ordering.

Aside from doing string reverse followed by str.rsplit(), is there any way to guarantee a left to right splitting order? Are there any situations where str.split() will NOT use a left to right order?

Answer 1:

If you're looking for guarantees that splitting with the maxsplit argument proceeds from left to right, you only need to look at the built-in Python test suite.

Here's an excerpt:

    self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|')
    self.checkequal(['a|b|c|d'], 'a|b|c|d', 'split', '|', 0)
    self.checkequal(['a', 'b|c|d'], 'a|b|c|d', 'split', '|', 1)
    self.checkequal(['a', 'b', 'c|d'], 'a|b|c|d', 'split', '|', 2)
    self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|', 3)
    self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|', 4)
    self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|',
                    sys.maxsize-2)
    self.checkequal(['a|b|c|d'], 'a|b|c|d', 'split', '|', 0)
    self.checkequal(['a', '', 'b||c||d'], 'a||b||c||d', 'split', '|', 2)
    self.checkequal(['abcd'], 'abcd', 'split', '|')
    self.checkequal([''], '', 'split', '|')
    self.checkequal(['endcase ', ''], 'endcase |', 'split', '|')
    self.checkequal(['', ' startcase'], '| startcase', 'split', '|')
    self.checkequal(['', 'bothcase', ''], '|bothcase|', 'split', '|')
    self.checkequal(['a', '', 'b\x00c\x00d'], 'a\x00\x00b\x00c\x00d', 'split', '\x00', 2)

From the tests, it is clear that any implementation that split in a different order would fail them.
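
To see the same behaviour outside the test suite, here is a minimal sketch that compares split and rsplit on one of the strings used above. The variable names (text, left, right, via_reverse) are only illustrative and do not come from the test suite; the last line reproduces the reverse-then-rsplit workaround mentioned in the question for a single-character separator.

```python
text = 'a|b|c|d'

# split() consumes separators starting from the left end of the string.
left = text.split('|', 2)

# rsplit() consumes separators starting from the right end.
right = text.rsplit('|', 2)

# The workaround from the question: reverse the string, rsplit it,
# then reverse every piece and the list itself.
# (Single-character separator only; a longer one would need reversing too.)
via_reverse = [piece[::-1] for piece in text[::-1].rsplit('|', 2)][::-1]

print(left)         # ['a', 'b', 'c|d']
print(right)        # ['a|b', 'c', 'd']
print(via_reverse)  # ['a', 'b', 'c|d']
```

The workaround gives the same result as a plain split, so it adds nothing when split already works from the left.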



Answer 2:

CPython is considered to be the reference implementation of Python. According to the CPython source code, str.split is guaranteed to split in left-to-right order. You can look up how str.split is implemented here: http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup

For example, in stringlib_split_char (and likewise in stringlib_split_whitespace; both are used by stringlib_split, which implements str.split), one can clearly see that the string is processed from left to right. The indexes i and j both start at zero and are only ever incremented. maxsplit does not affect how the indexes are treated; it only provides an early exit from the loop:

Py_LOCAL_INLINE(PyObject *)
stringlib_split_char(PyObject* str_obj,
                     const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                     const STRINGLIB_CHAR ch,
                     Py_ssize_t maxcount)
{
    // ... some code omitted

    i = j = 0;
    while ((j < str_len) && (maxcount-- > 0)) {
        for(; j < str_len; j++) {
            /* I found that using memchr makes no difference */
            if (str[j] == ch) {
                SPLIT_ADD(str, i, j);
                i = j = j + 1;
                break;
            }
        }
    }
    // ... some code omitted
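
For readers who do not want to trace the C code, the loop above can be modelled in pure Python. This is only a sketch: split_char_model is a hypothetical name, and the final append assumes that the omitted tail of the C function adds the remainder of the string as the last piece.

```python
def split_char_model(s, ch, maxcount=-1):
    """Illustrative pure-Python model of the left-to-right scan."""
    if maxcount < 0:
        maxcount = len(s)  # no limit requested: more splits than can occur
    pieces = []
    i = j = 0  # both indexes start at the left end, as in the C code
    while j < len(s) and maxcount > 0:
        maxcount -= 1
        while j < len(s):
            if s[j] == ch:
                pieces.append(s[i:j])  # plays the role of SPLIT_ADD(str, i, j)
                i = j = j + 1
                break
            j += 1
    # Assumption: the omitted C code appends whatever is left over.
    pieces.append(s[i:])
    return pieces


print(split_char_model('a|b|c|d', '|', 2))  # ['a', 'b', 'c|d']
```

The indexes only ever move to the right, and maxcount does nothing except stop the outer loop early.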

And in stringlib_rsplit_char (used by str.rsplit), both indexes i and j start at the end of the string and are decremented:

i = j = str_len - 1;
while ((i >= 0) && (maxcount-- > 0)) {
    for(; i >= 0; i--) {
        if (str[i] == ch) {
            SPLIT_ADD(str, i + 1, j + 1);
            j = i = i - 1;
            break;
        }
    }