According to the Python 2.7 docs, calling str.split() with maxsplit specified splits a string at most maxsplit times.
However, the docs never explicitly state that these splits are performed left to right. There is a related function, str.rsplit(), that guarantees right-to-left split ordering.
Aside from reversing the string and then calling str.rsplit(), is there any way to guarantee a left-to-right splitting order? Are there any situations where str.split() will NOT use a left-to-right order?
If you're looking for a guarantee that splitting with the maxsplit argument proceeds left to right, you only need to look at the built-in Python test suite.
Here's an excerpt:
self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|')
self.checkequal(['a|b|c|d'], 'a|b|c|d', 'split', '|', 0)
self.checkequal(['a', 'b|c|d'], 'a|b|c|d', 'split', '|', 1)
self.checkequal(['a', 'b', 'c|d'], 'a|b|c|d', 'split', '|', 2)
self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|', 3)
self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|', 4)
self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|',
                sys.maxsize-2)
self.checkequal(['a|b|c|d'], 'a|b|c|d', 'split', '|', 0)
self.checkequal(['a', '', 'b||c||d'], 'a||b||c||d', 'split', '|', 2)
self.checkequal(['abcd'], 'abcd', 'split', '|')
self.checkequal([''], '', 'split', '|')
self.checkequal(['endcase ', ''], 'endcase |', 'split', '|')
self.checkequal(['', ' startcase'], '| startcase', 'split', '|')
self.checkequal(['', 'bothcase', ''], '|bothcase|', 'split', '|')
self.checkequal(['a', '', 'b\x00c\x00d'], 'a\x00\x00b\x00c\x00d', 'split', '\x00', 2)
These tests pin the behavior down: with maxsplit given, the separators consumed are always the leftmost ones, so any implementation that split in a different order would fail them.
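The same behavior is easy to check interactively. Here is a minimal sketch that reuses the 'a|b|c|d' string from the tests and compares str.split() with str.rsplit(); the variable names are only illustrative:

```python
# With maxsplit given, str.split() consumes the leftmost separators first,
# while str.rsplit() consumes the rightmost ones.
s = 'a|b|c|d'

left_1 = s.split('|', 1)    # one split, taken from the left
left_2 = s.split('|', 2)    # two splits, taken from the left
right_1 = s.rsplit('|', 1)  # one split, taken from the right

print(left_1)   # ['a', 'b|c|d']
print(left_2)   # ['a', 'b', 'c|d']
print(right_1)  # ['a|b|c', 'd']
```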
CPython is considered the reference implementation of Python, and according to the CPython source code, str.split splits in left-to-right order. You can look up how str.split is implemented here: http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup
For example, in stringlib_split_char (as well as in stringlib_split_whitespace; both are used in stringlib_split, i.e. str.split) one can clearly see that the string is processed from left to right. The indexes i and j both start at zero and are only ever incremented. maxsplit (maxcount in the C code) does not affect how the indexes are treated; it only provides an early exit from the loop:
Py_LOCAL_INLINE(PyObject *)
stringlib_split_char(PyObject* str_obj,
                     const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                     const STRINGLIB_CHAR ch,
                     Py_ssize_t maxcount)
{
    // ... some code omitted
    i = j = 0;
    while ((j < str_len) && (maxcount-- > 0)) {
        for(; j < str_len; j++) {
            /* I found that using memchr makes no difference */
            if (str[j] == ch) {
                SPLIT_ADD(str, i, j);
                i = j = j + 1;
                break;
            }
        }
    }
    // ... some code omitted
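For readers who do not want to read C, the loop above can be rendered roughly in Python. This is only an illustrative sketch, not CPython's code: the function name split_char_sketch is made up, the handling of a negative maxcount is an assumption, and the final append stands in for the omitted C code that adds the remainder of the string.

```python
def split_char_sketch(s, ch, maxcount=-1):
    """Illustrative Python rendering of the left-to-right scan (not CPython's code)."""
    if maxcount < 0:
        maxcount = len(s)  # assumption: treat a negative count as "no limit"
    parts = []
    i = j = 0  # both indexes start at the left end and only move right
    while j < len(s) and maxcount > 0:
        maxcount -= 1  # maxcount only limits how many times we enter the scan
        while j < len(s):
            if s[j] == ch:
                parts.append(s[i:j])  # corresponds to SPLIT_ADD(str, i, j)
                i = j = j + 1
                break
            j += 1
    # Assumption: the omitted C code appends whatever is left; do the same here.
    parts.append(s[i:])
    return parts


print(split_char_sketch('a|b|c|d', '|', 2))  # ['a', 'b', 'c|d']
```

Because i and j never move backwards, a smaller maxcount can only stop the scan earlier; it cannot change which separators are found first.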
And in stringlib_rsplit_char (used in str.rsplit), both the i and j indexes start at the end of the string and are decremented:
    i = j = str_len - 1;
    while ((i >= 0) && (maxcount-- > 0)) {
        for(; i >= 0; i--) {
            if (str[i] == ch) {
                SPLIT_ADD(str, i + 1, j + 1);
                j = i = i - 1;
                break;
            }
        }