I am using split('\n') to get the lines in a string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns ['']. Is there any specific reason for such a difference?
And is there any more convenient way to count lines in a string?
It seems to simply be the way it's supposed to work, according to the documentation:
So, to make it clearer, the split() function implements two different splitting algorithms, and uses the presence of an argument to decide which one to run. This might be because it allows optimizing the one for no arguments more than the one with arguments; I don't know.
.split() without parameters tries to be clever. It splits on any whitespace (tabs, spaces, line feeds, etc.), and as a result it also skips all empty strings.
Essentially, .split() without parameters is used to extract words from a string, as opposed to .split() with parameters, which just takes a string and splits it.
That's the reason for the difference.
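A small sketch of the difference described above; the sample string and variable names are made up for illustration:

```python
# Hypothetical sample text with mixed, repeated whitespace.
text = "  the quick\tbrown\n fox  "

# No argument: split on runs of any whitespace and drop empty strings.
words = text.split()

# Explicit separator: every single space is a cut, so empty strings survive.
pieces = text.split(" ")

print(words)   # ['the', 'quick', 'brown', 'fox']
print(pieces)  # ['', '', 'the', 'quick\tbrown\n', 'fox', '', '']

# The empty-string case from the question follows the same logic.
empty_no_arg = "".split()        # []
empty_with_arg = "".split("\n")  # ['']
```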
And yeah, counting lines by splitting is not an efficient way. Count the number of line feeds, and add one if the string doesn't end with a line feed.
Use count():
To count lines, you can count the number of line breaks:
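As a minimal sketch of this approach, combined with the "add one if the string doesn't end with a line feed" rule mentioned earlier in the thread. The helper name count_lines is hypothetical, and only '\n' is treated as a line break:

```python
def count_lines(text: str) -> int:
    """Count lines by counting line feeds, without building a list."""
    # One line per line feed, plus one more if the last line is unterminated.
    # Under this rule an empty string counts as one (empty) line.
    return text.count("\n") + (0 if text.endswith("\n") else 1)


print(count_lines("Hello, World!"))    # 1
print(count_lines("Hello, World!\n"))  # 1
print(count_lines("a\nb\nc"))          # 3
```

For non-empty text that uses only '\n' as a line break, this should agree with len(text.splitlines()).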
Edit: The other answer with built-in count is more suitable, actually.
Note the last sentence.
To count lines you can simply count how many \n there are:
The last part takes into account a last line that does not end with \n, even though this means that Hello, World! and Hello, World!\n have the same line count (which for me is reasonable); otherwise you can simply add 1 to the count of \n.
The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.
In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.
In contrast, the second mode (with an argument such as \n) will produce the first empty field. Consider that if you had written '\n'.split('\n'), you would get two fields (one split gives you two halves).
The first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:
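A sketch with made-up data (the report contents and variable names are assumptions for illustration):

```python
# Hypothetical column-aligned report; the amount of padding varies per row.
report = """\
widget      blue       12
gadget      red        7
sprocket    green      130
"""

# split() with no argument collapses each run of spaces into one separator.
rows = [line.split() for line in report.splitlines()]

for row in rows:
    print(row)
# ['widget', 'blue', '12']
# ['gadget', 'red', '7']
# ['sprocket', 'green', '130']
```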
The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:
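A sketch with a made-up record (the field values are assumptions for illustration):

```python
# Hypothetical comma-delimited record with two empty fields.
record = "alice,,engineer,,london"

# With an explicit separator, adjacent commas are not merged,
# so each empty field shows up as an empty string.
fields = record.split(",")

print(fields)  # ['alice', '', 'engineer', '', 'london']
```

This only covers plain delimited text. For real CSV files with quoting or embedded commas, the standard library's csv module is likely the better fit.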
Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut gives two pieces. Making two cuts gives three pieces. And so it is with Python's str.split(delimiter) method:
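A short sketch of the rope analogy, using strings made of nothing but commas:

```python
no_cuts = "".split(",")     # no delimiters, one (empty) piece
one_cut = ",".split(",")    # one delimiter, two pieces
two_cuts = ",,".split(",")  # two delimiters, three pieces

print(no_cuts)   # ['']
print(one_cut)   # ['', '']
print(two_cuts)  # ['', '', '']
```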
Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:
The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.
The current API is not without advantages. Consider strings such as:
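As a hypothetical illustration (the two header strings and their names are assumptions), one string is whitespace-aligned and the other comma-delimited:

```python
# Hypothetical header lines: one padded with spaces, one comma-delimited.
ps_aux_header = "USER       PID  %CPU %MEM    VSZ"
patient_header = "name,age,height,weight"

# The same method name covers both cases; only the argument differs.
ps_fields = ps_aux_header.split()
patient_fields = patient_header.split(",")

print(ps_fields)       # ['USER', 'PID', '%CPU', '%MEM', 'VSZ']
print(patient_fields)  # ['name', 'age', 'height', 'weight']
```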
When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".
Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.