I often end up writing a bit of code twice when using a loops. For example, while going over the Udacity computer science course, I wrote the code (for a function to find the most sequentially repeated element):
def longest_repetition(l):
if not l:
return None
most_reps = count = 0
longest = prv = None
for i in l:
if i == prv:
count += 1
else:
if count > most_reps:
longest = prv
most_reps = count
count = 1
prv = i
if count > most_reps:
longest = prv
return longest
In this case, I'm checking twice if the count is greater than the previously most repeated element. This happens both when the current element is different from the last and when I've reached the end of the list.
I've also run into this a few times when parsing a string character by character. There have also been a few times where it's been up to about 5 lines of code. Is this common, or a result of the way I think/code. What should I do?
edit: Similarly, in a contrived string splitting example:
def split_by(string, delimeter):
rtn = []
tmp = ''
for i in string:
if i == delimeter:
if tmp != '':
rtn.append(tmp)
tmp = ''
else:
tmp += i
if tmp != '':
rtn.append(tmp)
return rtn
edit: The exam this was from was written for students of the course who are not expected to have any outside knowledge of Python; only what was taught in the previous units. Although I do have prior experience in Python, I'm trying to adhere to these restrictions to get the most of the course. Things like str.split, lists, and a lot of the fundamentals of Python were taught, but nothing yet on imports - especially not things like groupby. That being said, how should it be written without any of the language features that probably wouldn't be taught in a programming introduction course.
Since you tagged language-agnostic
, I see that you wont be much interested in python-specific stuff you could use to make your code efficient, compact and readable. For the same reason, I am not going to show how beautiful a code can be written in python.
In some of the cases that extra if
at the end can be avoided depending on your algorithm, but most cases it's like "If it exists, it should be significant and/or efficient." I dont know about the how the python interpreter works, but in compiled languages like C/C++/etc. the compiler performs various kinds of loop optimisations, including moving the if-blocks out of a loop if it does the same thing.
I ran and compared the running time of various snippets:
- @JFSebastian - 8.9939801693
- @srgerg - 3.13302302361
- yours - 2.8182990551.
It's not a generalisation that a trailing if
gives you the best time. My point is: just follow your algorithm, and try to optimise it. There's nothing wrong with an if
at the end. Probably alternative solutions are expensive.
About the second example you have put in: The check tmp == ''
is done to ensure only non-empty strings are returned. That actually is a sort of additional condition over your splitting algorithm. In any case, you need an additional rtn.append
after the loop because there's still something beyond the last delimiter. You could always push an if condition inside the loop like if curCharIndex == lastIndex: push items to list
which will execute in every iteration, and its sort of the same case again.
My answer in short:
- Your code is as efficient as your algorithm that you have in mind.
- The
if
s in the end are encountered in many cases -- no need to worry about them, they may be making the code more efficient than alternative approaches without such an if (examples are right here).
- Additionally compilers can also spot and modify/move the blocks around your code.
- If there's a language feature/library that makes your code fast and at the same time readable, use it. (Other answers here point out what python offers :))
Have a look at the implementation of itertools.groupby
which does almost exactly what you want. http://docs.python.org/library/itertools.html#itertools.groupby
Here is the algorithm using said code:
from itertools import groupby
string = "AAABBCCDDDD"
maximum = 0
max_char = ""
for i in groupby(string):
x, xs = i
n = len(list(xs))
if n > maximum:
max_char = x
maximum = n
print max_char
My recommendation for thinking about writing algorithms like this in the future is to try not to do everything in one function. Think about the smaller functions that solve the problem you're trying to solve, such as "grouping each sequence of equal items in a sequence into smaller sequences".
Also of course it doesn't have to be characters in the above algorithm -- it could be anything that is groupable.
Edit: In response to the OP's edit, I figured you wouldn't be allowed to use/know about libraries like itertools in a class setting, but I wasn't suggesting that you should rely on external libraries, but more that you should think about problems by splitting them into smaller subproblems. So in this case you'd implement your own groupby
and use it.
Language-agnostic technique to avoid repeating a condition after a loop is to append sentinel values to the input data e.g., if delimiter
appended to the end of string
then the condition is not necessary in split_by()
. Canonical example: in the linear search algorithm a needle can be appended to a haystack to avoid the end of sequence check.
Another option is to delegate some work to a separate function e.g., one function counts number of repetitions, the other finds maximum as in longest_repetition()
:
from itertools import groupby
def longest_repetition(iterable):
return max(groupby(iterable), key=lambda x: sum(1 for _ in x[1]))[0]
If the repeated code is trivial; it might not be worth the effort.
It is not uncommon to need to re-check a condition at the end of a loop that was also being checked inside the loop. If you're prepared to sacrifice a little efficiency, one way to avoid the repeated check is to over-check for it inside the loop. For example:
def my_longest_repetition(l):
if not l:
return None
most_reps = count = 0
longest = prv = None
for i in l:
count = (count + 1) if i == prv else 1
if count > most_reps:
longest = prv
most_reps = count
prv = i
return longest
This code checks for count > most_reps
more often than it needs to, but avoids the need to check it again after the loop.
Unfortunately this kind of change won't be applicable in all circumstances.
I think there are three general approaches that could help you avoid repeating code at the end of the loop. For all three I'm going to use an example problem slightly different from your own, counting words in a string. Here's a "default" version that, like your code, repeats some logic at the end of the loop:
from collections import Counter
def countWords0(text):
counts = Counter()
word = ""
for c in text.lower():
if c not in "abcdefghijklmnopqrstuvwxyz'-":
if word:
counts[word] += 1
word = ""
else:
word += c
if word:
counts[word] += 1 # repeated code at end of loop
return counts
The first approach is to do (some of) the "end of subsequence" processing after every character, so that the bookkeeping is correct if the sequence ends immediately after that character. In your example, you could eliminate the "else" condition on your and run the code within it every time. (This is sergerg's answer.)
This may not be easy for some kinds of checks though. For counting words, you need to add some extra logic to avoid accumulating cruft from the "partial" subsequences you process. Here's code that does that:
def countWords1(text):
counts = Counter()
word = ""
for c in text.lower():
if c not in "abcdefghijklmnopqrstuvwxyz'-":
word = ""
else:
if word:
counts[word] -= 1 # new extra logic
word += c
counts[word] += 1 # this line was moved from above
return counts + Counter() # more new stuff, to remove crufty zero-count items
The second option would be to append a sentinel value to the end of the sequence which will trigger the desired "end of subsequence" behavior. This can be tricky if you need to avoid the sentinel contaminating your data (especially for things like numbers). For your longest consecutive subsequence problem, you could add any value that is not equal to the last item in the sequence. None
might be a good choice. For my counting words example, a non-word character (such as a newline) will do:
def countWords2(text):
counts = Counter()
word = ""
for c in text.lower() + "\n": # NOTE: added a sentinel to the string!
if c not in "abcdefghijklmnopqrstuvwxyz'-":
if word:
counts[word] += 1
word = ""
else:
word += c
# no need to recheck at the end, since we know we ended with a space
return counts
The third approach is to change the structure of the code to avoid iterating over a sequence that might end unexpectedly. You might use generators to preprocess the sequence, as in the other answers that use groupby
from itertools
. (Of course, the generator functions, if you have to write them yourself, may have similar issues.)
For my word counting example, I can use regular expressions from the re
module to find the words:
from re import finditer
def countWords3(text):
return Counter(match.group() for match in
finditer("[\w'-]+", text.lower()))
Output, when given a suitably Pythonic text (it is the same for all four versions of countWords):
>>> text = """Well, there's egg and bacon; egg sausage and bacon;
egg and spam; egg bacon and spam; egg bacon sausage and spam;
spam bacon sausage and spam; spam egg spam spam bacon and spam;
spam sausage spam spam bacon spam tomato and spam;
spam spam spam egg and spam; spam spam spam spam spam spam
baked beans spam spam spam; or Lobster Thermidor a Crevette
with a mornay sauce served in a Provencale manner with shallots
and aubergines garnished with truffle pate, brandy and with a
fried egg on top and spam."""
>>> countWords0(text)
Counter({'spam': 28, 'and': 12, 'egg': 8, 'bacon': 7, 'sausage': 4, 'a': 4,
'with': 4, 'well': 1, 'lobster': 1, 'manner': 1, 'in': 1, 'top': 1,
'thermidor': 1, "there's": 1, 'truffle': 1, 'provencale': 1,
'sauce': 1, 'brandy': 1, 'pate': 1, 'shallots': 1, 'garnished': 1,
'tomato': 1, 'on': 1, 'baked': 1, 'aubergines': 1, 'mornay': 1,
'beans': 1, 'served': 1, 'fried': 1, 'crevette': 1, 'or': 1})
Iterators provide an nice way to break up loops:
def longest_repetition(l):
i=iter(l)
n=next(i,None)
longest=None
most_reps=0
while n is not None:
p=n
count=0
while p==n:
n=next(i,None)
count+=1
if count>most_reps:
most_reps=count
longest=p
return longest
Many languages have a similar concept.