Beware the Python generators

Generator expressions and list comprehensions in Python are very closely related. After all, each gives you an iterable. However, I really wish that generators came with a "DANGER: Handle with care!" label. The problem is that while the syntax for creating a generator differs from the syntax for creating a list in exactly two characters (parentheses instead of square brackets), generators behave differently in ways that are both subtle and easy to overlook. Let's take a look at some code:

items = get_items()

for x in items:
    print(x)

for x in items:
    print(x)

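Does that code look reasonable to you? It does, until I tell you that get_items() returns a generator. You see, a generator keeps track of its own position internally and cannot be reset or rewound. Once the first loop has run it to the end, the generator is exhausted, so the second loop prints nothing and the items are printed only once. This can be mitigated by convention. Some libraries prefix the function's name with an "i", turning get_items() into iget_items(); itertools.izip() and itertools.imap() in Python 2 follow this pattern. Python 2's built-in xrange(), a lazy counterpart of range(), is another example of signalling laziness through naming (strictly speaking it returns a lazy sequence object rather than a generator, so it can be iterated more than once).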
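Here is a minimal sketch of the exhaustion problem and two common ways around it. The body of get_items() below is a stand-in I made up for illustration; the same behaviour holds for any generator.

```python
def get_items():
    # Hypothetical stand-in: yields three items lazily.
    for name in ("a", "b", "c"):
        yield name

items = get_items()
first_pass = [x for x in items]   # consumes the generator completely
second_pass = [x for x in items]  # the generator is exhausted: nothing left

# Workaround 1: materialize once, then iterate as often as needed.
items_list = list(get_items())

# Workaround 2: call the function again to get a fresh generator.
fresh_pass = [x for x in get_items()]
```

The first workaround costs memory for the whole list; the second repeats whatever work get_items() does. Which one is acceptable depends on the data.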

Let's look at another piece of code:

try:
    numbers = (int(x) for x in line.split(','))
except ValueError:
    numbers = [] # Handle the case where the input is invalid

for num in numbers:
    print(num)

That looks reasonably good, no? Well, generators are lazy; they have to be. The numbers = ... line only defines the generator; not a single call to int() is made at that point. The calls to int() are made during iteration, while the for loop is executing, which is outside the try/except block. Invalid input therefore raises an uncaught ValueError from the for loop, and the except clause never runs. There are several fixes, ranging from using a list comprehension instead to placing the for loop inside the try/except block.
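A minimal sketch of the list comprehension fix, next to the lazy version for comparison. The function names parse_numbers() and parse_numbers_lazy() are hypothetical, chosen only for this example.

```python
def parse_numbers(line):
    # Eager version: the list comprehension calls int() right here,
    # inside the try block, so bad input is caught.
    try:
        return [int(x) for x in line.split(',')]
    except ValueError:
        return []  # Handle the case where the input is invalid

def parse_numbers_lazy(line):
    # Lazy version, mirroring the snippet above: int() has not run yet
    # when the try block ends, so the except clause can never fire.
    try:
        return (int(x) for x in line.split(','))
    except ValueError:
        return []

good = parse_numbers("1,2,3")
bad = parse_numbers("1,x,3")

# With the lazy version, the error surfaces only when someone iterates.
try:
    list(parse_numbers_lazy("1,x,3"))
    lazy_failed_late = False
except ValueError:
    lazy_failed_late = True
```

Note that the eager version throws away the whole line if any single field is invalid. That matches the intent of the original snippet, but it may not be what you want in every case.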

Another difference between generators and containers such as lists, tuples and sets is that generators have no length. Calling len(get_items()) raises a TypeError. This is by design: generators may be infinite, so it does not always make sense to ask what their length is.
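A short sketch of what that looks like, along with two ways to get a count anyway. Again, the body of get_items() is a made-up stand-in.

```python
def get_items():
    # Hypothetical stand-in: a finite generator of five items.
    for i in range(5):
        yield i

try:
    len(get_items())
    length_error = None
except TypeError as exc:
    length_error = exc  # generators do not support len()

# Option 1: count by consuming. The items themselves are gone afterwards.
count = sum(1 for _ in get_items())

# Option 2: materialize, then ask the list. Costs memory for every item.
items = list(get_items())
items_length = len(items)
```

Neither option is safe on an infinite generator: both would run forever.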

I love generators as much as the next guy. However, I think care must be taken when using them. My rules of thumb are:

First, if you are using a generator to optimize for speed: don't. In casual observation they can be a bit faster than lists, but lists are so flipping fast already that unless you are processing millions of items, it will make no difference. The exception to this rule is when you are in fact processing millions of items, or when you routinely create a lot of iterables in a hot loop and your profiler tells you that this is the bottleneck.
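If you want to check this on your own workload rather than take my word for it, a rough sketch with the standard timeit module could look like the following. The workload (doubling and summing N integers) and the value of N are arbitrary assumptions; the numbers will differ between machines and Python versions, so I am not quoting any here.

```python
import timeit

N = 1000  # deliberately small; adjust to resemble your real data

def total_with_list():
    # Builds the full list first, then sums it.
    return sum([i * 2 for i in range(N)])

def total_with_generator():
    # Feeds items to sum() one at a time.
    return sum(i * 2 for i in range(N))

# Run each version 1000 times and report the total elapsed seconds.
list_seconds = timeit.timeit(total_with_list, number=1000)
generator_seconds = timeit.timeit(total_with_generator, number=1000)
print("list: %.4fs  generator: %.4fs" % (list_seconds, generator_seconds))
```

A micro-benchmark like this only tells you about this toy loop. For real code, a profiler run on the actual program is the better guide.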

Second, if you are optimizing for memory usage, use generators only if you have a significant number of records. A list of 100 ints will make little difference. A list of ten million log entries is going to cost you some RAM.
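To get a rough feel for the difference, sys.getsizeof() can be used as in the sketch below. The size N is an arbitrary assumption. Keep in mind that getsizeof() reports only the size of the list object itself (its array of references), not the objects it refers to, so the true cost of the list is higher still.

```python
import sys

N = 1000000  # one million items; arbitrary, for illustration only

as_list = [i for i in range(N)]       # every item exists in memory now
as_generator = (i for i in range(N))  # items are produced on demand

list_bytes = sys.getsizeof(as_list)            # grows with N
generator_bytes = sys.getsizeof(as_generator)  # small, independent of N

print("list: %d bytes  generator: %d bytes" % (list_bytes, generator_bytes))
```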

Third, never return a generator, or any other kind of opaque object, from a library method. Generators should mostly be used for intermediate iterables until a final result is obtained. Avoid the confusion of get_items() returning a generator.
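One way to follow this rule while still using a generator internally is sketched below. The record-cleaning logic and the records parameter are invented for the example; only the shape matters: a lazy function with the "i" prefix, and a plainly named function that returns a list.

```python
def iget_items(records):
    # Lazy variant. The "i" prefix warns callers that this returns a
    # one-shot iterator. (The cleaning logic is hypothetical.)
    for record in records:
        if record:
            yield record.strip()

def get_items(records):
    """Return a list, so callers can iterate repeatedly and call len()."""
    return list(iget_items(records))

items = get_items([" a ", "", "b"])
```

Callers who really need laziness can opt in by calling iget_items() explicitly; everyone else gets a list that behaves the way they expect.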

Lastly, use generators if you must. They are the most convenient way to create infinite iterables, and they do have small speed and large memory advantages over other iterables. If you use them, put in several safeguards: make sure to document, in multiple places, that a generator object is being used; create a convention for what these objects (and the corresponding functions) will be called; and test, test, test. As I mentioned at the beginning of this post: generators should come with a warning label, so that they do not surprise unsuspecting maintainers of your code.
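As a small illustration of the infinite case, here is a sketch of a generator that never ends, consumed safely with itertools.islice(). The name inaturals() is hypothetical and simply follows the "i" prefix convention mentioned earlier.

```python
import itertools

def inaturals():
    """Yield 0, 1, 2, ... forever. Returns a one-shot, infinite generator."""
    n = 0
    while True:
        yield n
        n += 1

# Never call list() or len() on this directly; take a bounded slice instead.
first_five = list(itertools.islice(inaturals(), 5))
```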