Generators in Python

Generators in Python

All new Python programmers stumble upon generators at some point. This is one of the more complex topics in Python. In this article, I will explain what generators are, where they are used, and, most importantly, how to use them.

A brief lecture about evaluating expressions in programming languages

​ Before we can dive into generators in Python, we first must go a bit theoretical, outside of the bounds of Python (or any specific programming language). Let's say we have the piece of code written in pseudo-code:

function square(x):
    return x ^ 2

results = [square(2), square(3), square(4)]

​ How would our program compute the values inside of result? When our program sees a function being called, it instantly calls the function, evaluating the result (the same applies to any expression, for that matter). This seems the most logical way to do an evaluation—calculate the result of an expression instantly at the moment the expression is made. This is what we call "eager" evaluation (which can also be seen as "strict evaluation").

​ Going back to Python for a moment, let's say we want to iterate over this list and print each of the items on the screen. If we rewrite the code above into Python, it would look something like this:

def square(x):
    print(f'Called square with value {x}')
    return x ** 2

results = [square(2), square(3), square(4)]

for result in results:
    print(result)

​ The output of this snippet is:

Called square with value 2
Called square with value 3
Called square with value 4
4
9
16

​ Which is all good, everything works as expected. But what if we wanted to delay the evaluation of square and have it evaluated by the time we need to print it's value, instead of evaluating it when we create the results list ?

​ Let me expand on that - we still want to be able to have an iterable (be it a list or something else), which we can define at point A in the code. We just want to declare that we will want some values evaluated, but we don't want them evaluated until point B in the code.

​ This method of evaluating an expression - not instantly, but at the point we need it - is called lazy evaluation. This evaluation strategy is widely used in functional programming languages like Haskell and also in Python.

Iterator

​ Before going into yield and generators, I want to give a quick refresher on what an iterator is. An iterator is an object which contains a stream of data. It has two important methods - the __iter__ and __next__. The __iter__ method returns the iterator itself, while __next__ returns the next time in the stream. Once __next__ is called, the current item is "consumed", so it cannot be returned by __next__ again. When all of the elements of the iterator have been consumed, a StopIteration exception is raised.

​ Calling __iter__ on a list will return an iterator for that list.

numbers = [4, 24, 59]

iterator = numbers.__iter__()

print(next(iterator))
print(next(iterator))
print(next(iterator))
4
24
59

​ If we call next again, the StopIteration exception will be raised

print(next(iterator))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Custom generator

​ Now that we know what an iterator is, we can see that our lazy evaluation should behave in a similar way - we should get an object, which will return an item only once, raise StopIteration when there are no more items in the stream, and do the evaluation when requested.

​ Forgetting about what Python already has - with this description, we can actually write our own lazy-evaluated iterator.

from typing import Callable

class LazyEvaluatedIterator:
    def __init__(self, expression: Callable, args: list):
        self.__expression = expression
        self.__args = args

    def __iter__(self):
        return self

    def __next__(self):
        return self.__expression(*self.__args)

​ We can use it, with one minor modification - we must call the next method on our new iterator, to call the actual evaluation of the iterator.

results = [LazyEvaluatedIterator(square, [2]),
           LazyEvaluatedIterator(square, [3]),
           LazyEvaluatedIterator(square, [4])]

for result in results:
    print(next(result))
Called square with value 2
4
Called square with value 3
9
Called square with value 4
16

​ Comparing this output to the previous output, we can see that we delayed the calling of the square function.

Yield and simple generators

​ But how can we implement this in Python? Can we modify our existing square function to turn it into a generator?

​ Obviously, yes - we just have to replace the return keyword with the yield keyword. If we call a function that contains yield, the function will return a generator. With that generator, which is also an iterator, we can call next to get the current item in the generator (in our case, the squared number).

def square(x):
    print(f'Called square with value {x}')
    yield x ** 2

results = [square(2), square(3), square(4)]

for result in results:
    print(next(result))
Called square with value 2
4
Called square with value 3
9
Called square with value 4
16

​ We have achieved the same effect as with our custom generator - delaying the evaluation of the expression until next is called. Job accomplished.

​ But, this generator only evaluates a single item - can we have a generator that yields multiple items ? Let's try to put a few yield statements into a function, and see what happens.

def multiple_yields():
    counter = 1
    print("First step")
    yield counter

    counter += 1
    print("Second step")
    yield counter

    counter += 1
    print("Third step")

    yield counter
    counter += 1
    print("Fourth step")

generator = multiple_yields()
print("Calling the generator")
print(next(generator))
print(next(generator))
print(next(generator))
print(next(generator))
Calling the generator
First step
1
Second step
2
Third step
3
Fourth step
Traceback (most recent call last):
  File "/tmp/python_playground/temp.py", line 22, in <module>
    print(next(generator))
StopIteration

​ Many things are happening in this code, and I want to highlight a few things. From the output, we can deduct that the yield statement "pauses" the function's execution until that point - essentially splitting the function into four pieces (I've formatted the code to make this visible). We can also see that the execution of the function starts once we do the first next call (Notice how "Calling the generator" is before "First step"). On the last next call, the code continues from the previous yield onwards. All the code is executed (hence the "Fourth step"); however, as there is no yield until the end of the function, a StopIteration exception is raised.

​ Fun fact - for loops in Python actually do two things - call iter on the iterable, and call next on the resulting object, storing the result in the variable that we put after the for keyword. (The implications of this being, that we can write our own for loop - but I will leave it for a separate article. If you don't want to miss it, subscribe to my newsletter, so you can get a notification once I release it). This also means that we can put our generator in a for loop, and explore it that way.

generator = multiple_yields()

print("Calling the generator")

for item in generator:
    print(item)
Calling the generator
First step
1
Second step
2
Third step
3
Fourth step

​ The only difference, is that we don't see the StopIteration exception.

​ Now, we saw that we can have multiple yield statements in a generator - but how about putting an yield into a for loop? Continuing on the squared theme, let's create a generator that yields the squares of the numbers between 1 and n (where n will be an argument passed to the generator function).

def squares(n: int):
    for x in range(1, n+1):
        yield x, x ** 2

five_squares = squares(5)

for number, square in five_squares:
    print(f'The square of {number} is {square}')
The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25

​ One more example - let's create a generator that iterates over a 2D grid, left to right, top to bottom, starting from (0, 0). We will pass the bounds of the grid to the generator function.

def grid_2d(bounds: tuple[int, int]):
    bound_x, bound_y = bounds
    for x in range(bound_x):
        for y in range(bound_y):
            yield (x, y)

grid_3_by_3 = grid_2d((3, 3))

print(list(grid_3_by_3))
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

Infinite generators?

​ So far, we've had generators that contain either a fixed amount of yield expressions, or a yield in a for-loop - but can we have an infinite generator? One that can return items forever (or at least until we run out of memory or just stop the program)? Let's give it a try, by writing a generator that returns the next Fibonacci number:

def fibonacci():
    previous = 0
    current = 1

    while True:
        next_number = previous + current
        previous, current = current, next_number

        yield next_number
fib = fibonacci()

print(next(fib))
print(next(fib))
print(next(fib))

​ Executing the code above will give us the following result:

1
2
3

​ But okay, what will happen if we put this generator in a for-loop and print the result?

for num in fibonacci():
    print(num)

​ If you execute this code, your screen will be flooded with numbers as the code generates bigger and bigger Fibonacci numbers. The code will continue to do it until you either interrupt it (Ctrl+C) or run out of resources (or turn off your PC). If you try to convert the generator to a list via list(fibonacci()), the code will seemingly hang, while it tries to fill all available memory with Fibonacci numbers. Again, you can interrupt it or wait until Python runs out of memory.

​ Okay, considering that we can't use the entire contents of an infinite generator, why do they exist? Well, we don't have to use all the items (remember, due to lazy evaluation, the items won't be calculated until we need them). For example, we can write a function that returns the first Fibonacci number which is also a square of a number (technically speaking, the first Fibonacci, that is also a square is 1, but we will ignore it).

import math
def fibonacci():
    previous = 0
    current = 1

    while True:
        next_number = previous + current
        previous, current = current, next_number

        yield next_number


def __is_square(number):
    return int(math.sqrt(number)) ** 2 == number


def get_first_fibonacci_square(fibonaccis):
    next(fibonaccis)
    for number in fibonaccis:
        if __is_square(number):
            return number
fibonaccis = fibonacci()
first_square = get_first_fibonacci_square(fibonaccis)

​ The result of this code will be 144 (as we skipped 1). Here, we stop going through the Fibonacci numbers once we see the first square number, so we are not affected by the fact that our generator is infinite. We can also create another infinite generator from this one, that returns only the even ones.

def even_fibonacci(fibonaccis):
    for number in fibonaccis:
        if number % 2 == 0:
            yield number
fibonaccis = fibonacci()

even_fibonaccis = even_fibonacci(fibonaccis)

print(next(even_fibonaccis))
print(next(even_fibonaccis))
print(next(even_fibonaccis))

​ Here, we will get 2, 8 ,and 34, as they are the first three even Fibonacci numbers.

​ The point of this is to show that we can work with an infinite stream of data - something widely used today. We can write our code so that we focus on the current item, without needing to evaluate all of the items. An excellent example of this is working with big files - we can have files that are multiple gigabytes in size and won't fit in memory, so we can't open them and get all of the content at once. In this situation, we take advantage of the lazy evaluation and focus only on the content we actually need.

Generator expressions

​ So far, we've looked at generators created using the yield keyword. There is actually another way to write generators, using "generator expressions". As the name suggests, these are a single expression that returns a generator.

​ The syntax goes as follows: (f(item) for item in items if p(item)). You may notice that it's similar to list comprehensions (if you want to learn about list comprehensions, you can take a look here). The main difference syntactically is the brackets - list comprehensions use [] , while generator expressions use (). The other big difference is the way list comprehensions and generators are evaluated - list comprehensions are eagerly evaluated, while generator expressions are lazily evaluated.

​ Let's create a generator using generator expressions. As this is a single expression, we can have one expression inside - it could be a function call or just a simple expression. We can take our squares example above and turn it into a generator expression:

def squares(n: int):
    for x in range(1, n+1):
        yield x, x ** 2

five_squares = squares(5)

for number, square in five_squares:
    print(f'The square of {number} is {square}')

​ The first step would be to create a generator that returns the first five squares:

five_squares = ((x, x ** 2) for x in range(1, 6))

for number, square in five_squares:
    print(f'The square of {number} is {square}')
The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25

​ Now, it's important to note that generator expressions don't accept arguments directly. We can bypass this by either defining the expression in a function or a lambda:

def squares(n: int):
    return ((x, x ** 2) for x in range(1, n + 1))

five_squares = squares(5)
for number, square in five_squares:
    print(f'The square of {number} is {square}')
squares = lambda n: ((x, x ** 2) for x in range(1, n + 1))

five_squares = squares(5)
for number, square in five_squares:
    print(f'The square of {number} is {square}')

​ Both snippets return the same result:

The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25

Examples

​ Let's look at some examples of generators and their usage in the real world.

​ The most used generator in Python is range. As the name suggests, range generates the numbers in a given range. range uses generators, so we don't load all the numbers in memory.

​ Another example is os.walk. os.walk goes through the directory, top-down (or bottom-up), and yields tuples of the current directory path, the names of directories in the current directory, and the file names of the files under the current directory.

Given the following directory structure:

.
├── file1.txt
├── first
│   ├── fifth
│   ├── file2.txt
│   └── second
│       ├── file3.txt
│       └── file4.txt
├── fourth
├── second
│   └── sixth
│       └── seventh
└── third
    └── file5.txt

8 directories, 5 files

Looking at what os.walk returns, we see it's a generator object:

import os

walker = os.walk('/tmp/example')
print(walker)
<generator object _walk at 0x7fc26b3fa960>

We can call next on it to get the contents of the starting directory.

print(next(walker))
('/tmp/example', ['third', 'fourth', 'first', 'second'], ['file1.txt'])

We see the resulting tuple - the current directory, the directories in it, and the files in the directory. To view all files and directories, we can loop through the generator.

walker = os.walk('/tmp/example')
for dirname, dirnames, filenames in walker:
    print(f'{dirname} contains the following directories: {dirnames}, and the following files: {filenames}')
/tmp/example contains the following directories: ['third', 'fourth', 'first', 'second'], and the following files: ['file1.txt']
/tmp/example/third contains the following directories: [], and the following files: ['file5.txt']
/tmp/example/fourth contains the following directories: [], and the following files: []
/tmp/example/first contains the following directories: ['fifth', 'second'], and the following files: ['file2.txt']
/tmp/example/first/fifth contains the following directories: [], and the following files: []
/tmp/example/first/second contains the following directories: [], and the following files: ['file3.txt', 'file4.txt']
/tmp/example/second contains the following directories: ['sixth'], and the following files: []
/tmp/example/second/sixth contains the following directories: ['seventh'], and the following files: []
/tmp/example/second/sixth/seventh contains the following directories: [], and the following files: []

Conclusion

​ Lazy evaluation is a powerful but often underutilized tool in programming languages. It gives programmers the ability to delay the evaluation of a given item until it's actually needed. With this, we can work with "infinite" collections without the need to evaluate every single item. Python allows lazy evaluation via generators, either using the yield keyword or using generator expressions.

​ I hope you found this article helpful! If you did, make sure to leave a like. If you want to read more Python, click here. If you want to read something shorter, click here. As always, happy coding!