
My Python generator died and all I got was this stupid blog article

couple your generators with context managers!

As part of my AI Powered Search chapters, I’m cleaning up Hello LTR’s Python API to make the code more readable as book examples. A big part of the API is working with search training data, known as judgments: mappings of keywords to documents, along with a grade (4 = relevant, 0 = irrelevant).

You can imagine a CSV file:

keywords,document,grade
Rambo,First Blood,4
Rambo,Rambo III,3
Rambo,Chocolat,0
Batman,Batman Begins,4
Batman,Catwoman,3

I need to scan over large numbers of these triples to gather features for each line (stuff like the keywords’ TF*IDF score against the movie’s description, the movie’s release date, etc.). All of this would ultimately be converted into a complete training set for a machine learning library.
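For concreteness, here’s a rough sketch of what a judgment and parse_row might look like (hypothetical names, not the actual Hello LTR code):

from collections import namedtuple

# Hypothetical judgment record; the real Hello LTR type differs
Judgment = namedtuple('Judgment', ['keywords', 'document', 'grade'])

def parse_row(line):
    """Parse a CSV line like 'Rambo,First Blood,4' into a Judgment."""
    keywords, document, grade = line.strip().split(',')
    return Judgment(keywords, document, int(grade))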

Enter the generators – along with a bug

How to do this?

We might eagerly loop through the file and parse our judgments like so:

def judgments_from_file(f):
    judgments = []
    for line in f:
        judgments.append(parse_row(line))
    return judgments

Then we can:

with open('judgments.txt') as f:
    # Get the judgments from the file...
    judgments = judgments_from_file(f)

And sometime later:

gather_features(judgments)
train_model(judgments)

And this all works hunky-dory. But of course I’d rather not load all that data into memory until I go to use it. I’d rather use a Python generator. Something like:

def judgments_from_file(f):
    for line in f:
        yield parse_row(line)

The generator will now only pull data when we need it.
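A quick sketch to see that laziness in action:

with open('judgments.txt') as f:
    judgments = judgments_from_file(f)  # returns instantly; nothing read yet
    first = next(judgments)  # only now does the generator parse the first line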

However, can you spot the bug we introduced by changing judgments_from_file into a generator?

Here’s a hint: somewhere in gather_features we use the judgments argument:

for j in judgments:
    process(j)

Spot the bug now? judgments is now a generator, not a list. The judgments_from_file code runs lazily, much later than the eager version: only at the for j in judgments: line. The first thing judgments_from_file does is try to loop over the lines in the file (for line in f). And BLAMMO! By then f is a closed file. It’s been closed since we exited the with block much earlier in the code.
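Put together, a minimal reproduction of the crash looks like this:

with open('judgments.txt') as f:
    judgments = judgments_from_file(f)  # nothing has been read yet

# f is closed here; the generator body runs on the first iteration
for j in judgments:  # ValueError: I/O operation on closed file.
    process(j)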

I’ve been bitten by this mistake a few times now, in multiple situations. It’s made me annoyed at Python generators. They seem to betray a concept known as the Uniform Access Principle. To quote Bertrand Meyer:

All services offered by a module should be available through a uniform notation, which does not betray whether they are implemented through storage or through computation.

I feel the consumer shouldn’t need to care if it’s a lazily generated iterator or a list in memory. It should work the same. Of course, the world is never so perfect, and my brain hurts thinking too hard about variable lifetimes and language design.

Context Managers to the Rescue

I realized, after reading Fluent Python, that perhaps I could solve this by tying the lifetime of my judgments directly to the with block. We can create our own Context Managers with custom behavior on entering and exiting the context (the with block).

In other words, instead of

with open('judgments.txt') as f:
    ...

I might more safely do:

with judgments_open('judgments.txt') as judgments:
    gather_features(judgments)
    train_model(judgments)

In this judgments_open function, I can place the judgments variable into the with block. This is the generator from above. Tying it to a with block gives the programmer a solid contract on that variable’s lifetime.

How to do this? Turns out it’s pretty easy! With a little wrapper over the judgments_from_file generator, aided by the contextmanager decorator from contextlib:

from contextlib import contextmanager

@contextmanager
def judgments_open(path):
    """Read judgments from the filesystem."""
    f = open(path, 'r')  # open before the try, so f always exists in the finally
    try:
        yield judgments_from_file(f)  # <- 'with' runs to here; this becomes the var tied to the with block
    finally:
        f.close()  # <- runs after the 'with' context exits (or there's an exception)

On the with keyword, judgments_open runs up to the yield, handing the return value of judgments_from_file to the context (this becomes the judgments variable in with judgments_open(..) as judgments). Then, when we’re all done with the with block (or an exception is raised), the rest of judgments_open runs, closing the file.
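If the decorator feels magical, it behaves roughly like this hand-rolled class-based context manager (a sketch for illustration, not contextlib’s actual implementation):

class JudgmentsOpen:
    def __init__(self, path):
        self.path = path

    def __enter__(self):
        # Runs at the top of the 'with' block
        self.f = open(self.path, 'r')
        return judgments_from_file(self.f)  # becomes the 'as judgments' variable

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the 'with' block exits, normally or via exception
        self.f.close()
        return False  # don't swallow exceptions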

Now of course, I could still do something intentionally stupid, like saving off a reference to judgments and doing further work with it after the block. But I’d know I was shooting myself in the foot, as its lifetime is clearly scoped to the with block.

The Lesson: Lazy Generators <3 Context Managers

Our data is often lazily generated from a source – a file, a socket, or a database. Under the hood, these sources have their own lifetimes to be managed. We can’t just willy-nilly return generators and exchange them for an eagerly generated list. We need to manage the generator’s lifetime, and the easiest way to do that is to always couple your generators with context managers!
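The same pattern carries over to other sources. For instance, here’s a sketch that lazily streams rows out of a SQLite database (rows_open and the judgments table are hypothetical names, the same idea as above):

from contextlib import contextmanager
import sqlite3

@contextmanager
def rows_open(db_path, query):
    """Yield a lazy row iterator whose lifetime is tied to the with block."""
    conn = sqlite3.connect(db_path)
    try:
        yield iter(conn.execute(query))  # sqlite3 cursors yield rows lazily
    finally:
        conn.close()

with rows_open('movies.db', 'SELECT keywords, document, grade FROM judgments') as rows:
    for row in rows:
        process(row)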

In conclusion, next time you want to reach for a generator, think about whether you should also be reaching for a context manager ;).