I’ve recently started using the starmap function in python quite frequently in my analyses. I think it has several advantages including readability, flexibility, and above all, very simple parallelism. In this post I’ll cover what the starmap function is, how to use it in two common cases, and how it becomes trivial to parallelize starmapped code.

The starmap function

The starmap function is available in the itertools core module. It works like this:

from itertools import starmap

def my_add(a, b):
    return a + b

nums = [(1, 2), (3, 4), (5, 6)]

sums = list(starmap(my_add, nums))
# [3, 7, 11]

In other words, it lets you map over a list of tuples with a function that takes more than one argument. It unpacks each tuple into the function call, so that the first element of the tuple is passed as the first parameter, and so forth.

It’s called starmap (probably the best named function in python) because behind the scenes it’s using the splat operator, which is spelled * in python.
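To make the unpacking concrete, here’s the splat operator in an ordinary call, using my_add from above:

args = (1, 2)
my_add(*args)   # the tuple is unpacked: equivalent to my_add(1, 2)
# starmap(my_add, nums) effectively calls my_add(*t) for each tuple t in nums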

It’s worth noting that map and friends have fallen somewhat out of favor in python 3 (reduce, for instance, got relegated to functools) because they are not considered particularly ‘pythonic’. The idiomatic way to code the example above in python is:

sums = [my_add(*n) for n in nums]

Or, for cases like this where there are only two or three elements per tuple:

sums = [my_add(a, b) for a, b in nums]

The primary advantage to using starmap is that it’s trivial to make parallel. The multiprocessing core module has a Pool class, which represents a pool of worker processes. That class has a starmap method which functions like the itertools version, except that different chunks of the list can get processed in parallel.

To spread the computation over three cores, we just need to add a few lines:

from multiprocessing import Pool

with Pool(3) as p:
    sums = p.starmap(my_add, nums)

The secondary reason to use starmap over list comprehensions is aesthetic. For folks coming from a functional programming background, the map family of functions may seem more natural and concise. In my opinion starmap gets rid of a lot of unnecessary syntax. It takes a little while to get used to reading it, especially in the context of dataframes, but it quickly becomes quite intuitive.

Starmapping with Pandas: Apply and Groupby

The pandas dataframe groupby method returns an iterable of tuples, where the first element is the group key (the value that was grouped on) and the second is a dataframe of all rows that fall into that group.
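You can see those (key, dataframe) pairs by iterating over the grouped object directly. Here’s a minimal sketch, assuming a hypothetical dataframe df with a 'team' column:

grouped = df.groupby('team')

for team, team_df in grouped:
    # team is the group key, team_df is a dataframe of that group's rows
    print(team, len(team_df))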

The usual pandas approach is to then apply an aggregating function that, for each column, somehow aggregates all the rows in that group into a single value. Pandas allows you to apply a different function to each column and/or define your own aggregating function.
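With the toy grouped object from above, that might look something like this (hypothetical 'points' and 'rebounds' columns, purely for illustration):

per_team = grouped.agg({
    'points': 'mean',                          # built-in aggregation
    'rebounds': lambda s: s.max() - s.min(),   # custom aggregation
})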

But if you want to do anything more complicated - like using multiple columns in the aggregation step - you’ll need to use apply on the grouped object. Pandas will then call the function you pass to apply on every dataframe in the grouped object.
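Something along these lines, again with hypothetical column names. The function receives each group’s dataframe and is free to combine multiple columns:

def points_per_game(team_df):
    # uses two columns at once, which per-column aggregation can't do
    return team_df['points'].sum() / team_df['game_id'].nunique()

per_team = grouped.apply(points_per_game)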

There are two issues you may encounter with this. First, there’s no native pandas way to parallelize it. Second, pandas decides how to do the ‘combine’ step for you based on the nature of your apply statement. I find some of the decisions pandas makes here to be counterintuitive, and prefer to have control over the combination step. For more information on this, read the documentation on pandas’ take on the split-apply-combine paradigm. Starmap gives you all the flexibility of apply, with more control and parallelism.

Here’s an example. Let’s say you have data about a number of basketball teams and as part of the analysis you want to fit a model to each team separately for comparison. Under the starmap pattern, that might look like this:

from multiprocessing import Pool, cpu_count

teams = basketball_data.groupby('team_name')

def fit_team_model(team_name, team_data):
    # do some fancy model fitting here
    return model

with Pool(cpu_count() - 1) as pool:
    models = pool.starmap(fit_team_model, teams)

I recently encountered a scenario like this. Training each model took about a minute, and I had ~120 groups. I’ve been doing most of my analysis on a machine with 64 cores recently. Writing the code in a way that makes it super easy to parallelize the model-fitting step is the difference between waiting two hours and waiting two minutes every time I change the model.
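And because pool.starmap hands back a plain list (in the same order as the input), the ‘combine’ step is entirely up to you. For example, you could zip the fitted models back up with their team names yourself - a sketch, reusing teams and models from above:

import pandas as pd

team_names = [name for name, _ in teams]   # group keys, in the same order as the models
model_df = pd.DataFrame({'team_name': team_names, 'model': models})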

starmap can also act as a replacement for the apply method of a pandas dataframe. If you need to apply over a single column, it is easy enough to do something like my_df['my_column'].apply(my_func).

If you need to apply over the rows of a dataframe and use multiple columns, you’ll need to do something like my_df.apply(my_func, axis=1), and my_func will be called once with each row as a series object.

Let’s go back to the basketball data example, and say that we want to add a column called 'point_diff'.

With apply, we would do:

def calc_point_diff(row):
    return abs(row.team1_score - row.team2_score)

bball_df['point_diff'] = bball_df.apply(calc_point_diff, axis=1)

With starmap, we would do:

def calc_point_diff(team1_score, team2_score):
    return abs(team1_score - team2_score)

scores = bball_df[['team1_score', 'team2_score']].values
with Pool(cpu_count() - 1) as p:
    bball_df['point_diff'] = p.starmap(calc_point_diff, scores)

The upside to the starmap technique, in addition to it being parallel, is that our function can have a descriptive function signature. Instead of taking a generic row series and accessing its elements throughout the function, we can name the parameters at the start, which makes the function easier to read and use (especially in more complicated settings).

The downside to the starmap pattern is that we have to prepare the scores data to be starmapped over. In this example we select the score columns out into their own dataframe and then get that dataframe as a 2D numpy array from its values attribute. This gives us an array of arrays, where each internal array is made up of the two scores. Each of these gets unpacked into a calc_point_diff call.
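As a toy illustration of what that 2D array looks like and how each row gets unpacked (made-up numbers, purely to show the shape):

import pandas as pd

toy = pd.DataFrame({'team1_score': [101, 95], 'team2_score': [99, 103]})
toy[['team1_score', 'team2_score']].values
# array([[101,  99],
#        [ 95, 103]])
# the row [101, 99] gets unpacked into calc_point_diff(101, 99), and so on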

Conclusion

Since I’ve started using the starmap pattern, I’ve found that I’m writing parallel code much earlier in the analysis process, which makes it easier to quickly iterate on new ideas. I’d highly recommend it.

It goes without saying that all the usual pitfalls of parallelism apply here as well - you have to be careful of race conditions, mutable state, etc. It’s worth noting, though, that parallelizing with starmap on multiple levels generally will not help. If, for example, you parallelize fitting a model to groups as above, it will generally not help to then also parallelize the algorithm that you use to fit the model (unless you still have plenty of cores to spare or are working on a large computing cluster). The good news is that that’s usually much harder anyway.

The last observation I would make is that it’s easiest to apply the starmap pattern when everything you want to starmap over is in a dataframe. I recently heard Jenny Bryan give a talk arguing that it’s easier to manage ‘rectangular’ data, so even things that you wouldn’t traditionally think of as belonging in a single cell of a dataframe - like models that have been fit, or lists - should go into one. While dataframes are not the right structure for every kind of analysis, I have found that keeping things in a dataframe does seem to cause parallel starmap solutions to appear more often than I would otherwise expect.