I’ve recently started using the
starmap function in python quite frequently
in my analyses. I think it has several advantages
including readability, flexibility, and above all, very simple parallelism.
In this post I’ll cover what the starmap function is, how to use it in two
common cases, and how it becomes trivial to parallelize starmapped code.
The starmap function
The starmap function is available in the
itertools core module. It works like this:
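```python
from itertools import starmap

# a minimal sketch: multiply and pairs are invented for illustration
def multiply(a, b):
    return a * b

pairs = [(1, 2), (3, 4), (5, 6)]

# each tuple is unpacked into a call: multiply(1, 2), multiply(3, 4), ...
list(starmap(multiply, pairs))  # [2, 12, 30]
```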
In other words, it lets you
map over a list of tuples with a function that
takes more than one argument. It unpacks each tuple into the function call,
so that the first element of the tuple is passed as the first parameter, the
second as the second parameter, and so on.
It’s called starmap (probably the best named function in python) because behind
the scenes it’s using the splat operator, which is spelled
* in python.
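In terms of the toy example above, starmap(multiply, pairs) behaves roughly like:

```python
# the * splats each tuple t into multiply's argument list
list(multiply(*t) for t in pairs)  # same results as list(starmap(multiply, pairs))
```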
It’s worth noting that map and friends were nearly dropped from python 3
because they are not considered particularly ‘pythonic’. The idiomatic way to
code the example above in python is:
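```python
# reusing the illustrative multiply and pairs from above
[multiply(*t) for t in pairs]  # [2, 12, 30]
```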
Or, for cases like this where there are only two or three elements per tuple:
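```python
# same toy data, unpacking right in the for clause
[multiply(a, b) for a, b in pairs]  # [2, 12, 30]
```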
The primary advantage to using
starmap is that it’s trivial to make parallel.
The multiprocessing core module has a
Pool class, which represents a pool
of worker processes. That class has a
starmap method which functions like the
itertools version, except that different chunks of the list can get
processed in parallel.
To spread the computation over three cores, we just need to add a few lines:
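```python
from multiprocessing import Pool

# the same toy multiply and pairs as before
def multiply(a, b):
    return a * b

pairs = [(1, 2), (3, 4), (5, 6)]

if __name__ == "__main__":
    # Pool(3) starts three worker processes; Pool.starmap splits the
    # list into chunks and processes them in parallel
    with Pool(3) as pool:
        results = pool.starmap(multiply, pairs)  # [2, 12, 30]
```

(The __main__ guard is there because multiprocessing needs to be able to import the mapped function in its worker processes.)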
The secondary reason to use starmap over list comprehensions is aesthetic. For
folks coming from a functional programming background, the
map family of
functions may seem more natural and concise. In my opinion
starmap gets rid
of a lot of unnecessary syntax. It takes a little while to get used to reading
it, especially in the context of dataframes, but it quickly becomes quite
natural.
Starmapping with Pandas: Apply and Groupby
The pandas dataframe
groupby method returns an iterable of tuples, where the
first element is the value of the group and the second is a dataframe of all
rows that fall into that group.
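For instance, with a toy dataframe (the column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "x":     [1, 2, 3],
    "y":     [4.0, 5.0, 6.0],
})

# each iteration yields (group value, sub-dataframe of that group's rows)
for group_value, group_df in df.groupby("group"):
    print(group_value, len(group_df))
```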
The usual pandas approach is to then apply an aggregating function that, for each column, somehow reduces all the rows in that group to a single value. Pandas allows you to apply a different function to each column and/or define your own aggregating function.
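With the toy df above, that might look something like:

```python
# a different built-in aggregation for each column
df.groupby("group").agg({"x": "mean", "y": "max"})

# or your own aggregating function, still one column at a time
df.groupby("group").agg({"x": lambda col: col.max() - col.min()})
```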
But if you want to do anything more complicated - like using
multiple columns in the aggregation step - you’ll need to use
apply on the
grouped object. Pandas will then call the function you pass to apply on every
dataframe in the grouped object.
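For example, an aggregation that needs both x and y at once might look like:

```python
def xy_ratio(group_df):
    # the whole sub-dataframe comes in, so multiple columns are available
    return (group_df["x"] / group_df["y"]).mean()

df.groupby("group").apply(xy_ratio)
```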
There are two issues you may encounter with this. First, there’s no native
pandas way to parallelize it. Second, pandas decides how to do the ‘combine’
step for you based on the nature of your apply statement. I find some of the
decisions pandas makes here to be counterintuitive, and prefer to have control
over the combination step. For more information on this, read the documentation
on pandas’ take on the split-apply-combine
paradigm. Starmap gives you all the flexibility of
apply, with more control over the combine step.
Here’s an example. Let’s say you have data about a number of basketball teams and as part of the analysis you want to fit a model to each team separately for comparison. Under the starmap pattern, that might look like this:
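```python
from multiprocessing import Pool

import pandas as pd
from sklearn.linear_model import LinearRegression

# a stand-in for real data: the columns and values are invented for illustration
games = pd.DataFrame({
    "team":           ["A", "A", "A", "B", "B", "B"],
    "points_scored":  [104, 111, 98, 97, 102, 110],
    "points_allowed": [99, 107, 95, 101, 95, 104],
})

def fit_team_model(team, team_games):
    # team_games is the sub-dataframe of one team's rows; LinearRegression
    # is just a placeholder for whatever model you actually fit
    model = LinearRegression().fit(
        team_games[["points_allowed"]], team_games["points_scored"]
    )
    return team, model

if __name__ == "__main__":
    # groupby yields (team, sub-dataframe) tuples -- exactly the shape
    # starmap expects -- so each team's model is fit in a worker process
    with Pool(3) as pool:
        models = dict(pool.starmap(fit_team_model, games.groupby("team")))
```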
I recently encountered a scenario like this. Training each model took about a minute, and I had ~120 groups. I’ve been doing most of my analysis on a machine with 64 cores recently. Writing the code in a way that makes the model-fitting step trivial to parallelize is the difference between waiting two hours and waiting two minutes every time I change the model.
starmap can also act as a replacement for the
apply method of a pandas dataframe.
If you need to
apply over a single column, it is easy enough to do something like this:
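```python
# with the hypothetical games dataframe from above: the function
# receives one scalar value at a time
games["points_scored"].apply(lambda points: points > 100)
```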
If you need to apply over the rows of a dataframe and use multiple columns,
you’ll need to do something like
my_df.apply(my_func, axis=1), and my_func
will be called once with each row as a series object.
Let’s go back to the basketball data example, and say that we want to add a
new column computed from the two score columns.
With apply, we would do:
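```python
# still using the hypothetical games dataframe from above
def margin_from_row(row):
    # row arrives as a pandas Series; fields are pulled out by name
    return row["points_scored"] - row["points_allowed"]

games["margin"] = games.apply(margin_from_row, axis=1)
```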
With starmap, we would do:
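```python
from itertools import starmap

def margin(points_scored, points_allowed):
    # named parameters instead of a generic row object
    return points_scored - points_allowed

# select the two score columns and get them as a 2D numpy array
scores = games[["points_scored", "points_allowed"]]
games["margin"] = list(starmap(margin, scores.values))
```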
The upside to the
starmap technique, in addition to it being easy to parallelize,
is that our function can have a descriptive function signature. Instead of
taking a generic
row series and accessing its elements throughout the
function, we can name the parameters at the start, which makes the function
easier to read and use (especially in more complicated settings).
The downside to the
starmap pattern is that we have to prepare the scores
data to be starmapped over. In this example we select the score columns out
into their own dataframe and then get that dataframe as a 2D numpy array from
its values attribute. This gives us an array of arrays, where each internal
array is made up of the two scores. Each of these gets unpacked into a call
to our function.
Since I’ve started using the starmap pattern, I’ve found that I’m writing parallel code much earlier in the analysis process, which makes it easier to quickly iterate on new ideas. I’d highly recommend it.
It goes without saying that all the usual pitfalls of parallelism apply here as well - you have to be careful of race conditions, mutable state, etc. It’s worth noting, though, that parallelizing with starmap on multiple levels generally will not help. If, for example, you parallelize fitting a model to groups as above, it will generally not help to then also parallelize the algorithm that you use to fit the model (unless you still have plenty of cores to spare or are working on a large computing cluster). The good news is that that’s usually much harder anyway.
The last observation I would make is that it’s easiest to apply the starmap
pattern when everything you want to starmap over is in a dataframe. I recently
heard Jenny Bryan give a talk
arguing that it’s easier to manage ‘rectangular’ data, so even things that you
wouldn’t traditionally think of as belonging in a single cell of a dataframe - like
models that have been fit, or lists - should go into one. While dataframes are
not the right structure for every kind of analysis, I have found that keeping
things in a dataframe does seem to cause parallel
starmap solutions to
appear more often than I would otherwise expect.