If you work in R and are curious about using Python, but haven’t quite known how to get started, this blog post is for you. I’ve come across quite a few R users who want to try Python, but just haven’t had the time to figure which packages they need to know about and which coding environment to use.
In this post, I’ll walk through very basic versions of everything you need to know to get set up with statistical computing in Python, making direct comparisons to R and the R ecosystem wherever possible.
By the end of this post, you should be up-and-running with a programming environment. Towards that end, I’ll skip over lots of Python language features and packages that you might find useful later on, though where relevant I’ll link out to more extensive documentation. My hope is that this post will provide enough of a foundation that you’ll know where to look to build on that foundation for more advanced use cases.
I’ll assume familiarity with R and basic familiarity with the command line, but no familiarity with Python.
Installing Python (via Anaconda)
The easiest way to get set up with Python1 is to install Anaconda. For our purposes, you can think of Anaconda as a batteries-included distribution of Python. In particular, it includes numpy, scipy, jupyter, and many other useful packages, some of which will make an appearance later on.
To install Anaconda, visit their download page and follow the instructions for your operating system. The installer will take a while to download, and then a while to run (as it's downloading all the relevant packages). At some point near the end, it will ask whether you want it to set up your shell to use Anaconda. Be sure to answer yes. This means that as soon as you open a new terminal window, any command line references to python, jupyter, or any other Anaconda-contained package will get sent to your new Anaconda installation.
Writing Python code in a Jupyter Notebook
Jupyter notebooks2 provide an incredible interactive environment for coding in Python. Without getting into too many details, Jupyter notebooks let you run and edit Python code in your browser. They let you write markdown and LaTeX alongside your code, and they display any graphics you generate. The browser renders them as nice-looking documents, and you can export them to PDF or HTML files. The nearest analog in the R world is probably rmarkdown, with some differences that will become apparent.
To get started, create a new directory for your Python code. Navigate to that directory in your terminal, and type jupyter notebook. This will spin up a new Jupyter notebook server that you can access at localhost:8888 on your machine. You can then create a new notebook on that server. This will create an *.ipynb file in your file system, which represents the notebook you're editing in the browser.
To shut down the Jupyter notebook server, go back over to your terminal and hit Ctrl+C. Below is a GIF in which I go through all these steps and show the *.ipynb file that has been created. You'll note at the end that something pops up in the browser saying the connection failed. That's because we've shut down the Jupyter server, so there's nothing to connect to anymore.
Now you know how to get a jupyter notebook up and running. We’ll encounter more features of the jupyter notebook as we move through the rest of the post.
Python lists
The most basic Python data structure is called a list. Much like in R, a Python list can be composed of elements of different types, and it just represents an ordered collection.
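For instance, here's a small list mixing several types (the values are just for illustration):

```python
# A Python list holds an ordered collection of possibly mixed types.
my_list = [1, "two", 3.0, True]
print(my_list[0])    # 1  (indexing starts at 0)
print(len(my_list))  # 4
```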
I’ve heard R folks say that you can’t vectorize operations in Python, and they may have gotten that idea from trying something like this on lists:
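For instance, multiplying a plain list by a number:

```python
doubled = [1, 2, 3] * 2
print(doubled)  # [1, 2, 3, 1, 2, 3]
```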
What's happening here is that Python is just repeating your list two times, which is quite a reasonable interpretation of the multiplication operator on lists. Incidentally, R's response to trying to multiply a list by a number is an error: non-numeric argument to binary operator.
Okay, okay. But what we really care about is multiplying an R vector by a number, which R handles element-wise.
To vectorize operations in Python, we have to introduce the numpy package. Numpy “arrays” are roughly the equivalent of R vectors.
First, a quick aside. Python is a general purpose programming language. For that reason, it doesn’t automatically load lots of statistical libraries that many Python users won’t ever need (and, likewise, it doesn’t load lots of web-development libraries that you probably wouldn’t use every day). That means we’ll have to load packages to get a lot of the functionality that’s available in R by default.
To load a package in Python, we use an import statement, e.g. import numpy. This is roughly equivalent to loading a package in R with library(...).
In this case, there are two differences worth pointing out. First, when you load a package in R, it dumps all the functions that package exports into your global namespace. In other words, if you load, for example, the MASS library, and you then want to call the glm.nb function, you can specify that the function came from the MASS library with MASS::glm.nb(...). But you don't have to; you can also just call glm.nb(...) on its own.
Python behaves differently. In order to minimize naming collisions, you have to refer to functions in a particular package with the name of that package3. So, for example, if we loaded numpy with import numpy, to get at the array function we'd have to type numpy.array. Because typing long package names gets tedious, you can alias the package with something shorter. In this case, by using import numpy as np, we're telling Python that whenever we type np, we really mean numpy, so we can call numpy's array function with np.array.
All of the packages I’ll reference in this post have developed community-accepted aliases over time, and I’ll use them in this post.
Numpy arrays (like R vectors)
Now that we've loaded the numpy package, we can create numpy arrays. These behave a lot like R vectors in most cases.
To build a simple numpy array, we can just pass a list to numpy’s array function:
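Something like this:

```python
import numpy as np

arr = np.array([1, 2, 3])
print(arr)  # [1 2 3]
```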
Let’s look at just one more case in which numpy arrays behave like R vectors:
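Element-wise arithmetic, for example (my sketch of the behavior):

```python
import numpy as np

arr = np.array([1, 2, 3])
print(arr * 2)    # [2 4 6], element-wise like an R vector
print(arr + arr)  # [2 4 6]
```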
So, generally speaking, when you think you need an R vector, reach for a numpy array.
Many of the numerical functions that are always around in R can be found in numpy. So to take the mean of a numpy array, we can use numpy’s mean function.
The same is true for sample variance, standard deviation, covariance, and so on. If you need to compute anything from arrays, just google "how do I X with numpy".
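A quick sketch of these, with one gotcha worth knowing about:

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0])
print(np.mean(arr))         # 2.5
# numpy's var and std divide by n by default; pass ddof=1 to match
# R's sample variance and sd, which divide by n - 1.
print(np.var(arr, ddof=1))  # 1.666...
print(np.std(arr, ddof=1))
```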
Alternatively, this guide to numpy for R users seems to cover a lot of ground.
A Jupyter aside
Go back to your Jupyter notebook and try some of this out. Boot up your Jupyter notebook server as before (with jupyter notebook), and start running code. You can copy the Python code above into the first cell of the notebook, then hit Shift+Enter to run that piece of code. I'll demonstrate with a GIF below.
Matrices
Matrices in Python are just numpy arrays with 2 dimensions. We can create a matrix by passing numpy's array function a list of lists. Each inner list will be taken to be one row of the matrix.
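For example:

```python
import numpy as np

mat = np.array([[1, 2], [3, 4]])  # each inner list is one row
print(mat)
print(mat.shape)  # (2, 2)
```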
You can multiply matrices with the dot method.
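A sketch of matrix multiplication (the @ operator is an equivalent I'm adding here, available on Python 3.5+):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A.dot(B))  # the matrix product, like R's A %*% B
print(A @ B)     # equivalent syntax on Python 3.5+
```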
Generating sequences
Two numpy functions generate sequences: linspace and arange. These both correspond to R's seq function. linspace accepts a start and a stop parameter, and optionally the length of the array to return; by default, it will return an array of length 50. linspace is like using seq with the length.out parameter. arange corresponds to using seq with just a starting and stopping point, or with the by parameter. One difference to note: arange excludes its stopping point.
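For example:

```python
import numpy as np

# linspace: like seq(0, 1, length.out = 5)
print(np.linspace(0, 1, 5))  # [0.   0.25 0.5  0.75 1.  ]
# arange: like seq(0, 10, by = 2), but the stop value is excluded
print(np.arange(0, 10, 2))   # [0 2 4 6 8]
```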
Plotting with matplotlib
The core plotting library in Python is matplotlib. Let's use it to plot some stuff. Because we haven't learned to read in data yet, let's go back to numpy to generate some data.
We'll use numpy's sin function along with the sequence-generating functions above to plot a sine wave. Almost all, if not all, of the vectorized functions available in base R (sin, exp, log, etc.) are available in numpy as well.
First, we have to import matplotlib. Next, to get the plot to show up inline in your jupyter notebook, we need to type a special incantation as follows:
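The conventional import plus the inline-plotting magic looks like this:

```python
import matplotlib.pyplot as plt

# In the notebook, also run this magic so plots render inline:
# %matplotlib inline
```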
Next, let's generate our x values from 0 to $2 \pi$, and plot both a sine wave and a linear function of those x values.
You can also easily plot scatter plots with plt.scatter and histograms with plt.hist.
If you’re looking for the kind of higher-level plotting interface you may be used to from R, take a look at the seaborn package.
Dataframes with pandas
Python dataframes are implemented in the pandas package, which also provides the easiest way to read data into Python.
The pandas documentation provides a relatively extensive overview of how pandas dataframes compare to R dataframes, and how to accomplish the same things in pandas that you might do with common R packages.
For our purposes, let’s just read in some data. The pandas read_csv function will, as you may expect, read a csv file into a pandas dataframe. In this case, we’ll have it read a csv file from a URL, but it can also of course read a csv file given a path to the file.
We’ll use data from 538’s github: https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/new-top-firstNames.csv
We can read in the data as follows:
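In the post you'd pass the 538 URL from above straight to read_csv; here's the same call on a tiny inline CSV (invented values) so the sketch is self-contained:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("https://raw.githubusercontent.com/...", index_col=0)
csv = io.StringIO(
    ",name,newPerct2013\n"
    "0,Michael,0.0131\n"
    "1,James,0.0098\n"
)
data = pd.read_csv(csv, index_col=0)
print(data)
```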
We use the index_col parameter because the data we're loading already has an index column as its first column, so we don't want pandas to add another one, which it does by default.
The reason we use 0 to indicate the first column instead of 1 is that, unlike R but like almost every other programming language, Python is zero-indexed. This will be relevant any time you want to select data by index.
Great, now we have the data. Let’s say we only want names that more than .5% of the population have. One way to subset the data is to create a mask, as you might do in R.
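A sketch of the mask approach, using a tiny made-up dataframe in place of the 538 data (the newPerct2013 column name comes from the post; the rows are invented):

```python
import pandas as pd

data = pd.DataFrame({"newPerct2013": [0.0131, 0.0042, 0.0075]},
                    index=["Michael", "Robert", "David"])
mask = data["newPerct2013"] > .005
print(data[mask])  # keeps only the rows where the mask is True
```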
The data['newPerct2013'] > .005 expression creates an array of True and False values, and we're then using that array to select only the rows in data that correspond to True values.
The increasingly preferred way to query in pandas, though, is to use the query method:
Now let's say we want to know the total percent of the population that had one of the hundred most popular baby names.
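With the newPerct2013 column, that's just a sum (invented numbers again, for a self-contained sketch):

```python
import pandas as pd

data = pd.DataFrame({"newPerct2013": [0.0131, 0.0042, 0.0075]})
total = data["newPerct2013"].sum()
print(round(total, 4))  # 0.0248
```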
Naturally there are many other things we might want to do with data frames. Apart from the R-oriented guide linked to above, the pandas documentation is really quite good.
The equivalents of statistical functions like qnorm are located in scipy's stats module.
For each of the many supported distributions, this package provides functions to generate random variates, compute the distribution function (cdf), pdf, and log pdf at any point, calculate quantiles, and more. Just like in R, these functions are vectorized.
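For the normal distribution, the mapping from R's functions looks roughly like this:

```python
from scipy import stats

# R's rnorm / dnorm / pnorm / qnorm map onto the rvs / pdf / cdf / ppf
# methods of scipy.stats.norm.
print(stats.norm.cdf(0))       # 0.5, like pnorm(0)
print(stats.norm.ppf(0.975))   # about 1.96, like qnorm(0.975)
print(stats.norm.rvs(size=3))  # three random draws, like rnorm(3)
```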
This holds for just about any distribution, not just the normal distribution. Most distributions in scipy also offer more methods. See, for example, the documentation about the normal distribution.
Finally, many of the statistical models built into R and spread across various R packages can be found in the Python statsmodels package. We'll take a look at a Poisson log-linear model.
First, let’s generate the data.
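A sketch of simulated Poisson data with a log-linear mean (the true parameters, 0.5 and 0.8, are my illustrative choices):

```python
import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 2, size=500)
y = np.random.poisson(np.exp(0.5 + 0.8 * x))  # log-linear mean
```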
Now, we can take a look at our generated data with plt.hist.
Next, let’s load up statsmodels. One thing to note is that unlike R’s glm, statsmodels doesn’t automatically fit a model with an intercept. Fortunately it provides a utility function for adding a constant to a design matrix, which we’ll use to fit a model with an intercept.
Now we can look at the fit with fit.summary(). The fit object has many of the same properties that R's fit objects tend to have, including the deviance, residuals, the model's coefficient estimates, etc. It also has a predict method for predicting outcomes for new observations.
For people coming from R, the statsmodels formula api provides a very R-like experience (including fitting intercepts by default). I’d recommend this guide to transitioning from R to statsmodels.
Some Jupyter tips
Jupyter notebooks will let you view the documentation for a function if you write the function name followed by a question mark and then run the code cell. For example, running np.linspace? will get the documentation for numpy's linspace function to pop up.
Jupyter notebooks also have tab completion, so just tabbing out after typing the name of a package will give you a pretty good overview of which functions are available to you.
If you're more comfortable in the terminal and don't want to mess with the browser at all, you can get a very good Python REPL by typing ipython into the terminal. Everything explored in this post will work there as well.
More generally, there are a tremendous number of resources out there that you can use to go deeper into many of these topics.
Python and R both have a lot to offer to folks who are interested in statistical computing. I hope this post helps lower the Python barrier-to-entry for R developers. My goal was to provide a general overview for which tools might make an R developer feel most at home in a python setting, and provide a basis for exploring the Python ecosystem for statistical computing. If you have any advice on how to make it more effective, or notice anything I got wrong, please get in touch!
You may be aware that there's a divide in the Python ecosystem between Python 2 and Python 3. Up until a few years ago, Python 3 was not universally supported by scientific computing packages. That's no longer the case. At this point, the only reason to use Python 2 is if you have to interact with a legacy system that does not yet support Python 3. For the rest of this post, I'll assume use of Python 3. ↩
Formerly known as IPython notebooks, their name changed when they moved to support more languages. The Jupyter ecosystem does now support R as well, but in practice I haven't seen many R-using statisticians pulled in that direction. ↩
This is not strictly true. It is possible to import all the objects from a package with the syntax from some_package import *, but in almost all cases this is very bad practice. ↩