Getting started with Python for R developers
If you work in R and are curious about using Python, but haven’t quite known how to get started, this blog post is for you. I’ve come across quite a few R users who want to try Python, but just haven’t had the time to figure out which packages they need to know about and which coding environment to use.
In this post, I’ll walk through very basic versions of everything you need to know to get set up with statistical computing in Python, making direct comparisons to R and the R ecosystem wherever possible.
By the end of this post, you should be up and running with a programming environment. Towards that end, I’ll skip over lots of Python language features and packages that you might find useful later on, though where relevant I’ll link out to more extensive documentation. My hope is that this post will provide enough of a foundation that you’ll know where to look to build on that foundation for more advanced use cases.
I’ll assume familiarity with R and basic familiarity with the command line, but no familiarity with Python.
Installing Python (via Anaconda)
The easiest way to get set up with Python^{1} is to install Anaconda. For our purposes, you can think of Anaconda as a batteries-included distribution of Python. In particular, it includes numpy, scipy, jupyter, and many other useful packages, some of which will make an appearance later on.
To install Anaconda, visit their download page and follow the instructions for your operating system. The installer will take a while to download, and then it will take a while to run (as it’s downloading all the relevant packages). At some point near the end, it will ask whether to add Anaconda to your PATH. Be sure to answer yes. This will mean that as soon as you open a new terminal window, any command line references to `python`, `jupyter`, or any other Anaconda-contained package will get sent to your new Anaconda installation.
Writing Python code in a Jupyter Notebook
Jupyter notebooks^{2} provide an incredible interactive environment for coding in python. Without getting into too many details, jupyter notebooks let you run and edit python code in your browser. They allow you to write markdown and latex alongside your code, and display any graphics you generate. The browser renders them as nice-looking documents, and you can export them to PDF or HTML files. The nearest analog in the R world is probably rmarkdown, with some differences that will become apparent.
To get started, create a new directory for your Python code. Navigate to that directory in your terminal, and type `jupyter notebook`. This will spin up a new jupyter notebook server that you can access at `localhost:8888` on your machine. You can then create a new notebook on that server. This will create an `*.ipynb` file in your file system, which represents the notebook you’re editing in the browser.
To shut down the jupyter notebook server, go back over to your terminal and hit `Ctrl-C` twice.
Below is a GIF in which I go through all these steps, and show the `.ipynb` file that has been created.
You’ll note at the end that something pops up in the browser saying that the connection failed. That’s because we’ve shut down the jupyter server, so there’s nothing to connect to anymore.
Now you know how to get a jupyter notebook up and running. We’ll encounter more features of the jupyter notebook as we move through the rest of the post.
Lists
The most basic python data structure is called a `list`. Much like in R, the Python list can be composed of elements of different types, and just represents an ordered collection.
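For instance (a minimal sketch; the values and names here are arbitrary):

```python
# A list is an ordered collection, possibly of mixed types
things = [1, "two", 3.0]
things.append(4)     # lists are mutable
first = things[0]    # indexing starts at 0, unlike R's 1-based indexing
```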
I’ve heard R folks say that you can’t vectorize operations in Python, and they may have gotten that idea from trying something like this on lists:
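Something like this, where multiplying a list by a number does not multiply each element:

```python
# Multiplying a list by 2 repeats the list, rather than acting elementwise
doubled = [1, 2, 3] * 2
```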
What’s happening here is that python is just repeating your list two times, which is quite a reasonable interpretation of the multiplication operator on lists. Incidentally, R’s response to trying to multiply a list by a number is:
Okay, okay. But what we really care about is multiplying an R vector by a number:
To vectorize operations in Python, we have to introduce the numpy package. Numpy “arrays” are roughly the equivalent of R vectors.
Loading packages
First, a quick aside. Python is a general-purpose programming language. For that reason, it doesn’t automatically load lots of statistical libraries that many Python users won’t ever need (and, likewise, it doesn’t load lots of web-development libraries that you probably wouldn’t use every day). That means we’ll have to load packages to get a lot of the functionality that’s available in R by default.
To load a package in Python, we use an import statement:
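For example, to load numpy (one of the packages bundled with Anaconda):

```python
# Load the numpy package into the current session
import numpy
```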
This is roughly equivalent to loading a package in R with `library()`.
In this case, there are two differences worth pointing out. First, when you load a package in R, it dumps all the functions that package exports into your global namespace. In other words, if you load, for example, the `MASS` library, and you then want to call the `glm.nb` function, you can specify that that function came from the `MASS` library with `MASS::glm.nb(...)`. But you don’t have to; you can also just call `glm.nb(...)` on its own.
Python behaves differently. In order to minimize naming collisions, you have to refer to functions in a particular package with the name of that package^{3}. So for example, if we loaded numpy with `import numpy`, to get at the `array` function, we’d have to type `numpy.array(...)`.
Because typing long package names gets tedious, you can alias the package with something shorter. In this case, by using `import numpy as np`, we’re telling python that whenever we type `np`, we really mean `numpy`, so we can call numpy’s array function with `np.array(...)`.
All of the packages I’ll reference in this post have developed community-accepted aliases over time, and I’ll use them in this post.
Numpy arrays (like R vectors)
Now that we’ve loaded the `numpy` package, we can create numpy arrays. These behave a lot like R vectors in most cases.
To build a simple numpy array, we can just pass a list to numpy’s array function:
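A minimal sketch:

```python
import numpy as np

arr = np.array([1, 2, 3])
scaled = arr * 2  # elementwise, like an R vector: array([2, 4, 6])
```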
Let’s look at just one more case in which numpy arrays behave like R vectors:
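Arithmetic between two arrays is also elementwise, just like adding two R vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
total = a + b  # array([11, 22, 33])
```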
So, generally speaking, when you think you need an R vector, reach for a numpy array.
Many of the numerical functions that are always around in R can be found in numpy. So to take the mean of a numpy array, we can use numpy’s mean function.
The same is true for sample variance, standard deviation, calculating covariance, etc. If you need to do anything with calculating values from arrays, just google “How do I X with numpy”.
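A quick sketch of a few of these (the data values are arbitrary); note that numpy’s `std` and `var` default to the population versions, so `ddof=1` is needed to match R’s `sd()` and `var()`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
m = np.mean(x)          # like mean(x) in R
v = np.var(x, ddof=1)   # sample variance, matching R's var()
s = np.std(x, ddof=1)   # sample standard deviation, matching R's sd()
```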
Alternatively, this guide to numpy for R users seems to cover a lot of ground.
A Jupyter aside
Go back to your jupyter notebook and try some of this out. Boot up your jupyter notebook server as before (with `jupyter notebook`), and start running code. You can copy the python code above into the first cell of the notebook, then hit `Shift+Enter` to run that piece of code. I’ll demonstrate with a GIF below.
Matrices
Matrices in python are just numpy arrays with 2 dimensions. We can create a matrix by passing the numpy array function a list of lists. Each inner list will be taken to be one row of the matrix.
You can multiply matrices with the `@` sign.
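For example (values arbitrary):

```python
import numpy as np

# Each inner list becomes one row of the matrix
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
C = A @ B  # matrix multiplication, like A %*% B in R
```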
Generating sequences
Two very useful numpy functions for generating sequences are np.linspace and np.arange.
These both correspond to R’s `seq` function.

`linspace` accepts a start and a stop parameter, and optionally the length of the array to return. By default, it will return an array of length 50. `linspace` is like using `seq` with the `length.out` parameter.
`arange` corresponds to using `seq` with just a starting and stopping point, or with the `by` parameter.
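A quick sketch of both:

```python
import numpy as np

# Five evenly spaced points from 0 to 1, like seq(0, 1, length.out = 5) in R
a = np.linspace(0, 1, 5)

# From 0 up to (but not including) 10, stepping by 2 -- note the exclusive
# stop, unlike seq(0, 10, by = 2) in R, which includes 10
b = np.arange(0, 10, 2)
```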
Plotting
The core plotting library in Python is matplotlib. Let’s use it to plot some stuff. Because we haven’t learned to read in data yet, let’s go back to numpy to generate some data.
We’ll use numpy’s sin function along with the sequence-generating functions above to plot a sine wave. Almost all, if not all, vectorized functions available in base R (sin, exp, log, etc) are available in numpy as well.
First, we have to import matplotlib. Next, to get the plot to show up inline in your jupyter notebook, we need to run a special incantation, `%matplotlib inline`, in a cell.
Next, let’s generate our x values from 0 to $2 \pi$, and plot both a sine wave and a linear function against those x values.
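Putting that together (a sketch; the 100-point grid is an arbitrary choice of mine):

```python
import numpy as np
import matplotlib.pyplot as plt
# In a jupyter notebook, run %matplotlib inline first so plots render in the page

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))  # the sine wave
plt.plot(x, x)          # the linear function y = x
```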
You can also easily plot scatter plots by using `plt.scatter`, and histograms with `plt.hist`.
If you’re looking for the kind of higher-level plotting interface you may be used to from R, take a look at the seaborn package.
Dataframes
Python dataframes are implemented in the pandas package, which also provides the easiest way to read data into python.
The pandas documentation provides a relatively extensive overview of how pandas dataframes compare to R dataframes, and how to accomplish the same things in pandas that you might do with common R packages.
For our purposes, let’s just read in some data. The pandas read_csv function will, as you may expect, read a csv file into a pandas dataframe. In this case, we’ll have it read a csv file from a URL, but it can also of course read a csv file given a path to the file.
We’ll use data from 538’s github: https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/new-top-firstNames.csv
We can read in the data as follows:
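A sketch of the call. To keep the example self-contained, a tiny inline sample (with hypothetical values of mine) stands in for the download; in practice you’d pass the 538 URL above straight to `read_csv`:

```python
import io
import pandas as pd

# In the post this reads straight from the 538 URL above:
#   data = pd.read_csv(url, index_col=0)
# Here, a tiny inline sample keeps the sketch self-contained:
csv = io.StringIO("id,name,newPerct2013\n0,Michael,0.012\n1,James,0.009\n2,Zane,0.001\n")
data = pd.read_csv(csv, index_col=0)
```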
We use the `index_col` parameter because the data we’re loading already has an index column as its first column, so we don’t want pandas to add another one, which it does by default.

The reason we use `0` to indicate the first column instead of `1` is that, unlike R but like almost every other programming language, python is zero-indexed. This will be relevant any time you want to select data by index.
Great, now we have the data. Let’s say we only want names that more than .5% of the population have. One way to subset the data is to create a mask, as you might do in R.
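A sketch, using a small stand-in data frame (hypothetical values of mine) in place of the loaded file:

```python
import pandas as pd

data = pd.DataFrame({"name": ["Michael", "James", "Zane"],
                     "newPerct2013": [0.012, 0.009, 0.001]})
mask = data["newPerct2013"] > .005   # an array of True/False values
popular = data[mask]                 # keeps only the rows where mask is True
```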
Here, the `data['newPerct2013'] > .005` expression is creating an array of True and False values, and we’re then using that to select only the rows in `data` that correspond to True values in that array.
The increasingly preferred way to query in pandas, though, is to use the query method:
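The same subset via `query`, again on a small stand-in data frame:

```python
import pandas as pd

data = pd.DataFrame({"name": ["Michael", "James", "Zane"],
                     "newPerct2013": [0.012, 0.009, 0.001]})
# The condition is passed as a string expression
popular = data.query("newPerct2013 > 0.005")
```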
Now let’s say we want to know the total percent of the population that had one of the hundred most popular baby names.
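That’s just a column sum, sketched here on the small stand-in data frame:

```python
import pandas as pd

data = pd.DataFrame({"name": ["Michael", "James", "Zane"],
                     "newPerct2013": [0.012, 0.009, 0.001]})
total = data["newPerct2013"].sum()  # 0.022 for this stand-in data
```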
Naturally there are many other things we might want to do with data frames. Apart from the R-oriented guide linked to above, the pandas documentation is really quite good.
Statistical functions
The equivalents of statistical functions like `dnorm` and `qnorm` are located in the `scipy.stats` package.
For each of the many supported distributions, this package provides functions to generate random variables, compute the cdf, pdf, and log pdf at any point, calculate quantiles, and more. Just like in R, these functions are vectorized.
This holds for just about any distribution, not just the normal distribution. Most distributions in scipy also offer more methods. See, for example, the documentation about the normal distribution.
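A sketch with the normal distribution: `norm.pdf` plays the role of R’s `dnorm`, `norm.cdf` of `pnorm`, `norm.ppf` of `qnorm`, and `norm.rvs` of `rnorm`:

```python
from scipy.stats import norm

density = norm.pdf(0)       # like dnorm(0)
prob = norm.cdf(0)          # like pnorm(0); gives 0.5
quantile = norm.ppf(0.975)  # like qnorm(0.975); about 1.96
draws = norm.rvs(size=5)    # like rnorm(5)
```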
Statistical models
Finally, many of the statistical models built into R and spread across various packages can be found in the python statsmodels package. We’ll take a look at a Poisson log-linear model.
First, let’s generate the data.
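One way to do it (the coefficients 1.0 and 0.5 here are arbitrary choices of mine, not from the post):

```python
import numpy as np

np.random.seed(0)            # for reproducibility
x = np.linspace(0, 2, 100)
lam = np.exp(1.0 + 0.5 * x)  # log-linear mean: log(lam) = 1 + 0.5 * x
y = np.random.poisson(lam)   # Poisson draws with those means
```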
Now, we can take a look at our generated data with `plt.plot(x, y)`.
Next, let’s load up statsmodels. One thing to note is that unlike R’s glm, statsmodels doesn’t automatically fit a model with an intercept. Fortunately it provides a utility function for adding a constant to a design matrix, which we’ll use to fit a model with an intercept.
Now we can look at the fit with `fit.summary()`. The fit object has many of the same properties as R’s fit objects tend to have, including deviance, residuals, the model’s estimates for the design matrix, etc. It also has a `predict` method for predicting outcomes for new observations.
For people coming from R, the statsmodels formula api provides a very R-like experience (including fitting intercepts by default). I’d recommend this guide to transitioning from R to statsmodels.
Getting help
Jupyter notebooks will let you view the documentation for a function if you write the function name followed by a question mark, and then run the code cell. For example, running `np.linspace?` will get the documentation for numpy’s linspace function to pop up.
Jupyter notebooks also have tab completion, so just tabbing out after typing the name of a package will give you a pretty good overview of which functions are available to you.
If you’re more comfortable in the terminal and don’t want to mess with the browser at all, you can get a very good python repl by typing `ipython` into the terminal. Everything explored in this post will work there as well.
More generally, there are a tremendous number of resources out there that you can use to go deeper into many of these topics.
Conclusion
Python and R both have a lot to offer to folks who are interested in statistical computing. I hope this post helps lower the Python barrier to entry for R developers. My goal was to provide a general overview of which tools might make an R developer feel most at home in a python setting, and provide a basis for exploring the Python ecosystem for statistical computing. If you have any advice on how to make it more effective, or notice anything I got wrong, please get in touch!

You may be aware that there’s a divide in the Python ecosystem between Python 2 and Python 3. Up until a few years ago, Python 3 was not universally supported by scientific computing packages. That’s no longer the case. At this point in time, the only reason to use Python 2 is if you have to interact with a legacy system that does not yet support Python 3. For the rest of this post, I’ll assume use of Python 3. ↩

Formerly known as IPython notebooks, their name changed when they moved to support more languages. The jupyter ecosystem does now support R as well, but in practice I haven’t seen many R-using statisticians pulled in that direction. ↩

This is not strictly true. It is possible to import all the objects from any package with the syntax `from some_package import *`, but in almost all cases this is very bad practice. ↩