If you work in R and are curious about using Python, but haven’t quite known how to get started, this blog post is for you. I’ve come across quite a few R users who want to try Python, but just haven’t had the time to figure out which packages they need to know about and which coding environment to use.

In this post, I’ll walk through very basic versions of everything you need to know to get set up with statistical computing in Python, making direct comparisons to R and the R ecosystem wherever possible.

By the end of this post, you should be up and running with a programming environment. Toward that end, I’ll skip over lots of Python language features and packages that you might find useful later on, though where relevant I’ll link out to more extensive documentation. My hope is that this post will provide enough of a foundation that you’ll know where to look when you’re ready for more advanced use cases.

I’ll assume familiarity with R and basic familiarity with the command line, but no familiarity with Python.

Installing Python (via Anaconda)

The easiest way to get set up with Python[1] is to install Anaconda. For our purposes, you can think of Anaconda as a batteries-included distribution of Python. In particular, it includes numpy, scipy, jupyter, and many other useful packages, some of which will make an appearance later on.

To install Anaconda, visit their download page and follow the instructions for your operating system. The installer will take a while to download, and then it will take a while to run (as it downloads all the relevant packages). At some point near the end, it will ask you:

Do you wish the installer to prepend the Anaconda install location to PATH in
your /root/.bashrc ? [yes|no]

Be sure to answer yes. This will mean that as soon as you open a new terminal window, any command line references to python, jupyter, or any other anaconda-contained package will get sent to your new anaconda installation.

Writing Python code in a Jupyter Notebook

Jupyter notebooks[2] provide an incredible interactive environment for coding in Python. Without getting into too many details, jupyter notebooks let you run and edit Python code in your browser. They allow you to write markdown and LaTeX alongside your code, and they display any graphics you generate. The browser renders them as nice-looking documents, and you can export them to PDF or HTML files. The nearest analog in the R world is probably rmarkdown, with some differences that will become apparent.

To get started, create a new directory for your Python code. Navigate to that directory in your terminal, and type jupyter notebook. This will spin up a new jupyter notebook server that you can access at localhost:8888 on your machine. You can then create a new notebook on that server. This will create an *.ipynb file in your file system, which represents the notebook you’re editing in the browser.

To shut down the jupyter notebook server, go back over to your terminal and hit Ctrl-C twice.

Below is a GIF in which I go through all these steps, and show the .ipynb file that has been created.

[GIF: starting a jupyter notebook server, creating a notebook, and shutting the server down]

You’ll note at the end that something pops up in the browser saying that the connection failed. That’s because we’ve shut down the jupyter server, so there’s nothing to connect to anymore.

Now you know how to get a jupyter notebook up and running. We’ll encounter more features of the jupyter notebook as we move through the rest of the post.

Lists

The most basic Python data structure is called a list. Much like in R, a Python list can be composed of elements of different types, and just represents an ordered collection.
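For example, a list can mix types, and the + operator concatenates two lists (a small illustration, which foreshadows what * does below):

my_list = [1, "two", 3.0]
len(my_list)
# 3
my_list + [4]
# [1, 'two', 3.0, 4]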

I’ve heard R folks say that you can’t vectorize operations in Python, and they may have gotten that idea from trying something like this on lists:

my_list = [1, 2, 3]
my_list * 2
# [1, 2, 3, 1, 2, 3]

What’s happening here is that Python is just repeating your list two times, which is quite a reasonable interpretation of the multiplication operator on lists. Incidentally, R’s response to trying to multiply a list by a number is:

> my_list = list(1, 2, 3)
> my_list * 2
Error in my_list * 2 : non-numeric argument to binary operator

Okay, okay. But what we really care about is multiplying an R vector by a number:

> my_vector = c(1, 2, 3)
> my_vector * 2
[1] 2 4 6

To vectorize operations in Python, we have to introduce the numpy package. Numpy “arrays” are roughly the equivalent of R vectors.

Loading packages

First, a quick aside. Python is a general-purpose programming language. For that reason, it doesn’t automatically load lots of statistical libraries that many Python users won’t ever need (and, likewise, it doesn’t load lots of web-development libraries that you probably wouldn’t use every day). That means we’ll have to load packages to get a lot of the functionality that’s available in R by default.

To load a package in Python, we use an import statement:

import numpy as np

This is roughly equivalent to loading a package in R with:

library(some_package)

In this case, there are two differences worth pointing out. First, when you load a package in R, it dumps all the functions that package exports into your global namespace. In other words, if you load, for example, the MASS library, and you then want to call the glm.nb function, you can specify that that function came from the MASS library with MASS::glm.nb(...). But you don’t have to; you can also just call glm.nb(...) on its own.

Python behaves differently. In order to minimize naming collisions, you have to refer to functions in a particular package with the name of that package[3]. So, for example, if we loaded numpy with import numpy, to get at the array function, we’d have to type numpy.array(...).

Because typing long package names gets tedious, you can alias the package with something shorter. In this case, by using import numpy as np, we’re telling python that whenever we type np, we really mean numpy, so we can call numpy’s array function with np.array(...).
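To see both forms side by side:

import numpy
numpy.array([1, 2, 3])   # the fully qualified name

import numpy as np
np.array([1, 2, 3])      # the same function, via the alias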

All of the packages I’ll reference in this post have developed community-accepted aliases over time, and I’ll use them in this post.

Numpy arrays (like R vectors)

Now that we’ve loaded the numpy package, we can create numpy arrays. These behave a lot like R vectors in most cases.

To build a simple numpy array, we can just pass a list to numpy’s array function:

import numpy as np
my_array = np.array([1, 2, 3])
my_array * 2
# array([2, 4, 6])

Let’s look at just one more case in which numpy arrays behave like R vectors:

first = np.array([1, 2, 3])
second = np.array([10, 11, 12])
first + second
# array([11, 13, 15])

So, generally speaking, when you think you need an R vector, reach for a numpy array.

Many of the numerical functions that are always around in R can be found in numpy. So to take the mean of a numpy array, we can use numpy’s mean function.

np.mean(my_array)
# 2.0

The same is true for sample variance, standard deviation, covariance, etc. If you need to compute anything from arrays, just google “How do I X with numpy”.
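For example (one gotcha worth knowing: numpy’s var and std default to the population versions, so you need ddof=1 to match R’s var and sd):

np.var(my_array, ddof=1)   # sample variance, like var() in R
# 1.0
np.std(my_array, ddof=1)   # sample standard deviation, like sd() in R
# 1.0
np.cov(first, second)      # covariance matrix, like cov() in R
# array([[ 1.,  1.],
#        [ 1.,  1.]])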

Alternatively, this guide to numpy for R users seems to cover a lot of ground.

A Jupyter aside

Go back to your jupyter notebook and try some of this out. Boot up your jupyter notebook server as before (with jupyter notebook), and start running code.

You can copy the python code above into the first cell of the notebook, then hit Shift+Enter to run that piece of code. I’ll demonstrate with a GIF below.

[GIF: running Python code in a jupyter notebook cell]

Matrices

Matrices in Python are just numpy arrays with two dimensions. We can create a matrix by passing the numpy array function a list of lists. Each inner list will be taken to be one row of the matrix.

You can multiply matrices with the @ operator:

first = np.array([[1, 2], [3, 4], [5, 6]])
second = np.array([[1, 2, 3], [4, 5, 6]])
first @ second
# array([[ 9, 12, 15],
#       [19, 26, 33],
#       [29, 40, 51]])
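One thing to keep in mind: just as in R, * on two arrays multiplies elementwise, while @ is the true matrix product (the analogue of R’s %*%). A quick illustration:

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

a * b   # elementwise, like R's *
# array([[ 10,  40],
#        [ 90, 160]])

a @ b   # matrix product, like R's %*%
# array([[ 70, 100],
#        [150, 220]])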

Generating sequences

Two very useful numpy functions for generating sequences are np.linspace and np.arange.

These both correspond to R’s seq function.

linspace accepts a start and a stop parameter, and optionally the length of the array to return. By default, it will return an array of length 50. linspace is like using seq with the length parameter.

np.linspace(0, 1, num=4)
# array([ 0.        ,  0.33333333,  0.66666667,  1.        ])

seq(0, 1, length=4)
#  [1] 0.0000000 0.3333333 0.6666667 1.0000000

arange corresponds to using seq with just a starting and stopping point, or with the by parameter. One difference to note: arange excludes the stopping point, while seq includes it.

np.arange(0, 10)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

seq(0, 10)
# [1]  0  1  2  3  4  5  6  7  8  9 10

np.arange(0, 10, step=3)
# array([0, 3, 6, 9])

seq(0, 10, by=3)
# [1] 0 3 6 9

Plotting

The core plotting library in Python is matplotlib. Let’s use it to plot some stuff. Because we haven’t learned to read in data yet, let’s go back to numpy to generate some data.

We’ll use numpy’s sin function along with the sequence-generating functions above to plot a sine wave. Almost all, if not all, of the vectorized functions available in base R (sin, exp, log, etc.) are available in numpy as well.

First, we have to import matplotlib. Next, to get the plot to show up inline in your jupyter notebook, we need to type a special incantation as follows:

import matplotlib.pyplot as plt
%matplotlib inline

Next, let’s generate our x values from 0 to $2 \pi$, and plot both a sine wave and a linear function against those x values.

x = np.linspace(0, 2 * np.pi)
y = np.sin(x)
plt.plot(x, y)
plt.plot(x, x)

[GIF: generating the plot inline in a jupyter notebook]

You can also easily plot scatter plots by using plt.scatter and histograms with plt.hist.
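For instance, reusing the x and y arrays from above (a quick sketch; the 500 random draws are just for illustration):

plt.scatter(x, y)                        # scatter plot of the values from above
plt.show()

plt.hist(np.random.randn(500), bins=30)  # histogram of 500 standard normal draws
plt.show()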

If you’re looking for the kind of higher-level plotting interface you may be used to from R, take a look at the seaborn package.

Dataframes

Python dataframes are implemented in the pandas package, which also provides the easiest way to read data into python.

The pandas documentation provides a relatively extensive overview of how pandas dataframes compare to R dataframes, and how to accomplish the same things in pandas that you might do with common R packages.

For our purposes, let’s just read in some data. The pandas read_csv function will, as you may expect, read a csv file into a pandas dataframe. In this case, we’ll have it read a csv file from a URL, but it can of course also read a csv file given a path to the file.

We’ll use data from 538’s GitHub: https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/new-top-firstNames.csv

We can read in the data as follows:

import pandas as pd
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/new-top-firstNames.csv"
data = pd.read_csv(url, index_col=0)

We use the index_col parameter because the data we’re loading already has an index column as its first column, so we don’t want pandas to add another one, which it does by default.

The reason we use 0 to indicate the first column instead of 1 is that unlike R, but like almost every other programming language, python is zero indexed. This will be relevant any time you want to select data by index.
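For instance, pandas’ position-based .iloc indexer counts from 0:

data.iloc[0]     # the first row, which R would call row 1
data.iloc[0:3]   # the first three rows; the endpoint (3) is excluded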

Great, now we have the data. Let’s say we only want names that more than .5% of the population have. One way to subset the data is to create a mask, as you might do in R.

data[data['newPerct2013'] > .005]

Here, the data['newPerct2013'] > .005 expression creates a boolean Series (an array of True and False values), and we’re then using it to select only the rows in data that correspond to True values.

The increasingly preferred way to query in pandas, though, is to use the query method:

data.query('newPerct2013 > .005')

Now let’s say we want to know the total percentage of the population that has one of the hundred most popular baby names:

data['newPerct2013'].sum()

[GIF: subsetting and summarizing the data with pandas]

Naturally there are many other things we might want to do with data frames. Apart from the R-oriented guide linked to above, the pandas documentation is really quite good.

Statistical functions

The equivalents of statistical functions like dnorm and qnorm are located in the scipy.stats package.

For each of the many supported distributions, this package provides functions to generate random variates, compute the pdf, log pdf, and cdf at any point, calculate quantiles, and more. Just like in R, these functions are vectorized.

import scipy.stats as st 
st.norm.rvs(5, 10, size=100)  # same as rnorm(100, 5, 10)
st.norm.ppf([.025, .975], 5, 10)  # same as qnorm(c(.025, .975), 5, 10)

x = np.linspace(0, 10)
st.norm.pdf(x, 5, 10)  # same as dnorm(x, 5, 10)
st.norm.cdf(x, 5, 10)  # same as pnorm(x, 5, 10)

This holds for just about any distribution, not just the normal distribution. Most distributions in scipy also offer more methods. See, for example, the documentation about the normal distribution.
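For example, the same pattern with two other distributions:

st.binom.rvs(10, 0.5, size=5)   # like rbinom(5, 10, 0.5)
st.t.ppf(0.975, df=10)          # like qt(0.975, df=10)
# ≈ 2.228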

Statistical models

Finally, many of the statistical models built into R and spread across various packages can be found in the Python statsmodels package. We’ll take a look at a Poisson log-linear model.

First, let’s generate the data.

x = np.linspace(0, 5)
alpha = 5
beta = 0.3
eta = alpha + beta * x
y = st.poisson.rvs(np.exp(eta))

Now, we can take a look at our generated data with plt.plot(x, y).

Next, let’s load up statsmodels. One thing to note is that unlike R’s glm, statsmodels doesn’t automatically fit a model with an intercept. Fortunately it provides a utility function for adding a constant to a design matrix, which we’ll use to fit a model with an intercept.

import statsmodels.api as sm
model = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson())
fit = model.fit()

Now we can look at the fit with fit.summary(). The fit object has many of the same properties as R’s fit objects tend to have, including the deviance, residuals, the model’s parameter estimates, etc. It also has a predict method for predicting outcomes for new observations.
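For example (a quick sketch of the attributes I reach for most often):

fit.params                       # coefficient estimates, like coef(fit) in R
fit.deviance                     # like deviance(fit)
fit.predict(sm.add_constant(x))  # fitted means, like predict(fit, type="response")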

[GIF: fitting and inspecting the model with statsmodels]

For people coming from R, the statsmodels formula api provides a very R-like experience (including fitting intercepts by default). I’d recommend this guide to transitioning from R to statsmodels.
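As a taste, here’s the same Poisson model fit via the formula interface (a minimal sketch, reusing the x and y we generated above):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x': x, 'y': y})
fit = smf.glm('y ~ x', data=df, family=sm.families.Poisson()).fit()
fit.summary()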

Getting help

Jupyter notebooks will let you view the documentation for a function if you write the function name followed by a question mark, and then run the code cell.

For example

np.linspace?

will get the documentation for numpy’s linspace function to pop up.

Jupyter notebooks also have tab completion, so hitting Tab after typing a package name and a period will give you a pretty good overview of which functions are available to you.

If you’re more comfortable in the terminal and don’t want to mess with the browser at all, you can get a very good Python REPL by typing ipython into the terminal. Everything explored in this post will work there as well.

More generally, there are a tremendous number of resources out there that you can use to go deeper into many of these topics.

Conclusion

Python and R both have a lot to offer to folks who are interested in statistical computing. I hope this post helps lower the Python barrier to entry for R developers. My goal was to provide a general overview of the tools that might make an R developer feel most at home in a Python setting, and to provide a basis for exploring the Python ecosystem for statistical computing. If you have any advice on how to make it more effective, or notice anything I got wrong, please get in touch!


  1. You may be aware that there’s a divide in the Python ecosystem between Python 2 and Python 3. Up until a few years ago, Python 3 was not universally supported by scientific computing packages. That’s no longer the case. At this point, the only reason to use Python 2 is if you have to interact with a legacy system that does not yet support Python 3. For the rest of this post, I’ll assume use of Python 3.

  2. Formerly known as IPython notebooks, their name changed when they moved to support more languages. The jupyter ecosystem now supports R as well, but in practice I haven’t seen many R-using statisticians pulled in that direction.

  3. This is not strictly true. It is possible to import all the objects from any package with the syntax from some_package import *, but in almost all cases this is very bad practice.