## Intro to Parallel Processing for Data Analysis

### JHU Biostatistics Computing Club

12 December 2016
## Outline

- Quick overview of CPUs and computer memory
- What is a process?
- Levels of parallelism
- Dangers of parallel code
- Parallel code in practice
## Learning about computers

- Source code (e.g. written in R) is turned into machine code
- CPUs know how to execute operations described by machine code
- CPUs need to access data, which is stored elsewhere on the system (or on a different system)
- CPUs are like a grain mill; computer memory is like the stream
    - They need to work together
## What is a CPU?

![](images/intro_parallel_code/cpu_block.gif)

- Memory access time increases dramatically as you move away from the CPU:
    - Cache (L1, L2, L3)
    - RAM
    - Storage: SSD (faster) / HD (slower)

image source: http://iwats.ecsquest.info/CBX/images/cpu_block.gif
## CPU access to memory

- Thinking about CPU caches becomes relevant when optimizing performance
- Start thinking about the fact that every processor running your parallel code needs access to the relevant data:
    - shared memory
    - sharding data
    - message passing
    - etc.
## What is a core?

- Most consumer machines have a single processor with multiple cores.

![](images/intro_parallel_code/multi_core.jpg)

- Cores on the same processor share the L2 cache; processors share RAM (system memory).
- Unless you're working in a systems language (C, Rust), you mostly don't think about this, but it's good to know when you need performance.

image source: https://i.cmpnet.com/embedded/insights/2007/01-07/IntelTianFig1.jpg
## What is a process?

- A construct of the operating system
- Each process has its own process id (pid), `environment`, and `virtual memory`
    - The `environment` contains variables set by the user or OS that may differ between execution platforms
    - `virtual memory` is an abstraction over RAM and the SSD / HD
- By default, most programs execute as a single process
- A single-threaded process runs on only one core at a time
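A quick way to see these constructs from inside a program, sketched in Python (the standard `os` module exposes both the pid and the environment):

```.py
import os

# Each process has a unique id (pid) assigned by the operating system.
pid = os.getpid()

# The environment is a dictionary of strings inherited from the parent
# process or set by the user, e.g. PATH, HOME on Unix-like systems.
env = dict(os.environ)

print(pid > 0)  # True: pids are positive integers
```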
## What is a thread?

- Processes can contain multiple threads
- Threads, unlike processes, can share memory and other system resources
- A single CPU core alternates between threads:
    - If it switches fast enough, they look like they are running in parallel
- The OS treats a core with hyperthreading as two cores.
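A minimal Python sketch of the sharing point: threads in one process see the very same objects in memory, with no copying.

```.py
import threading

results = []  # one list, shared by every thread in this process

def worker(n):
    # Each thread appends to the same list object; nothing is copied.
    # (list.append is atomic in CPython, so this particular use is safe.)
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9]
```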
## Scheduling

- A whole field of CS is devoted to optimizing OS execution of processes and threads subject to the constraints of machine resources (RAM, cores, etc.)
- Which process goes to which core?
## Scheduling: Example

- Process P1 is executing, then makes an IO call and waits for data from a database. The scheduler swaps in P2.

![](images/intro_parallel_code/swapping.jpg)

image source: https://www.tutorialspoint.com/operating_system/os_process_scheduling.htm
## Processes and Parallelism

- Processes can spawn new processes
    - a key tool for writing parallel programs
- Children inherit their parent's environment and sometimes share resources

![](images/intro_parallel_code/spawn.jpg)

image source: https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/3_Processes.html
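A small Python sketch of both bullets, using `subprocess` to spawn a child interpreter (the `GREETING` variable is a made-up name for illustration): the child gets its own pid but inherits the environment we hand it.

```.py
import os
import subprocess
import sys

# Spawn a child process: a fresh python interpreter that reports its own
# pid and one environment variable. The child's environment is built from
# the parent's, plus a variable we add (GREETING is just for illustration).
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.getpid()); print(os.environ['GREETING'])"],
    capture_output=True, text=True,
    env={**os.environ, "GREETING": "hello from parent"},
)
child_pid, greeting = child.stdout.splitlines()

print(int(child_pid) != os.getpid())  # True: the child has its own pid
print(greeting)                       # hello from parent
```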
## Monitoring system resources with htop

![](images/intro_parallel_code/htop_out.png)

image source: http://www.deonsworld.co.za/2012/12/20/understanding-and-using-htop-monitor-system-resources/
## Levels of Parallelism

- **Embarrassingly Parallel**
    - Easiest, most applicable to data analysis
    - We know the whole problem at the start and can set up our parallel processes accordingly
    - Not suited to live / interactive systems
- Algorithmic parallelism
    - Figuring out how to run a single algorithm on multiple cores
- Hardware-level parallelism
- Distributed systems
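"We know the whole problem at the start" is what makes the embarrassingly parallel case easy: the task list can be split before any worker runs. A sketch (`make_chunks` is a hypothetical helper, not from any library):

```.py
def make_chunks(tasks, n_workers):
    """Split a fully known task list into one chunk per worker, round-robin."""
    return [tasks[i::n_workers] for i in range(n_workers)]

tasks = list(range(10))
chunks = make_chunks(tasks, 3)
print(chunks)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Each chunk could then be handed to a separate process, with no coordination needed while they run.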
## Caveats about parallelism

- The overhead of spawning and managing processes can outweigh the gains for lightweight problems
- Parallelizing on more than one level can lead to contention (oversubscribed cores)
    - E.g. if you are fitting 1e6 models in 64 processes on a 64-core machine, it will only hurt to _also_ parallelize the code that fits an individual model
- Parallelizing at higher levels is (almost) always easier
    - E.g. it is easier to fit many models in parallel than to figure out how to parallelize the fitting of a single model
## Dangers of parallelism

- Race conditions!
    - There is no guarantee about the order in which parallelized code will be executed or completed.
    - If you count on one task finishing before another, you expose yourself to race conditions
    - The hardest class of bugs to debug: not reproducible, dependent on system resources, etc.
- Corrupting shared memory
    - If your processes share objects in RAM (or other physical storage), manipulating memory from one process can disturb the others.
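A classic race, sketched with Python threads: several threads do an unprotected read-modify-write on one shared counter, so updates can be lost whenever a thread is swapped out between its read and its write. The final value varies from run to run, which is exactly why such bugs are so hard to reproduce.

```.py
import threading

counter = 0

def unsafe_increment(n):
    global counter
    for _ in range(n):
        tmp = counter      # read
        counter = tmp + 1  # write: another thread may have bumped counter
                           # in between, and that thread's work is lost

threads = [threading.Thread(target=unsafe_increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Often less than 400000 because of lost updates, and different on every
# run; it can never exceed 400000.
print(0 < counter <= 400_000)  # True
```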
## Solutions to parallelism

- Use immutable data structures
    - If your data structures cannot be changed, your processes cannot corrupt them for each other.
- Use pure functions
    - If your functions have no side effects, it is much harder to encounter race conditions
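The contrast in Python terms (a toy sketch; the function names are made up):

```.py
# Impure: mutates shared state, so concurrent callers can interfere.
totals = {}

def add_impure(key, x):
    totals[key] = totals.get(key, 0) + x

# Pure: the result depends only on the arguments and nothing shared is
# touched, so it is safe to call from many workers at once.
def add_pure(current, x):
    return current + x

# Immutable inputs (a tuple here) cannot be corrupted by any worker.
data = (1, 2, 3)
result = sum(add_pure(0, x) for x in data)
print(result)  # 6
```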
## Embarrassing Parallelism 1.0

- Run your program manually on a number of subsets of your data

```
> python myprogram datafile1.csv &
> python myprogram datafile2.csv &
...
> python myprogram datafile3.csv
```

- Requires sharding your data ahead of time
- Super not fun

## Parallelism 2.0: The Dulcimer Approach ™

- How I program:

image source: http://larkinam.com/Dulcimers.html
## Parallelism 2.0: The Dulcimer Approach ™

- Start with a single thread of execution
- Branch out into multiple processes as needed
- Come back to a single thread
- Cache results often!
    - `pickle` in python
    - csv dumps in anything
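The "cache results often" advice might look like this in Python (`cached` is a hypothetical helper written for this slide): compute a result once, `pickle` it to disk, and load it on later calls instead of recomputing.

```.py
import os
import pickle
import tempfile

def cached(path, compute):
    """Load a result from `path` if it exists; otherwise call compute(),
    save the result with pickle, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "step1.pkl")
    first = cached(path, lambda: [x * x for x in range(5)])  # computed, cached
    again = cached(path, lambda: 1 / 0)  # loaded from disk; compute never runs

print(again)  # [0, 1, 4, 9, 16]
```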
## Parallel R code: clusters

- Major library: `parallel`
- To prepare workers for your parallel tasks, use `makeCluster`

```.R
library(parallel)
cl = makeCluster(detectCores() - 1, type = 'FORK')
```

- Type `'FORK'` creates workers by forking the `R` process.
- Other types let you create a cluster across multiple nodes
## Parallel R code: apply functions

```.R
clusterCall(cl = NULL, fun, ...)
clusterApply(cl = NULL, x, fun, ...)
clusterApplyLB(cl = NULL, x, fun, ...)
clusterEvalQ(cl = NULL, expr)
clusterExport(cl = NULL, varlist, envir = .GlobalEnv)
clusterMap(cl = NULL, fun, ..., MoreArgs = NULL, RECYCLE = TRUE,
           SIMPLIFY = FALSE, USE.NAMES = TRUE,
           .scheduling = c("static", "dynamic"))
clusterSplit(cl = NULL, seq)
parLapply(cl = NULL, X, fun, ...)
parSapply(cl = NULL, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl = NULL, X, MARGIN, FUN, ...)
parRapply(cl = NULL, x, FUN, ...)
parCapply(cl = NULL, x, FUN, ...)
parLapplyLB(cl = NULL, X, fun, ...)
parSapplyLB(cl = NULL, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
```
## Parallel python code: clusters

- Major library: `multiprocessing`
- Use the `Pool` class to prepare workers

```.py
import multiprocessing

pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)
```

- Extensively [documented](https://docs.python.org/3.5/library/multiprocessing.html) on the python site.
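A small end-to-end sketch of a pool in use, with the `'fork'` start method to mirror the `FORK` cluster type on the R side (Unix only; `square` is just a toy task):

```.py
import multiprocessing

def square(x):
    return x * x

# The 'fork' start method creates workers as copies of the current
# process, much like type='FORK' in R's parallel package (Unix only).
ctx = multiprocessing.get_context("fork")

# Leave one core free so the machine stays responsive.
with ctx.Pool(max(1, multiprocessing.cpu_count() - 1)) as pool:
    results = pool.map(square, range(8))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```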
## Examples: python

```.py
def fit_models_by_zipcode(df):
    groups = df.groupby('zip_code')
    with multiprocessing.Pool(7) as pool:
        # groupby yields (zip_code, sub-DataFrame) pairs, so starmap
        # calls fit_model(zip_code, sub_df) once per group
        return pool.starmap(fit_model, groups)
```
## Examples: R

```.R
fit_models_by_zipcode = function(df) {
    zip_codes = unique(df$zip_code)
    groups = vector('list', length(zip_codes))
    for (i in seq_along(zip_codes)) {
        z = zip_codes[i]
        groups[[i]] = list(z, df[df$zip_code == z, ])
    }
    cluster = makeCluster(7, type = 'FORK')
    results = parSapply(cluster, groups, fit_model)
    stopCluster(cluster)
    return(results)
}
```
## More complex cluster management

- Hadoop (a map-reduce implementation)
- Spark (higher level)
- Really only necessary for _big_ data and interactive systems, not so much for analysis
## Further Reading

This is barely the tip of the iceberg.

[Introduction to Parallel Computing](https://computing.llnl.gov/tutorials/parallel_comp/)

[High Performance Python](http://shop.oreilly.com/product/0636920028963.do)