# ParaLearn: A Massively Parallel, Scalable System for Learning Interaction Networks on FPGAs



Christopher W. Fletcher Greg Gibeling Dan Burke John Wawrzynek Narges B. Asadi Eric Glass Karen Sachs Zoey Zhou Wing Wong Garry Nolan



**UC Berkeley** 

**Stanford** 

# Outline

- What is ParaLearn?
  - Introduction: Terminology and objective
  - Motivation: Learning the structure of cell signaling networks
  - Algorithm and architectural overview
- Results
  - Design {scalability, flexibility}
  - End-to-End runtime
- Sources of speedup
- Closing

### Introduction

ParaLearn is a specialized computer for conducting research on interaction networks

- We use software for:
  - 1. Control infrastructure
  - 2. Less computationally intensive steps
- We use hardware (FPGAs) for: Accelerating the algorithm kernel



4 node interaction network: A, B, C, D are **nodes** A and C are **parents** of B {A, C} is the **parent set** of B

3

## **Cell Signaling Networks**

**Goal:** Given flow cytometry 'CyTof' data, learn the structure of cell signaling networks

- Flow Cytometry
  - Data in the form of "raw" quantitative observations
  - Measurement of proteins & other components inside cells

- Cell Signaling Networks
  - Structures that model protein signaling pathways
  - Modeling perturbations to a network can help uncover the cause of human disease



Introduction Approach Architecture Results Optimizations Closing

### Signal Transduction Networks

'STN's: The cell's communication medium



- Carry extra-cellular signals (heat, cold, pressure, etc) throughout the cell
- Span from membrane  $\rightarrow$  nucleus
- Traditional model: Linear/independent protein chains
- Modern model:
  All proteins interact
  in a complex network

### Bayesian Networks



- A basis for comparing Bayesian Structures
- Based on prior belief and observations
  Experimental data

$$P(D,G) = P(G)P(D|G)$$

**Prior probability** 

Graph

**ity** ParaLearn (ICS 2010) Courtesy of Tom Griffiths (U.C. Berkeley)

# Macro Approach

Goal: Determine which network best explains the data

- End-to-end computation
  - (1) Pre-processor: Calculate local scores per parent set
  - (2) "Order Sampler": Determine the high scoring orders (algorithm kernel)
  - (3) "Graph Sampler": Extract graphs from high scoring orders
  - (4) Post-processing: high-level analysis and normalization

- Strategy
  - Implement steps (1) and (4) in software
  - Parallelize (and merge) steps (2) and (3) in hardware

### **Kernel Considerations**

- Learning graph structure is an NP-hard problem
  - Search space grows super-exponentially with the graph's node count
  - Multiple local optima, encoding best-solutions, may exist

| Nodes | Graphs                |
|-------|-----------------------|
| 4     | 453                   |
| 5     | 29281                 |
| 10    | 4.7x10 <sup>17</sup>  |
| 20    | 2.34x10 <sup>72</sup> |

• Order sampler: |order space| < |graph space|



• MCMC sampling, "**Restarts**": jump out of peaks



Micro Approach

Architecture

#### Algorithm

Introduction

Approach

#### **FPGA Implementation**

Results

**Optimizations** 

Closing



Optimizations

Closing

### **FPGA** Core



# Floorplanner

Results

**Optimizations** 

Architecture



Introduction

Approach

Closing

# System Infrastructure

Results

**Optimizations** 

Closing

Interne

Architecture

RCBIOS – Part of GateLib

Introduction

Scalable FPGA  $\leftarrow \rightarrow$  Software communication

Approach

Composed of Verilog, Java, and Apache ANT



Closing

### **FPGA** Array

#### "MCMC Mesh"

Idea: Split larger problems across multiple FPGAs

- \* While maintaining the base design
- Additional Infrastructure
  - (1) Inter-chip ring connections
  - (2) Inter-board Aurora high-speed links
  - (3) **Platform Interconnect Network (PLiN)** built on (1) and (2)



June 2nd





### Scalability

Theme: Given a fixed network, vary hardware resources

#### Performance



#### **CyTof Network**

- 22 nodes
- − 4 indegree  $\rightarrow$  7547 parent sets per node
- \*OPS: "Orders per second"





### Flexibility

#### **Theme:** Given fixed hardware, vary the network



### Scalability + Flexibility

**Theme:** Given different networks, vary hardware resources



### End-to-End Runtime

#### • "Time to graphs" – this includes:

- 1. Pre-processing
- 2. FPGA load time
- 3. FPGA run time (includes order and graph sampling)
- 4. Post-processor (overhead is hidden)
- Over the 22 node CyTof data, varying the number of **restarts**

|                                  | 0 Restarts |      | 50 Restarts |      | 100 Restarts |       |
|----------------------------------|------------|------|-------------|------|--------------|-------|
| Computation Step                 | GPP        | FPGA | GPP         | FPGA | GPP          | FPGA  |
| Pre-processing (Multinomial)     | 185        |      |             | un   |              |       |
| Pre-processing (Linear Gaussian) | 50         |      |             | un   |              |       |
| Load time                        | 0          | 4.5  | un          |      |              |       |
| Order sampler                    | 5.2        | 0.44 | 44.6        | 22.1 | 78.8         | 44.15 |
| Graph sampler                    | 0.85       | 0    | 18.8        | 0    | 31.3         | 0     |
| Total                            | 6.05       | 4.94 | 63.4        | 26.6 | 110.1        | 48.65 |

\* Times in seconds

Closing

# Sources of Speedup & Caching

- Considering FPGA and GPP implementations, which architecture benefits in what ways?
- We have considered...

bitwise optimizations, threading/parallelism, fixed vs. floating point, and caching

Caching

**Insight:** order N  $\rightarrow$  score M

- Used by both GPU & GPP implementations
- Can be made at an order or node granularity
- Caching analytic model
  - DRAM-based order and node caches, modeling: hash function, DRAM latency, hit rate, all-or-nothing node cache hits

| Platform      | OS Time (s) |  |  |
|---------------|-------------|--|--|
| Baseline GPP  | 48500       |  |  |
| Optimized GPP | 44.6        |  |  |
| FPGA 4x       | 22.1        |  |  |

| Setup         | Time (s) |  |  |
|---------------|----------|--|--|
| Baseline FPGA | 22.1     |  |  |
| Order cache   | 21.98    |  |  |
| Node cache    | 8.63     |  |  |
| Both caches   | 8.53     |  |  |
| Speedup       | 2.59x    |  |  |

### Performance Optimizations

- Pre-processing on FPGA
  - (1) "Pre-processing" has become new bottleneck
  - Map "Local score" generation to each FPGA in network
  - Transport "observations" data to FPGAs
  - Insight: Observation files are small, score files are large
- Additional parallelism
  - FPGA load time can overlap with the pre-processing step (hides load time)
  - The MCMC controller can be 'threaded' with multiple restarts (hides result accumulation overhead)

### Closing

This work is an example of how a hand optimized, low-level FPGA design can lead to significant performance and power benefits over conventional processors and GPUs

Algorithm performance is having immediate impact on work done by Biologists who are studying STNs

Our approach's cost: large development and debug time

Our group is currently working on high-level programming abstractions that will ease development effort, ideally without losing efficiency

### Acknowledgements

For making this work possible, the authors thank:

- NIH / Cancer Research Institute Grant #130826-02
- M. Linderman et al.
- Stanford Nolan Lab
- Members of the Berkeley Reconfigurable Computing Group
- UC Berkeley Wireless Research Center (BWRC)
- Contributors to the GateLib research library