Hardware-aware Automated AI for Efficient Deep Learning across Hybrid Deployments: Current Landscape and Future Directions

Dr. Kaoutar El Maghraoui, Principal Research Scientist, AI Hardware Center Testbed Leader, IBM Research AI

March 28, 2021 @FASTPATH 2021



## **The Need for Efficient Deep Learning**

#### The Accelerating complexity of AI Models

The number of parameters in neural network models is increasing on the order of 10x year over year.

#### **GROWING MODEL COMPLEXITY -> RAPIDLY INCREASING COMPUTE**



IBM Research AI / © 2021 IBM Corporation

#### **Unbounded computational demands**

# Training requirements are doubling every 3.5 months



#### AlexNet to AlphaGo Zero: A 300,000x Increase in Compute



#### **Increased Carbon Footprint**

#### Training a single model can emit as much carbon as five cars in their lifetimes



Source: https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/



## Inference at the Edge



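| Segment    | Power           | Compute             | Chip area               | AI cores                                | Accuracy requirement              |
|------------|-----------------|---------------------|-------------------------|-----------------------------------------|-----------------------------------|
| IoT        | 100 mW          | < few 10 GOps       | < few mm<sup>2</sup>    | Single AI core                          | Lower accuracy permissible        |
| Mobile     | 250 mW to < 2 W | < 100's of GOps     | 5-10 mm<sup>2</sup>     | Few AI cores                            | Accuracy important                |
| Automotive | 20-50 W         | 10's-100's of TOps  | 100-250 mm<sup>2</sup>  | Multiple AI cores + custom interconnect | No loss of accuracy is acceptable |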

For inference, across different domains, <u>TOps per watt</u> is the key metric.

Larger model => more memory references => more energy
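As a rough illustration of the TOps-per-watt metric, the sketch below divides throughput by power for two operating points. The specific numbers are illustrative picks from the ranges quoted above (10 GOps at 100 mW for IoT, 100 TOps at 50 W for automotive), not measurements, and the function name is a hypothetical helper.

```python
def tops_per_watt(ops_per_second, power_watts):
    """Energy efficiency in tera-operations per second per watt."""
    return ops_per_second / 1e12 / power_watts


# Illustrative operating points chosen from the ranges on the slide (assumptions).
operating_points = {
    "IoT": (10e9, 0.1),           # 10 GOps at 100 mW
    "Automotive": (100e12, 50.0),  # 100 TOps at 50 W
}

for segment, (ops, watts) in operating_points.items():
    print(f"{segment}: {tops_per_watt(ops, watts):.2f} TOps/W")
```

The point of the metric is that it normalizes across segments whose absolute power budgets differ by orders of magnitude.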

## AI Hardware Trends





McKinsey&Company | Source: Expert interviews; McKinsey analysis

#### AI ASICs are expected to have the biggest growth



#### The optimal compute architecture varies by use case







# Building Efficient Neural Networks



#### Hardware Efficiency

Purpose-built AI hardware

#### Design Efficiency

Automate the design of efficient neural networks

#### Model Efficiency

Design accurate and efficient neural networks

Popular Approaches

#### Pruning Deep Neural Networks



[LeCun et al., NIPS'89] [Han et al., NIPS'15]

#### **Reduced Precision**



J. Choi et al., NeurIPS 2019

#### **Compact Convolution Filters**



#### **Knowledge Distillation**



Hinton et al, arXiv:1503.02531

Building Efficient Neural Networks

## Hardware Efficiency



Purpose-built AI hardware

## IBM Research AI Hardware Center

"IBM invests \$2 Billion in New York Research Hub for AI"

## **Bloomberg**

"IBM Bets \$2B Seeking 1000X AI Hardware Performance Boost"




An ecosystem of enterprise and academic partners

- **Launch date:** February 7, 2019
- **\$2B:** IBM investment to create the Artificial Intelligence Hardware Center
- **\$300M:** New York State investment
- **16 and growing:** members of the IBM Research AI Hardware Center

## **IBM AI Hardware Center Focus**



#### IBM Hybrid Cloud Infrastructure

## IBM Research's Roadmap for AI Hardware

## Hardware Efficiency



Purpose-built AI hardware

Extending performance by 2.5X / year through 2025

Approximate computing principles applied to Digital AI Cores with reduced precision, as well as Analog AI Cores, which could potentially offer another 100x in energy efficiency



T. Gokmen and Y. Vlasov, Frontiers in Neuroscience 10, pp. 333, 2016


#### IBM AI Hardware Center Digital AI Cores: IBM Research is leading in reduced precision scaling with iso accuracy



- Key advancements in reduced precision arithmetic for AI driven by IBM AI Research team.
- First demonstration of 16-bit precision for Deep Learning Training (ICML 2015).
- Demonstration of world's first 8-bit training (NeurIPS 2018, NeurIPS 2019), and world's first 4-bit training (NeurIPS 2020).
- Demonstration of highly accurate 2-bit and 4-bit inference (SysML 2019).
- ISSCC 2021: IBM introduced an AI chip leading in precision scaling (8-bit training and 4-bit inference).

https://www.ibm.com/blogs/research/2021/02/ai-chip-precision-scaling/


## Analog NVM for in-memory compute

Eliminate the von Neumann bottleneck

Perform computation directly in memory

Map DNNs to analog cross-point arrays

NVM materials in array crosspoints to store weights







## Analog AI Fully-connected layer with resistive crossbar array



Multiplication: Ohm's law. Addition: Kirchhoff's law.

Backward pass: $\mathbf{d}^{l-1} = W^T \mathbf{d}^l$

(Figure: RPU crossbar array. Inputs are voltage pulses $V_{in}$ with durations $t_i \propto d_i$; the resulting current is integrated on $C_{int}$ over $t_{meas}$ and read out through an ADC as $V_{out}$.)

(backward only needed for training)

W. Haensch et al., *Proc. IEEE*, **107**(1), 108-122 (2018); Gokmen & Vlasov, *Front. Neurosci.* (2016)
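The crossbar arithmetic above can be sketched numerically. This is a minimal idealized model, assuming noise-free devices and treating each weight as a single signed conductance value (a simplification; the function names are hypothetical and this is not the aihwkit implementation). Each cross-point contributes a current `G[i][j] * v[j]` (Ohm's law), and the currents on a shared wire add up (Kirchhoff's current law). Driving the array from the other side yields the transposed product used in the backward pass.

```python
def crossbar_forward(G, v):
    """Forward pass y = W x: sum the per-device currents along each output row."""
    return [sum(g * vj for g, vj in zip(row, v)) for row in G]


def crossbar_backward(G, d):
    """Backward pass d^{l-1} = W^T d^l: drive the rows, read the columns."""
    n_rows, n_cols = len(G), len(G[0])
    return [sum(G[i][j] * d[i] for i in range(n_rows)) for j in range(n_cols)]


# Idealized conductance matrix standing in for the layer weights W (assumed values).
G = [[1.0, 2.0],
     [3.0, 4.0]]

print(crossbar_forward(G, [1.0, 1.0]))   # row sums of G
print(crossbar_backward(G, [1.0, 1.0]))  # column sums of G
```

Both passes reuse the same stored conductances, which is why no weight movement between memory and compute is needed.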

#### https://analog-ai.mybluemix.net/

## IBM Analog Hardware Acceleration Kit



# Analog AI Hardware Acceleration Toolkit

https://analog-ai.mybluemix.net/




#### **Current Capabilities Include:**

- Simulate analog MACC operation including analog backward/update pass
- Simulate a wide range of analog AI devices and crossbar configurations by using abstract functional models of material characteristics with adjustable parameters
- Abstract device (update) models
- Analog friendly learning rule
- Hardware-aware training for inference capability
- Inference capability with drift and statistical (programming) noise models

#### Install Analog AI Hardware Acceleration Kit

\$ pip install aihwkit

#### Training your Analog Model

```
from torch import Tensor
from torch.nn.functional import mse_loss

# Import the aihwkit constructs.
from aihwkit.nn import AnalogLinear
from aihwkit.optim.analog_sgd import AnalogSGD

# Two training samples with four input features and two target outputs each.
x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])

# Define a network using a single Analog layer.
model = AnalogLinear(4, 2)

# Use the analog-aware stochastic gradient descent optimizer.
opt = AnalogSGD(model.parameters(), lr=0.1)
opt.regroup_param_groups(model)

# Train the network.
for epoch in range(10):
    # Clear gradients left over from the previous step.
    opt.zero_grad()
    pred = model(x)
    loss = mse_loss(pred, y)
    loss.backward()

    opt.step()
    print('Loss error: {:.16f}'.format(loss))
```

# Analog Hardware Acceleration Toolkits

https://analog-ai.mybluemix.net/

#### Open Source Github Library

Released Oct 2020. Targets AI and hardware developers; ecosystem building.







A roadmap of evolving features to grow the open-source Analog AI ecosystem

# Building Efficient Neural Networks



#### Hardware Efficiency

Purpose-built AI hardware: Reduced Precision Digital AI Cores and Analog

#### Model Efficiency

Design accurate and efficient neural networks: pruning, quantization, etc.

#### Design Efficiency

Automate the design of efficient neural networks

## Design Efficiency



Automate the design of efficient neural networks


Automation for Model Synthesis for Emerging and Existing AI Hardware



Hardware-aware Neural Architecture Search is a key step towards democratizing AI

#### AutoML is catching up with human experts


Today's neural networks are extremely complex and keep growing in size.

(Figure: networks plotted against operations (G-Ops), with a parameter-count legend of 65M, 95M, 125M, and 155M. Networks shown include Inception-v4, GoogLeNet, ENet, fd-MobileNet, BN-NIN, ShuffleNet, ResNet-152, VGG-16, and VGG-19. Source: https://culurciello.medium.com/analysis-of-deep-neural-networks-dcf398e71aae)

Hand-crafting neural networks is an expensive, time-consuming, and ad hoc process.

Edge intelligence is inevitable and requires hardware efficiency built into the design of neural networks, which is a hard task for non-hardware experts.

## Key Questions

- What are the key algorithmic components of HW-NAS?
- What search spaces are more friendly to hardware-aware NAS?
- What are the existing challenges and future directions?

**A Comprehensive Survey on Hardware-Aware Neural Architecture Search**

arXiv:2101.09336, Computer Science > Machine Learning, submitted on 22 Jan 2021

Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar, Martin Wistuba, Naigang Wang

Neural Architecture Search (NAS) methods have been growing in popularity. These techniques have been fundamental to automate and speed up the time consuming and error-prone process of synthesizing novel Deep Learning (DL) architectures. NAS has been extensively studied in the past few years. Arguably their most significant impact has been in image classification and object detection tasks where the state of the art results have been obtained. Despite the significant success achieved to date, applying NAS to real-world problems still poses significant challenges and is not widely practical. In general, the synthesized Convolution Neural Network (CNN) architectures are too complex to be deployed in resource-limited platforms, such as IoT, mobile, and embedded systems. One solution growing in popularity is to use multi-objective optimization algorithms in the NAS search strategy by taking into account execution latency, energy consumption, memory footprint, etc. This kind of NAS, called hardware-aware NAS (HW-NAS), makes searching the most efficient architecture more complicated and opens several questions.

In this survey, we provide a detailed review of existing HW-NAS research and categorize them according to four key dimensions: the search space, the search strategy, the acceleration technique, and the hardware cost estimation strategies. We further discuss the challenges and limitations of existing approaches and potential future directions. This is the first survey paper focusing on hardware-aware NAS. We hope it serves as a valuable reference for the various techniques and algorithms discussed and paves the road for future research towards hardware-aware NAS.

Comments: Submitted to Proceedings of IEEE

Collaboration with Université Polytechnique Hauts-de-France (UPHF)

## Hardware-aware NAS (HW-NAS) vs. NAS



## NAS ... a Trending Topic



Hardware-Aware NAS

**All NAS papers** 

Main targeted journals and conferences: NeurIPS, CVPR, ECCV, ACM GECCO, IEEE Access (data collected as of December 2020)

## HW-NAS Trends



Type of Networks considered in HW-NAS



## **General NAS Components**



## General structure of HW-NAS







#### Architecture Search Space

- **Layer-wise Search Space:**
  - Select an operation for each layer from a set of candidates.
- **Cell-based Search Space:**
  - Select the operations in a block and repeat the block.
- **Hierarchical Search Space:**
  - Select the operations in a block and then the order of the blocks.
- **Hyperparameter Search Space:**
  - Fix a macro-architecture and select its architecture hyperparameters.
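To make the difference between the first two search spaces concrete, here is a minimal sketch that samples from each. The candidate operation names are borrowed from the DARTS figure later in this deck; the function names and sizes are hypothetical choices for illustration.

```python
import random

# Candidate operations per position (names taken from the DARTS example; illustrative only).
OPS = ["conv3x3", "conv5x5", "pool3x3", "skip"]


def sample_layerwise(num_layers, rng):
    """Layer-wise: every layer independently picks its own operation."""
    return [rng.choice(OPS) for _ in range(num_layers)]


def sample_cell_based(cell_size, num_repeats, rng):
    """Cell-based: pick the operations of one block, then repeat that block."""
    cell = [rng.choice(OPS) for _ in range(cell_size)]
    return cell * num_repeats


def layerwise_space_size(num_layers):
    return len(OPS) ** num_layers


def cell_space_size(cell_size):
    return len(OPS) ** cell_size


rng = random.Random(0)
print(sample_layerwise(12, rng))
print(sample_cell_based(3, 4, rng))
print(layerwise_space_size(12), cell_space_size(3))
```

For the same 12-layer depth, the layer-wise space is far larger than the cell-based one, but it lets each layer adapt to its own position in the network, which matters when hardware cost differs from layer to layer.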

#### Hardware Search Space

- Template-based: Select from a set of pre-defined templates (e.g., NASAIC selects from a set of ASIC templates)
- Parameter-based: Select from a set of different parameter configurations (e.g., FNAS selects the tiling parameters that suit the FPGA used, DANCE selects the number of PEs and register file size in an FPGA)

## **HW-NAS Search Formulation**

- NAS formulation:

$$\max_{\alpha \in A} f(\alpha, \delta)$$

  where
  - $A$ is the space of all feasible architectures (the search space).
  - The optimization method looks for the architecture $\alpha$ that maximizes the performance metric $f$ for a given dataset $\delta$ ($f$ could simply be the accuracy of the model).
- Constrained optimization:

$$\max_{\alpha \in A} f(\alpha, \delta) \cdot \left[\frac{LAT(\alpha)}{T}\right]^w$$

  $LAT(\alpha)$ is the latency of the model and $T$ is the latency threshold. $w$ is a learnable parameter that controls the effect of the hardware constraint on the global objective function.
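The latency-weighted objective can be sketched in a few lines. This is only an illustration: the candidate accuracies and latencies are made-up numbers, the function name is hypothetical, and $w$ is fixed by hand to a small negative value here so that exceeding the threshold lowers the score.

```python
def hw_aware_score(accuracy, latency_ms, threshold_ms, w=-0.07):
    """Accuracy scaled by (latency / threshold) ** w; w < 0 penalizes slow models."""
    return accuracy * (latency_ms / threshold_ms) ** w


# Hypothetical candidates: name -> (accuracy, measured latency in ms).
candidates = {
    "A": (0.76, 120.0),  # more accurate but 1.5x over the latency threshold
    "B": (0.75, 80.0),   # slightly less accurate, exactly on the threshold
}
THRESHOLD_MS = 80.0

scores = {
    name: hw_aware_score(acc, lat, THRESHOLD_MS)
    for name, (acc, lat) in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

With these numbers the slower candidate loses despite its higher accuracy, which is the trade-off the exponent is meant to express.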

## Single vs. Multi-Objective Optimization



## ProxylessNAS Example

- ProxylessNAS uses a loss function that comprises the cross-entropy (CE) loss and hardware-aware constraints.

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda_1 ||w||^2 + \lambda_2 E[\text{latency}]$$

- The equation above illustrates the loss calculated by the reinforcement learning agent used by ProxylessNAS.
- $\lambda_1$ and $\lambda_2$ are learnable parameters that adjust the effect of efficiency on the overall loss.
- A policy is learned that decides whether to add, remove, or keep a layer, as well as whether to alter its number of filters.



Making latency differentiable by introducing latency regularization loss
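One way to read the expected-latency term is as a probability-weighted sum of per-operator latencies, where the probabilities come from a softmax over architecture parameters. The sketch below follows that reading under stated assumptions: the latency values, parameter values, and function names are hypothetical, and it uses plain Python floats rather than an autograd framework, so it shows the arithmetic only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_params, op_latencies):
    """Sum over layers of the probability-weighted latency of each candidate op."""
    total = 0.0
    for alphas, latencies in zip(arch_params, op_latencies):
        probs = softmax(alphas)
        total += sum(p * lat for p, lat in zip(probs, latencies))
    return total


def total_loss(ce_loss, weights, arch_params, op_latencies, lambda1, lambda2):
    """L = L_CE + lambda1 * ||w||^2 + lambda2 * E[latency]."""
    l2 = sum(w * w for w in weights)
    return ce_loss + lambda1 * l2 + lambda2 * expected_latency(arch_params, op_latencies)


# One layer with two candidate ops of 1 ms and 3 ms (hypothetical), equally likely.
print(expected_latency([[0.0, 0.0]], [[1.0, 3.0]]))
print(total_loss(1.0, [1.0, 2.0], [[0.0, 0.0]], [[1.0, 3.0]], lambda1=0.1, lambda2=0.5))
```

Because the expectation is a smooth function of the architecture parameters, its gradient can be taken alongside the cross-entropy term in a framework that supports automatic differentiation.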

## DNAS: Differentiable Neural Architecture Search



Figure 1: Animation of how DARTS and other weight-sharing methods replace the discrete assignment of one of four operations  $o \in O$  to an edge e with a  $\theta$ -weighted combination of their outputs. At each edge e in the network, the value at input node  $e_{in}$  is passed to each operation in  $O = \{1: \text{Conv } 3 \times 3, 2: \text{Conv } 5 \times 5, 3: \text{Pool } 3 \times 3, 4: \text{Skip Connect}\}$ ; the value at output node  $e_{out}$  will then be the sum of the operation outputs weighted by parameters  $\theta_{e,o} \in [0, 1]$  that satisfy  $\sum_{o \in O} \theta_{e,o} = 1$ .

Image courtesy of DeterminedAI and CMU CSD

The most famous speed-up technique used in NAS, which relies on weight sharing and one-shot models.

Earlier methods used reinforcement learning and required many computational resources.

- 2000 GPU days of reinforcement learning or 3150 GPU days of evolution.
- Not at all feasible

DNAS reduced the search time to a **few GPU days** by relaxing the discrete choice among a fixed set of operations into continuous parameters.
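The continuous relaxation described in the figure caption can be sketched as a softmax-weighted mixture of candidate operations on one edge. In this minimal example the four operations are scalar stand-ins (hypothetical; real candidates would be convolution, pooling, and skip-connection layers on tensors), and the function names are illustrative rather than taken from any DARTS codebase.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Scalar stand-ins for the four candidate operations on an edge (assumptions).
CANDIDATE_OPS = {
    "conv3x3": lambda x: 0.5 * x,
    "conv5x5": lambda x: 2.0 * x,
    "pool3x3": lambda x: x - 1.0,
    "skip": lambda x: x,
}


def mixed_op(x, alphas):
    """Edge output: sum of operation outputs weighted by theta = softmax(alphas)."""
    thetas = softmax(alphas)
    return sum(t * op(x) for t, op in zip(thetas, CANDIDATE_OPS.values()))


def discretize(alphas):
    """After the search, keep only the operation with the largest parameter."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]


print(mixed_op(2.0, [0.0, 0.0, 0.0, 0.0]))
print(discretize([0.1, 2.0, -1.0, 0.3]))
```

During the search, the parameters behind the weights are trained by gradient descent together with the shared network weights; the discrete architecture is read off at the end.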

## Challenges of Targeting Multiple Hardware Devices

- Different devices have different design choices
- Different hardware devices can favor very different network structures under the same hardware-cost metric



Per-layer profiling of MobileNetV3 minimalistic on Pixel 4 CPU (uint8) and Qualcomm Hexagon. The leftmost layer is the input layer and the rightmost is the output layer.

Source: G. Chu et al., "Discovering multi-hardware mobile models via architecture search," 2020.



Kendall rank correlation between real/estimated hw cost between different devices considering the FBNet search space

Source: C. Li et al., "HW-NAS-Bench," ICLR 2021
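For readers unfamiliar with the Kendall rank correlation used in that comparison, the sketch below computes it (the tau-a variant, where tied pairs count as neither concordant nor discordant) for the latencies of the same architectures on two devices. The latency values are made up for illustration and do not come from HW-NAS-Bench.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical latencies (ms) of four architectures on two devices.
device_a = [10.0, 20.0, 30.0, 40.0]
device_b = [12.0, 35.0, 25.0, 50.0]

print(kendall_tau(device_a, device_b))
```

A value near 1 means the two devices rank architectures almost identically, so a cost model from one could stand in for the other; a low value means an architecture tuned for one device may be a poor choice for the other.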

## HW-NAS: Speedup Techniques

- Early Stopping
- Proxy Dataset
- Weight Sharing / Super Network
- Accuracy Predictor
  - Peephole
  - PNAS
  - TAPAS

#### train a **once-for-all** network



Source: H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, "Once-for-all: Train one network and specialize it for efficient deployment," 2020.

# HW Cost Evaluation Techniques

| Method                  | How the method works                                                                                                                                                                        | Hardware cost metric | References                                                     |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|----------------------------------------------------------------|
| Real-time measurements  | The sampled model is executed on the hardware target while searching.                                                                                                                       | Latency              | MNASNet [21], NetAdapt [58], [59], MCUNet [45]                 |
|                         |                                                                                                                                                                                             | Energy               | NetAdapt [58], MONAS [36], [60]                                |
| Lookup table models     | A lookup table is created beforehand and filled with each operator's latency on the targeted hardware. Once the search starts, the system calculates the overall cost from the lookup table. | Latency              | FBNet [16], HotNAS [32]                                        |
| Low-fidelity estimation | Compute a rough estimate using the processing time, the stall time, and the starting time.                                                                                                  | Latency              | FNAS [19], NASCaps [29], [61], [42]                            |
|                         |                                                                                                                                                                                             | Energy               | NASCaps [29]                                                   |
|                         |                                                                                                                                                                                             | Memory footprint     | NASCaps [29]                                                   |
|                         |                                                                                                                                                                                             | Area                 | NASAIC [23]                                                    |
| Prediction model        | Build an ML model to predict the cost using architecture and dataset features.                                                                                                              | Latency              | proxylessNAS [20], NASAIC [23], NeuNets [62], LEMONADE [36]    |
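The lookup-table approach in the table above can be sketched as a dictionary filled once per target device and then summed during the search. The latency values and names below are hypothetical; practical tables are usually keyed by operator configuration and input shape as well, which this sketch omits for brevity.

```python
# Hypothetical per-operator latencies (ms), measured once on the target device.
LATENCY_LUT_MS = {
    "conv3x3": 1.8,
    "conv5x5": 3.1,
    "pool3x3": 0.4,
    "skip": 0.0,
}


def estimate_latency_ms(architecture, lut=LATENCY_LUT_MS):
    """Estimate model latency as the sum of its operators' table entries."""
    return sum(lut[op] for op in architecture)


candidate = ["conv3x3", "pool3x3", "skip", "conv5x5"]
print(f"{estimate_latency_ms(candidate):.1f} ms")
```

The estimate costs only a few dictionary lookups per candidate, so it can be evaluated for every sampled architecture without touching the device, at the price of ignoring effects that are not additive across operators.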

## Key Challenges

- Combinatorial explosion of the search space when considering hardware metrics or efficiency strategies like quantization and pruning.
- High computational cost of the search strategy
- Adding hardware cost further increases the search complexity
- Benchmarking and Reproducibility
  - Lack of HW-NAS benchmarks; HW-NAS-Bench, published at ICLR 2021, is the first hardware-aware benchmark
- Transferability of the AI Models
  - Cell-based vs. layer-wise search spaces
- Transferability of the hardware-aware NAS across multiple platforms
  - Transfer the entire NAS process
  - Specialize the final model
- Most NAS and HW-NAS approaches are limited to computer vision applications.

## HW-NAS Key Takeaways

- HW-NAS is a hard problem that is still not fully solved; no principled approaches exist yet
- Largely unexplored in non-vision application areas.
  - NAS has succeeded in discovering efficient networks for vision, but results for NLP are not yet as good.
- Key techniques that stand out
  - Layer-wise search spaces are more hardware friendly than cell-based search spaces (e.g., FBNet)
  - Using differentiable techniques and weight sharing
  - Using both gradient-based methods and RL/evolutionary methods (the former is used to train the architecture weights and the latter to incorporate the hardware cost)
  - Reducing the search space
  - Using predictors for training and predictors for hardware cost measurements
  - Quantization-aware and pruning-aware predictors

# Efficient Deep Learning Methods are Inevitable and still an Evolving Landscape



# Thank you!

Dr. Kaoutar El Maghraoui Principal Research Staff Member IBM Research AI

@kaoutarTech

IBM Research

