

# Aurora Overview

Argonne National Laboratory





2022 ALCF SDL Workshop, October 6, 2022

Colleen Bertoni and Scott Parker



#### Aurora

Leadership Computing Facility Exascale Supercomputer

#### Peak Performance

≥ 2 Exaflops DP

#### Intel GPU

Ponte Vecchio (PVC)

#### Intel Xeon Processor

Sapphire Rapids with High Bandwidth Memory

#### Platform

HPE Cray EX

#### Compute Node

- 2 Xeon SPR+HBM processors
- 6 Ponte Vecchio GPUs
- Node Unified Memory Architecture
- 8 fabric endpoints

#### GPU Architecture

- Intel XeHPC architecture
- High Bandwidth Memory stacks

#### Node Performance

\>130 TF

#### System Size

\>9,000 nodes

#### Aggregate System Memory

\>10 PB

#### System Interconnect

HPE Slingshot 11, Dragonfly topology with adaptive routing

#### Network Switch

- 25.6 Tb/s per switch (64 ports at 200 Gb/s)
- Links with 25 GB/s per direction

#### High-Performance Storage

- 220 PB
- ≥25 TB/s DAOS bandwidth

#### Software Environment

- C/C++
- Fortran
- SYCL/DPC++
- OpenMP offload
- Kokkos
- RAJA
- Intel Performance Tools
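
The network switch figures in the specification above can be cross-checked with a little arithmetic. The following is a minimal sketch, not an official HPE or ALCF calculation. It assumes that the 25.6 Tb/s per-switch figure counts both directions of all 64 ports, and that link bandwidth in GB/s is Gb/s divided by 8 with no encoding or protocol overhead. The function names are illustrative.

```cpp
#include <cassert>

// Aggregate switch bandwidth in Tb/s.
// Assumption (not stated in the slides): a "per switch" figure may count
// either one direction or both directions of every port.
double switch_tbps(int ports, double port_gbps, bool both_directions) {
    assert(ports > 0 && port_gbps > 0.0);
    const double one_way = ports * port_gbps / 1000.0;  // Gb/s -> Tb/s
    return both_directions ? 2.0 * one_way : one_way;
}

// Per-direction link bandwidth in GB/s, ignoring encoding/protocol overhead.
double link_gbytes_per_s(double port_gbps) {
    assert(port_gbps > 0.0);
    return port_gbps / 8.0;  // 8 bits per byte
}
```

With 64 ports at 200 Gb/s, `switch_tbps(64, 200.0, false)` gives 12.8 Tb/s and `switch_tbps(64, 200.0, true)` gives 25.6 Tb/s, while `link_gbytes_per_s(200.0)` gives 25 GB/s. Under the stated assumption, these agree with the quoted figures.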

### **Aurora Cabinets Installed at Argonne**











# **JLSE Testbeds to Aurora Node**





## **Aurora Compute Node**



- 6 X<sup>e</sup> Architecture based GPUs (Ponte Vecchio)
  - All to all connection
- 2 Intel Xeon (Sapphire Rapids) processors
- Unified Memory Architecture across CPUs and GPUs
- 8 Slingshot Fabric endpoints



### **The Evolution of Intel GPUs**










### X<sup>e</sup> Vector Engine (Execution Unit)

- The vector engine executes instructions
  - Register file
  - Multiple issue ports
  - Vector pipelines
    - Floating point
    - Integer
    - Extended math
    - FP64 (optional)
  - Matrix extension (XMX)
  - Thread control
  - Branch
  - Send (memory)


Image: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html



### X<sup>e</sup>-Core (Subslice)

#### An X<sup>e</sup>-core contains:

- Vector engines (execution units)
- Matrix engines (XMX)
- Thread dispatch
- Instruction cache
- L1 and shared local memory
- Load/Store



Image: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html



### X<sup>e</sup>-Slice

An X<sup>e</sup>-Slice is composed of X<sup>e</sup>-cores.



Image: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html



# X<sup>e</sup>-Stack (Tile)

- An X<sup>e</sup>-Stack contains
  - Variable number of X<sup>e</sup>-Slices
  - Shared L2 cache
  - Memory controllers
  - Media engine
  - X<sup>e</sup>-links high-speed coherent fabric (GPU to GPU)



Image: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html



# **High Level X<sup>e</sup> Architecture**

- An X<sup>e</sup> GPU is composed of
  - X<sup>e</sup>-Stacks
  - Memory fabric



Image: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html



# Intel Ponte Vecchio (XeHPC) GPU

Intel provided an introduction to the Ponte Vecchio GPU at its 2021 Intel Architecture Day event: https://www.intel.com/content/www/us/en/newsroom/resources/press-kit-architecture-day-2021.html





### **Intel Ponte Vecchio Architectural Components**





### **Distributed Asynchronous Object Store (DAOS)**

- Primary storage system for Aurora
- Offers high performance in bandwidth and I/O operations
  - 230 PB capacity
  - ≥ 25 TB/s bandwidth
- Provides a flexible storage API that enables new I/O paradigms
- Provides compatibility with existing I/O models such as POSIX, MPI-IO, and HDF5
- Open-source storage solution
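
To put the bandwidth figure in context, here is a rough estimate of how long it would take to write a given volume of data at the quoted rate. This is a back-of-the-envelope sketch only. It assumes decimal units (1 PB = 1000 TB), a sustained rate equal to the quoted 25 TB/s lower bound, and no metadata or contention overhead. The function name is illustrative.

```cpp
#include <cassert>

// Seconds needed to write `petabytes` of data at a sustained rate of
// `terabytes_per_second`. Decimal units (1 PB = 1000 TB) are an assumption.
double seconds_to_write(double petabytes, double terabytes_per_second) {
    assert(petabytes >= 0.0 && terabytes_per_second > 0.0);
    return petabytes * 1000.0 / terabytes_per_second;
}
```

For example, `seconds_to_write(10.0, 25.0)` returns 400. Writing roughly 10 PB, the order of the aggregate system memory quoted earlier, would therefore take about seven minutes under these idealized assumptions.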





### **Pre-exascale and Exascale US Landscape**

| System     | Delivery | CPU + Accelerator<br>Vendor |
|------------|----------|-----------------------------|
| Summit     | 2018     | IBM + NVIDIA                |
| Sierra     | 2018     | IBM + NVIDIA                |
| Perlmutter | 2021     | AMD + NVIDIA                |
| Frontier   | 2021     | AMD + AMD                   |
| Polaris    | 2021     | AMD + NVIDIA                |
| Aurora     | 2022     | Intel + Intel               |
| El Capitan | 2023     | AMD + AMD                   |

- Heterogeneous computing (CPU + accelerator)
- Varying vendors



### oneAPI

- Industry specification from Intel (https://www.oneapi.com/spec/)
  - Language and libraries to target programming across diverse architectures (DPC++, APIs, low level interface)
- Intel oneAPI products and toolkits (https://software.intel.com/ONEAPI)
  - Languages
    - Fortran (w/ OpenMP 5+)
    - C/C++ (w/ OpenMP 5+)
    - DPC++
    - Python
  - Libraries
    - oneAPI MKL (oneMKL)
    - oneAPI Deep Neural Network Library (oneDNN)
    - oneAPI Data Analytics Library (oneDAL)
    - MPI
  - Tools
    - Intel Advisor
    - Intel VTune
    - Intel Inspector



https://software.intel.com/oneapi



# **Available Aurora Programming Models**

Aurora applications may use:

- DPC++/SYCL
- OpenMP
- Kokkos
- RAJA
- OpenCL

Experimental 

#### Not available on Aurora:

- OpenACC

















### DPC++ (Data Parallel C++) and SYCL

#### SYCL

- Standard developed by Khronos and announced in 2014
- The latest SYCL specification (SYCL 2020) was released in 2021
- SYCL is a C++-based abstraction layer (standard C++17)
- Builds on OpenCL **concepts** (but single-source)
- SYCL is designed to be as close to standard C++ as possible







#### DPC++

- Part of the Intel oneAPI specification and Intel's implementation of SYCL
- Intel extension of SYCL to support new features
- Open source and available on GitHub
- Contains a Plugin Interface (PI) to allow DPC++ to run on multiple devices



### OpenMP

- OpenMP is a widely supported and utilized programming model
- OpenMP 5 constructs will provide a directive-based programming model for Intel GPUs
- Available for C, C++, and Fortran and optimized for Aurora
- The current OpenMP 5.1 specification supports offloading to an accelerator/GPU
  - Offload support started with OpenMP 4
- OpenMP with offload support offers a potential path to developing performance-portable applications
- Multiple compilers and vendors provide OpenMP implementations
- The community has reached a consensus on the "most common" subset of OpenMP features to be supported on devices
  - OpenMP features inappropriate for GPUs are often not implemented
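
As an illustration of the directive-based model described above, the following minimal sketch offloads a SAXPY loop with OpenMP `target` directives. It is not taken from the original slides, and the function name is illustrative. Whether the loop actually runs on a GPU depends on the compiler and flags used. If OpenMP offload is not enabled, the pragma is ignored and the loop runs on the host with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a*x + y element-wise, requesting offload via OpenMP.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // map(to:)     copies x to the device before the region starts
    // map(tofrom:) copies y to the device and back to the host afterwards
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling `saxpy_offload(3.0f, x, y)` with every element of `x` equal to 1 and every element of `y` equal to 2 leaves every element of `y` equal to 5. With the Intel oneAPI compiler, offload to Intel GPUs is typically enabled with flags along the lines of `icpx -fiopenmp -fopenmp-targets=spir64`; check the compiler documentation for the version in use.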





### **Intel Fortran for Aurora**

- Fortran 2008
- OpenMP 5
- New compiler with an LLVM backend
  - Strong Intel history of optimizing Fortran compilers
- Beta available today in oneAPI toolkits





https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fortran-compiler.html



# **Intel VTune and Advisor**

#### VTune Profiler

- Widely used performance analysis tool
- Supports analysis on Intel GPUs

#### Advisor

- Provides roofline analysis
- Offload analysis will identify components for profitable offload
  - Measures performance and behavior of the original code
  - Models specific accelerator performance to determine offload opportunities
  - Considers overhead from data transfer and kernel launch
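
The idea behind offload analysis can be illustrated with a deliberately simple cost model: offloading pays off only if the modeled device time plus data-transfer and launch overhead is smaller than the measured host time. The sketch below is a toy model for illustration and is not how Intel Advisor computes its projections. All names and parameters are hypothetical.

```cpp
#include <cassert>

// Toy description of one candidate region (all fields are assumptions
// supplied by the user, not values produced by any Intel tool).
struct RegionEstimate {
    double host_seconds;       // measured time of the region on the CPU
    double device_seconds;     // modeled kernel time on the accelerator
    double bytes_transferred;  // host<->device traffic, both directions
    double link_bytes_per_s;   // assumed host<->device bandwidth
    double launch_seconds;     // assumed kernel launch overhead
};

// Total modeled time when the region is offloaded.
double offloaded_seconds(const RegionEstimate& r) {
    assert(r.link_bytes_per_s > 0.0);
    return r.launch_seconds
         + r.bytes_transferred / r.link_bytes_per_s
         + r.device_seconds;
}

// Ratio of host time to offloaded time; values above 1 favor offload.
double estimated_speedup(const RegionEstimate& r) {
    return r.host_seconds / offloaded_seconds(r);
}

bool offload_is_profitable(const RegionEstimate& r) {
    return estimated_speedup(r) > 1.0;
}
```

In this model a long-running region that moves little data relative to its compute time comes out as profitable, while a short region dominated by transfer time does not, which is the kind of trade-off the offload analysis is meant to expose.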



# Intel MKL – Math Kernel Library

- Highly tuned algorithms
  - FFT
  - Linear algebra (BLAS, LAPACK)
  - Sparse linear algebra
  - Statistical functions
  - Vector math
  - Random number generators
- Optimized for every Intel platform
- oneAPI MKL (oneMKL)
  - https://software.intel.com/en-us/oneapi/mkl
- Latest oneAPI toolkits include DPC++ support and C/Fortran OpenMP offload



# **AI and Analytics**

#### Libraries to support AI and Analytics

- oneAPI Deep Neural Network Library (oneDNN)
  - High-performance primitives to accelerate deep learning frameworks
  - Powers TensorFlow, PyTorch, MXNet, Intel Caffe, and more
- oneAPI Data Analytics Library (oneDAL)
  - Classical machine learning algorithms
  - Easy-to-use one-line daal4py Python interfaces
  - Powers scikit-learn
  - Apache Spark MLlib



### **Aurora Applications Overview**

- ALCF and Intel are working with over 40 projects to ready codes for Aurora:
  - Argonne Early Science Program (ESP) projects contain a mix of simulation, learning, and data projects
  - DOE Exascale Computing Project (ECP) contains application (AD) and software (ST) projects
- Over 50 applications and software packages are being prepared for Aurora
- Involves effort from over 60 Argonne and Intel people and numerous outside teams
- Significant progress has been made on readying applications for Aurora
  - ECP and ESP teams have been actively porting and testing code and reporting issues
  - Argonne and Intel have held quarterly application status reviews to identify top issues
  - A monthly priority bug meeting between ANL and Intel follows up on and tracks issue resolution
  - Regular SDK updates are received from Intel
  - A test framework on JLSE allows issue reproducers and application tests to be run before software updates and nightly to identify changes



### Showcase

#### ExaSMR: NekRS Performance on Ponte Vecchio

#### Ponte Vecchio with Intel oneAPI DPC++ implementation

#### 1.5x performance lead

**ExaSMR:** Small modular reactors (SMRs) and advanced reactor concepts (ARCs) will deliver clean, flexible, reliable, and affordable electricity while avoiding the traditional limitations of large nuclear reactor designs.

https://www.exascaleproject.org/research-project/exasmr/



Figure 10: NekRS: potential temperature distributions in [K] at time 6 h and $z = 100$ m at resolutions of $\Delta x = 3.12$ m (left), 1.56 m (center), and 0.78 m (right), corresponding to $n = 128^3$, $256^3$, and $512^3$ grid points, respectively. $\Delta x$ represents the average grid spacing for $E = 16^3$, $32^3$, and $64^3$ spectral elements with polynomial order $N = 8$ on the domain $400\,\mathrm{m} \times 400\,\mathrm{m}$.

https://ceed.exascaleproject.org/docs/ceed-ms38-report.pdf

Relative Performance of NekRS Benchmarks w/ problem size of 8196 (Averaged throughput, higher is better)



#### Application Summary:

**NekRS** is an open-source Navier-Stokes solver based on the spectral element method, targeting classical processors and accelerators such as GPUs. The code started as a fork of libParanumal in 2019. OCCA is used for API-portable programming. https://github.com/argonne-lcf/nekRS/

OCCA is an open-source library that aims to make it easy to program different types of devices (e.g. CPU, GPU, FPGA). It provides a unified API for interacting with backend device APIs (e.g. OpenMP, CUDA, OpenCL), uses just-in-time compilation to build backend kernels, and provides a kernel language, a minor extension to C, to abstract programming for each backend. https://libocca.org

See backup for workloads and configurations. Results may vary.


H. Jiang, "Intel's Ponte Vecchio GPU : Architecture, Systems & Software," 2022 IEEE Hot Chips 34 Symposium (HCS), 2022, pp. 1-29, doi: 10.1109/HCS55958.2022.9895631.



### Showcase

#### ExaSMR: OpenMC Performance on Ponte Vecchio

Monte Carlo particle transport code for exascale computations

#### Ponte Vecchio with OpenMP target offload: 2x performance lead



Exascale Computing Project Annual Meeting 2022 presentation: https://www.alcf.anl.gov/events/2022-ecp-annual-meeting

International Conference on Physics of Reactors 2022 presentation: https://www.ans.org/meetings/physor2022/session/view-976/

https://docs.openmc.org

OpenMC Depleted Fuel Inactive Batch Performance on HM-Large Reactor with 40M particles (particles/second, Higher is better)



Application Summary: OpenMC is a Monte Carlo particle transport application that has recently been ported to the OpenMP target offloading programming model for use on GPU-based systems. The Monte Carlo method employed by OpenMC is considered the "gold standard" for high-fidelity simulation while also having the advantage of being a general-purpose method able to simulate nearly any geometry or material without the need for domain-specific assumptions. However, despite these advantages in ease of use and accuracy, Monte Carlo methods like those in OpenMC often suffer from a very high computational cost. The performance gains OpenMC has achieved on GPUs, as compared to traditional CPU architectures, are finally bringing within reach a much larger class of problems that historically were deemed too expensive to simulate using Monte Carlo methods. The leap in performance that GPUs now offer carries with it the potential to disrupt a number of engineering technology stacks that have traditionally been dominated by non-general deterministic methods. For instance, faster MC applications may greatly expand the design space and simplify the regulation process for new nuclear reactor designs, potentially improving the economics of nuclear energy and therefore helping to solve the world's climate crisis.



See backup for workloads and configurations. Results may vary.

H. Jiang, "Intel's Ponte Vecchio GPU : Architecture, Systems & Software," 2022 IEEE Hot Chips 34 Symposium (HCS), 2022, pp. 1-29, doi: 10.1109/HCS55958.2022.9895631.



# Thank You

