

# Intel<sup>®</sup> Xeon Phi<sup>™</sup> basics and architecture

September 22<sup>nd</sup>-23<sup>rd</sup>, 2015, University of Copenhagen, Denmark

# Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

# Intel Technologies for HPC

- Processors: Intel<sup>®</sup> Xeon<sup>®</sup> Processor
- Coprocessor: Intel<sup>®</sup> Many Integrated Core
- Network & Fabric
- I/O & Storage
- Software & Services



# Transforming the economics of HPC



Executing to Moore's Law

Predictable silicon track record: alive and well at Intel.

Enabling new devices with higher performance and functionality while controlling power, cost, and size



### Driving innovation and integration

### Enabled by Leading Edge Process Technologies



#### Integrated Today

SOFTWARE AND SERVICES



#### Coming in the near Future

### Intel® Xeon Phi<sup>™</sup> Coprocessor Product Family Based on Intel® Many Integrated Core (MIC) Architecture





- Coprocessor
- Over 1 TF DP Peak
- Up to 61 Cores
- Up to 16GB GDDR5


2016 Knights Landing

- The processor version of the next generation Intel Xeon Phi product
- 14 nm process
- Processor & Coprocessor
- Over 3 TF DP Peak
- Up to 72 Cores
- On-package high-bandwidth memory
- 3x single-thread performance
- Out-of-order core
- Integrated Intel<sup>®</sup> Omni-Path



#### FUTURE

Knights Hill: next generation of Intel<sup>®</sup> MIC Architecture product line

- 10 nm process
- 2nd Generation Integrated Intel<sup>®</sup> Omni-Path
- In planning


Per Intel's announced products or planning process for future products



# Hardware architecture

Intel<sup>®</sup> Xeon Phi<sup>™</sup>, Knights Corner (KNC)

### Architectural overview



- Up to 61 cores
- 8-16 GB GDDR5 memory (ECC)
- PCIe Gen2 (client), x16 lanes per direction

- Hardware cache coherency
- 8 memory controllers
  - 16 GDDR5 channels
  - Up to 5.5GT/s



- Pentium (P54C) scalar instruction set (x87)
  - In-order operation
- 64-bit addressing
- 512-bit vector unit
- 4 HW threads/core
- Two pipelines:
  - Scalar
  - Vector/Scalar



- 2-issue (1 scalar / 1 vector)
- 2-cycle decoder: no back-to-back cycle issue from the same context (thread)
  - At least two HW contexts (threads/processes) are needed to fully utilize the core
- Most vector instructions have a 4-clock latency
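The effect of the back-to-back issue restriction can be illustrated with a toy model. This is a minimal sketch and not a model of the real KNC pipeline: it assumes every thread always has an instruction ready, that the core picks threads round-robin, and that one instruction issues per cycle at most. The function name `issue_utilization` is hypothetical.

```python
def issue_utilization(num_threads, cycles=1000):
    """Fraction of cycles in which the toy core issues an instruction."""
    # Cycle in which each thread last issued; -2 means "free to issue at cycle 0".
    last_issue = [-2] * num_threads
    next_thread = 0  # round-robin pointer
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_threads):
            thread = (next_thread + offset) % num_threads
            # A thread may not issue in two consecutive cycles.
            if cycle - last_issue[thread] >= 2:
                last_issue[thread] = cycle
                issued += 1
                next_thread = (thread + 1) % num_threads
                break
    return issued / cycles


for threads in (1, 2, 4):
    print(threads, "thread(s):", issue_utilization(threads))
```

In this simplified model a single thread fills only every second issue slot, while two or more threads keep the core busy in every cycle.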



- L1 caches
  - 32K I-cache per core
  - 32K D-cache per core
  - 8-way associative
  - 3-cycle access latency
  - Up to 8 outstanding requests
  - 64-byte cache line
  - Fully coherent (MESI)



- L2 cache
  - 512K unified per core
  - 8-way associative
  - 11-cycle raw access latency
  - Up to 32 outstanding requests
  - Streaming HW prefetcher
  - Fully coherent (MESI)

### Cache coherency





- Vector unit width 512 bits
- 32 512-bit vector registers per context
  - 16 floats or 8 doubles
  - 8 vector mask registers for per-lane conditional operations
- ALU support for
  - int32/float32 operations, float64 arithmetic, int64 logic ops
  - Ternary ops including Fused-Multiply-Add
  - Broadcast/swizzle support, float16 upconvert
- Most ops: 4-cycle latency 1-cycle throughput
  - Matches 4-cycle round robin of integer unit
- Mostly IEEE 754-2008 compliant

# Architectural comparison

### Intel<sup>®</sup> Xeon®

- General instruction streams
  - High single-thread perf.
  - High memory capacity
- Core/memory aggregation via sockets and nodes
- Instruction set extensions
  - SIMD e.g., Intel<sup>®</sup> AVX/AVX2
  - Virtualization, AES, etc.

### Intel<sup>®</sup> Xeon Phi<sup>™</sup>

- General instruction streams
  - Highly parallel workloads
  - High memory bandwidth
- Up to 61 cores/die, aggregation via PCIe and nodes
- SIMD (512-bit registers)
  - Gather/scatter, FMA, masked instructions


Intel Xeon Phi is a coprocessor for highly parallel workloads.

# Architectural comparison (in numbers)

|                              | Intel<sup>®</sup> Xeon<sup>®</sup> E5-2670 v3 | Intel<sup>®</sup> Xeon Phi<sup>™</sup> 7120 |
|------------------------------|-----------------------------------------------|---------------------------------------------|
| Cores                        | 12                                            | 61                                          |
| Clock rate (GHz)             | 2.3 (3.1 with turbo)                          | 1.24 (1.3 with turbo)                       |
| Memory (GB)                  | 32 (typical), 768 (maximum)                   | 16                                          |
| Cache (L1, L2, L3)           | 32 kB, 256 kB, 30 MB (shared)                 | 32 kB, 512 kB, -                            |
| Peak performance (DP Gflops) | 441.6                                         | 1210.24                                     |
| Memory bandwidth (GB/s)      | 68                                            | 352                                         |
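The peak figures in the table can be reproduced with simple arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in the slides: 16 double-precision flops per cycle per core on both chips (two 256-bit FMA units on the Xeon, one 512-bit FMA unit on the Xeon Phi), 4-byte-wide GDDR5 channels on the coprocessor, and four DDR4-2133 channels of 8 bytes each on the Xeon. The helper names are hypothetical.

```python
def peak_dp_gflops(cores, clock_ghz, dp_flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x DP flops per cycle per core."""
    return cores * clock_ghz * dp_flops_per_cycle


def peak_memory_bw_gbs(channels, gigatransfers_per_s, bytes_per_transfer):
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * gigatransfers_per_s * bytes_per_transfer


# Assumption: 16 DP flops/cycle/core on both processors (see lead-in).
xeon_peak = peak_dp_gflops(cores=12, clock_ghz=2.3, dp_flops_per_cycle=16)
phi_peak = peak_dp_gflops(cores=61, clock_ghz=1.24, dp_flops_per_cycle=16)

# Assumption: 16 GDDR5 channels x 5.5 GT/s x 4 bytes; 4 DDR4 channels x 2.133 GT/s x 8 bytes.
phi_bw = peak_memory_bw_gbs(channels=16, gigatransfers_per_s=5.5, bytes_per_transfer=4)
xeon_bw = peak_memory_bw_gbs(channels=4, gigatransfers_per_s=2.133, bytes_per_transfer=8)

print(f"Xeon E5-2670 v3: {xeon_peak:.1f} Gflops, {xeon_bw:.0f} GB/s")
print(f"Xeon Phi 7120:   {phi_peak:.2f} Gflops, {phi_bw:.0f} GB/s")
```

Under these assumptions the results match the table; they are theoretical peaks, not measured performance.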

# **Highly Parallel Applications**



Theoretical acceleration of a highly parallel processor over an Intel® Xeon® parallel processor (<1: Intel® Xeon® faster). For illustration only.

- Efficient...
  - Vectorization
  - Threading
  - Parallel execution
- ...drives higher performance for *suitable* scalable applications

### Coprocessor system topology today





# Outlook on future hardware architecture

Intel<sup>®</sup> Xeon Phi<sup>™</sup>, Knights Landing (KNL)

### Future Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor: Knights Landing



- Unconstrained by PCIe\* offload bottlenecks

- Up to **72 cores** (Silvermont based)
  - 3x single thread performance over KNC
- Excellent compute density and power efficiency
  - >3 Tflops peak DP performance
- Integrated high bandwidth memory and fabric


All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.



- Silvermont based core
  - Out-of-order architecture
  - Binary compatible with Intel® Xeon® (AVX-512)
- 4 HW threads/core
- AVX-512 vector instructions
  - Prefetch instructions
  - Conflict Detection instructions
  - Exponential and Reciprocal instructions
- Advanced branch prediction in hardware
- 2D mesh architecture


# KNL integrated on-package memory



- Up to 5x STREAM bandwidth over DDR4 (>400 GB/s)
- Cache model
  - Hardware automatically manages the integrated on-package memory as cache
- Flat model
  - Programmer manages the integrated on-package memory and external DDR for peak performance
- Hybrid model

Diagram is for conceptual purposes only and only illustrates a CPU and memory; it is not to scale and is not representative of actual component layout.



# Intel<sup>®</sup> IMCI (Intel<sup>®</sup> Initial Many Core Instructions)

Intel<sup>®</sup> Xeon Phi<sup>™</sup>, Knights Corner (KNC)

### **Vector Instruction Format**

- 3-operand form with explicit destination register: instruction destination, source1, source2
  - → Source registers are not destroyed
  - → Very compact code
- (Most) MIC instructions can be masked: instruction destination {mask}, source1, source2
  - → Masking is non-destructive, i.e. destination elements whose mask bit is not set are preserved
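The non-destructive masking described above can be sketched in plain Python. This is only an illustration of the per-lane semantics, assuming a 16-lane single-precision register modelled as a list and a mask modelled as an integer bit field; `masked_vop` and `LANES` are hypothetical names, not part of any Intel API.

```python
import operator

LANES = 16  # a 512-bit register holds 16 single-precision lanes


def masked_vop(op, dest, mask, src1, src2):
    """Apply op lane by lane; lanes with a cleared mask bit keep dest's old value."""
    return [
        op(a, b) if (mask >> lane) & 1 else old
        for lane, (old, a, b) in enumerate(zip(dest, src1, src2))
    ]


dest = [0.0] * LANES
src1 = [float(i) for i in range(LANES)]
src2 = [10.0] * LANES

# Only the lower 8 lanes are written; the upper 8 lanes of dest are preserved.
result = masked_vop(operator.add, dest, 0x00FF, src1, src2)
print(result)
```

The source lists are never modified, which mirrors the point that source registers are not destroyed by the 3-operand form.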

# Examples of Intel<sup>®</sup> IMCI

#### Ternary Operands

- vop zmm1, zmm2, zmm3: zmm1 = zmm2 *vop* zmm3
- vop zmm1, zmm2, [ptr]: zmm1 = zmm2 *vop* MEM[ptr]
- Fused operations: multiply-add, multiply-subtract
  - vfmadd132ps zmm1, zmm2, zmm3: zmm1 = zmm1\*zmm3 + zmm2
  - vfmadd213ps zmm1, zmm2, zmm3: zmm1 = zmm2\*zmm1 + zmm3
  - vfmadd231ps zmm1, zmm2, zmm3: zmm1 = zmm2\*zmm3 + zmm1
  - IEEE 754-2008 fused operation: the result is rounded once, so the error is 0.5 ulp instead of up to 1 ulp for two separately rounded operations
- Prefetching
  - Memory prefetching minimizes the likelihood of L1 and L2 cache misses
  - The Intel® Xeon Phi<sup>™</sup> coprocessor has a hardware prefetcher
  - L1 prefetch: vprefetch1 ptr, hint
  - L2 prefetch: vprefetch2 ptr, hint
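The operand orderings of the three fused multiply-add forms, and the benefit of rounding only once, can be sketched as follows. This is an emulation under stated assumptions: it works on Python doubles rather than single-precision lanes, and it models the fused operation by computing the exact rational result with `fractions.Fraction` and rounding it once. All function names are hypothetical stand-ins for the instructions.

```python
from fractions import Fraction


def fma(a, b, c):
    """Emulate a fused multiply-add: exact a*b + c, rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def vfmadd132(zmm1, zmm2, zmm3):
    """zmm1 = zmm1 * zmm3 + zmm2, lane by lane."""
    return [fma(a, c, b) for a, b, c in zip(zmm1, zmm2, zmm3)]


def vfmadd213(zmm1, zmm2, zmm3):
    """zmm1 = zmm2 * zmm1 + zmm3, lane by lane."""
    return [fma(b, a, c) for a, b, c in zip(zmm1, zmm2, zmm3)]


def vfmadd231(zmm1, zmm2, zmm3):
    """zmm1 = zmm2 * zmm3 + zmm1, lane by lane."""
    return [fma(b, c, a) for a, b, c in zip(zmm1, zmm2, zmm3)]


# Single rounding versus two roundings: the exact product is 1 - 2**-60.
a, b, c = 1 + 2.0**-30, 1 - 2.0**-30, -1.0
fused = fma(a, b, c)    # keeps the tiny residual
unfused = a * b + c     # the product is rounded to 1.0 first, so the residual is lost
print(fused, unfused)
```

The digits in the mnemonic name which operands are multiplied and which is added; the destination is always the first operand.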

# Extended Math Unit (EMU)

- Single-precision transcendental functions via minimax quadratic polynomial approximation
- Elementary functions
  - Reciprocal: 1/x
  - Reciprocal square root: 1/sqrt(x)
  - Logarithm: log2(x)
  - Exponential: exp2(x)
- Derived functions
  - Power: x^y = exp2(y \* log2(x))
  - Square root: sqrt(x) = x \* 1/sqrt(x)
  - Division: x/y = x \* 1/y
  - Natural logarithm: ln(x) = log2(x) \* 1/log2(e)

| Function | Latency | Throughput |
|----------|---------|------------|
| exp2()   | 8       | 2          |
| log2()   | 4       | 1          |
| rcp()    | 4       | 1          |
| rsqrt()  | 4       | 1          |
| sqrt()   | 8       | 2          |
| pow()    | 16      | 4          |
| div()    | 8       | 2          |
| ln()     | 8       | 2          |
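How the derived functions are composed from the four elementary ones can be sketched in a few lines. This is an illustration only: the elementary functions are stand-ins built on Python's double-precision `math` module, not the hardware's single-precision polynomial approximations, and all function names are hypothetical.

```python
import math


# Stand-ins for the four elementary EMU functions.
def rcp(x):
    return 1.0 / x


def rsqrt(x):
    return 1.0 / math.sqrt(x)


def log2(x):
    return math.log2(x)


def exp2(x):
    return 2.0 ** x


# Derived functions, composed exactly as in the list above.
def emu_pow(x, y):
    return exp2(y * log2(x))


def emu_sqrt(x):
    return x * rsqrt(x)


def emu_div(x, y):
    return x * rcp(y)


def emu_ln(x):
    return log2(x) * rcp(log2(math.e))


print(emu_pow(2.0, 10.0), emu_sqrt(16.0), emu_div(1.0, 4.0), emu_ln(math.e))
```

The table's latencies appear consistent with this composition if a multiply is assumed to cost 4 cycles: for example, pow would be log2 (4) plus a multiply (4) plus exp2 (8), giving 16.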



# System architecture

#### Intel<sup>®</sup> Xeon Phi<sup>™</sup>, Knights Corner (KNC)

# Enabling and advancing parallelism

Intel tools, libraries and parallel models extend to multicore, many-core and heterogeneous computing



### System architecture overview



### Detailed system architecture overview




### Intel<sup>®</sup> Xeon Phi<sup>™</sup> infrastructure

- Operating System (OS)
  - Embedded Linux\* based on Conjure/Yocto (very few customizations)
  - You can assume at least a BusyBox environment
- Other infrastructure
  - Intel<sup>®</sup> Manycore Platform Software Stack (Intel<sup>®</sup> MPSS)
  - Intel<sup>®</sup> Coprocessor Offload Infrastructure (Intel<sup>®</sup> COI)
  - Intel<sup>®</sup> Symmetric Communications Interface (Intel<sup>®</sup> SCIF)



# Programming environment

Intel<sup>®</sup> Xeon Phi<sup>™</sup>, Knights Corner (KNC)

### Development environment

- Standard Intel<sup>®</sup> development environment is available:
  - Intel<sup>®</sup> Composer: C, C++ and Fortran Compilers
  - Standard runtime libraries, including pthreads\*
  - OpenMP\*
  - Intel<sup>®</sup> **MPI** Library support for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor
  - Parallel Programming Models
    - Intel<sup>®</sup> Threading Building Blocks (Intel<sup>®</sup> TBB)
    - Intel<sup>®</sup> Cilk<sup>™</sup> Plus
  - Tools
    - Intel support for **gdb**, Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE
  - Intel Performance Libraries (e.g. Intel Math Kernel Library)
    - Three versions: host-only, coprocessor-only, heterogeneous

# Programming models

#### Native



- Target Code: Highly parallel (threaded and vectorized) throughout
- Potential Bottleneck: Serial/scalar code

#### Offload



- Target Code: Mostly serial, but with expensive parallel regions
- Potential Bottleneck: PCIe data transfers

#### Symmetric



- Target Code: Highly parallel and performs well on both platforms
- Potential Bottleneck: Load imbalance

# Programming models

- MPI
  - Used for "native" and "symmetric" execution
  - Can launch ranks across processors and coprocessors
- OpenMP
  - Used for "native", "offload" and "symmetric" execution
  - OpenMP 4.0 standard supports device constructs for offloading
- Many real-life HPC codes use a native MPI/OpenMP hybrid
  - Balance task granularity by tuning the combination of ranks and threads (e.g. 16 MPI ranks x 15 OpenMP threads)
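The rank/thread trade-off can be explored by listing every combination that fills the coprocessor exactly. The sketch below assumes 60 cores with 4 hardware threads each are available to the application (240 threads in total, which matches the 16 x 15 example); both defaults and the function name are assumptions for illustration.

```python
def rank_thread_combinations(cores=60, threads_per_core=4):
    """All (MPI ranks, OpenMP threads per rank) pairs that use every hardware thread."""
    total = cores * threads_per_core
    return [(ranks, total // ranks) for ranks in range(1, total + 1) if total % ranks == 0]


for ranks, threads in rank_thread_combinations():
    print(f"{ranks:3d} MPI ranks x {threads:3d} OpenMP threads")
```

Which combination performs best depends on the application; the list only enumerates the candidates to benchmark.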

# Standards and existing code

### Existing source code

- In most cases, code can be simply recompiled
- All IA/x86 assumptions hold incl. legacy instructions
- Cross-compiled code can be used in offload section e.g., Intel<sup>®</sup> Threading Building Blocks
- Targeting Intel<sup>®</sup> Xeon Phi<sup>™</sup> does not waste effort
  - Tuning takes effort, but leverages existing standards
  - Optimizations usually lead to improved performance on Intel® Xeon<sup>®</sup>

# Intel<sup>®</sup> Parallel Studio XE 2016



| Composer Edition                           | Professional Edition                                                   | Cluster Edition                                                                                                                     |
|--------------------------------------------|------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| Intel® C++ Compiler                        | Intel® C++ Compiler                                                    | Intel® C++ Compiler                                                                                                                 |
| Intel® Fortran Compiler                    | Intel® Fortran Compiler                                                | Intel® Fortran Compiler                                                                                                             |
| Intel® Data Analytics Acceleration Library | Intel® Data Analytics Acceleration Library                             | Intel® Data Analytics Acceleration Library                                                                                          |
| Intel® Threading Building Blocks           | Intel® Threading Building Blocks                                       | Intel® Threading Building Blocks                                                                                                    |
| Intel® Integrated Performance Primitives   | Intel® Integrated Performance Primitives                               | Intel® Integrated Performance Primitives                                                                                            |
| Intel® Math Kernel Library                 | Intel® Math Kernel Library                                             | Intel® Math Kernel Library                                                                                                          |
| Intel® Cilk™ Plus & Intel® OpenMP*         | Intel® Cilk™ Plus & Intel® OpenMP*                                     | Intel® Cilk™ Plus & Intel® OpenMP*                                                                                                  |
|                                            | Intel® Advisor XE<br>Intel® Inspector XE<br>Intel® VTune™ Amplifier XE | Intel® Advisor XE<br>Intel® Inspector XE<br>Intel® VTune™ Amplifier XE<br>Intel® MPI Library<br>Intel® Trace Analyzer and Collector |
| Bundle or Add-on:                          | Add-on:                                                                | Add-on:                                                                                                                             |
| Rogue Wave IMSL* Library                   | Rogue Wave IMSL* Library                                               | Rogue Wave IMSL* Library                                                                                                            |

Additional configurations, including floating and academic, are available at: http://intel.ly/perf-tools

### Software architecture



## **Execution models**

- Intel MKL Automatic Offload (AO)
  - Transparent data transfer and execution management
  - Limited to key functions (sufficient FLOP/Byte ratio)
  - Automatically uses host and (multiple) targets
  - No code changes required
- Compiler Assisted Offload (CAO)
  - Explicit control of data transfer / persistence
  - Intel Compiler offload pragmas/directives
    - Language Extension for Offload (LEO)
    - OpenMP\* 4.0 device constructs
  - Can be used together with Automatic Offload
- Native Execution
  - Coprocessors used as independent nodes
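Why offload is limited to functions with a sufficient FLOP/Byte ratio can be seen from a rough estimate for matrix multiplication. The sketch below is a simplified model with assumptions of its own: it counts 2·m·n·k flops for a GEMM, assumes A and B cross the PCIe bus once and C crosses it twice (in and out), and ignores blocking and overlap. The function name is hypothetical.

```python
def gemm_flop_per_byte(m, n, k, bytes_per_element=8):
    """Flops per byte transferred for C = A*B + C with double-precision elements."""
    flops = 2 * m * n * k
    # A (m x k) and B (k x n) are sent once; C (m x n) is sent and returned.
    bytes_moved = bytes_per_element * (m * k + k * n + 2 * m * n)
    return flops / bytes_moved


for size in (64, 1024, 8192):
    print(f"n = {size:5d}: {gemm_flop_per_byte(size, size, size):7.1f} flop/byte")
```

For square matrices the ratio grows linearly with the matrix dimension, so larger problems amortize the transfer cost better than small ones.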

### Intel<sup>®</sup> Math Kernel Library (Intel<sup>®</sup> MKL)

- Single- and multi-threaded libraries
- Cluster support for important domains
- Support for large problem sizes (ILP64)
- Conditional Numerical Reproducibility (CNR)
- Support for Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessors
  - Automatic offload, and compiler-assisted offload
  - Manycore-hosted execution, cluster support, etc.
- Enabled early for future hardware
  - KNL support: AVX-512 instruction set

### Intel<sup>®</sup> Math Kernel Library (Intel<sup>®</sup> MKL)

| Linear Algebra                                                                                                                                                  | Fast Fourier<br>Transforms                                                     | Vector Math                                                                                                          | Vector RNGs                                                                                                                                                | Summary Statistics                                                                                                                           | And More                                                                                                  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| <ul> <li>BLAS</li> <li>LAPACK</li> <li>ScaLAPACK</li> <li>Sparse BLAS</li> <li>Sparse Solvers</li> <li>Iterative</li> <li>PARDISO* SMP &amp; Cluster</li> </ul> | <ul><li>Multidimensional</li><li>FFTW interfaces</li><li>Cluster FFT</li></ul> | <ul> <li>Trigonometric</li> <li>Hyperbolic</li> <li>Exponential</li> <li>Log</li> <li>Power</li> <li>Root</li> </ul> | <ul> <li>Congruential</li> <li>Wichmann-Hill</li> <li>Mersenne<br/>Twister</li> <li>Sobol</li> <li>Niederreiter</li> <li>Non-<br/>deterministic</li> </ul> | <ul> <li>Kurtosis</li> <li>Variation<br/>coefficient</li> <li>Order statistics</li> <li>Min/max</li> <li>Variance-<br/>covariance</li> </ul> | <ul> <li>Splines</li> <li>Interpolation</li> <li>Trust Region</li> <li>Fast Poisson<br/>Solver</li> </ul> |

# Intel<sup>®</sup> MKL Automatic Offload (AO)

- Control automatic offload (hybrid execution!)
  - Environment variable: MKL\_MIC\_ENABLE=1
  - Remember: sufficient problem size needed (FLOP/Byte ratio)
  - Service functions take precedence (work division, etc.)
- Supported functions
  - BLAS level 3: xGEMM, xTRMM, xTRSM
  - LAPACK: Cholesky, LU, QR
- Offload report (also applies to CAO)
  - OFFLOAD\_REPORT=<0|1|2>, or call mkl\_mic\_set\_offload\_report(...)
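Because Automatic Offload is controlled through environment variables, it can be switched on for a single run without touching the application. The sketch below only prepares such an environment; the helper name `automatic_offload_env` and the executable `./my_mkl_app` in the comment are hypothetical, while the two variable names are the ones listed above.

```python
import os


def automatic_offload_env(report_level=2, base_env=None):
    """Return a copy of the environment with MKL Automatic Offload enabled."""
    if report_level not in (0, 1, 2):
        raise ValueError("OFFLOAD_REPORT must be 0, 1 or 2")
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_MIC_ENABLE"] = "1"               # enable Automatic Offload
    env["OFFLOAD_REPORT"] = str(report_level)  # verbosity of the offload report
    return env


env = automatic_offload_env(report_level=1, base_env={})
print(env)

# Hypothetical usage with an MKL-linked binary (not executed here):
#   import subprocess
#   subprocess.run(["./my_mkl_app"], env=automatic_offload_env())
```

Setting the variables per process keeps the host-only behaviour as the default for every other run.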



# Programmability and performance

Intel<sup>®</sup> Xeon Phi<sup>™</sup>, Knights Corner (KNC)

## How and where to optimize?

1. Choose a library that solves the problem, or
2. Choose an appropriate algorithm and optimize your own code
   - a) Across SIMD lanes
   - b) Across multiple threads
   - c) Across multiple nodes



### **Intel Performance Library**

## Multithreading: Amdahl's law

- Speedup with n threads is limited by the *parallelizable* fraction P of the program

$$\Rightarrow S(n) = \frac{1}{(1-P) + \frac{P}{n}}$$
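The formula is easy to evaluate for a few parallel fractions. The sketch below is a direct transcription of S(n); the function name and the sample values of P are chosen purely for illustration, with n = 240 used as an example thread count.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup with n threads for a parallelizable fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


for p in (0.90, 0.99, 0.999):
    print(f"P = {p}: S(240) = {amdahl_speedup(p, 240):.1f}")
```

Even with 99% of the program parallelized, the speedup on 240 threads stays at roughly 71x, which is why the remaining serial fraction matters so much on a many-core device.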

- Up to 240 threads may be needed by Intel<sup>®</sup> Xeon Phi<sup>™</sup>!



### Performance

- xGEMM, STREAM, and SMP Linpack <u>http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html</u>
- xGEMM, Cholesky / LU / QR Decomposition, SMP Linpack, etc. <u>http://software.intel.com/en-us/intel-mkl#pid-12768-1295</u>
- Example:



Configuration info: software: Intel<sup>®</sup> Math Kernel Library (Intel<sup>®</sup> MKL), Intel<sup>®</sup> Manycore Platform Software Stack (Intel<sup>®</sup> MPSS).

## Real application performance

- A number of real applications have reported a speedup on Intel<sup>®</sup> Xeon Phi<sup>™</sup>
- For references, see the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Applications and Solutions Catalog
   <u>https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog</u>

## Conclusions

- Vectorization and thread parallelism are keys to good performance on an Intel<sup>®</sup> Xeon Phi<sup>™</sup>
- Code modernization benefits both the processor and the coprocessor
- Some of the limitations of the current coprocessor generation will be removed with Knights Landing

## References

- James Jeffers, James Reinders, "Intel® Xeon Phi™ Coprocessor High Performance Programming", Morgan Kaufmann, 2013.
- Alexander Supalov, Andrey Semin, Michael Klemm, Chris Dahnken, "Optimizing HPC Applications with Intel® Cluster Tools", Apress Open, 2015.
- PRACE Xeon Phi Best Practice Guide
   <u>http://www.prace-ri.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML</u>
- Xeon Phi Developer's Quick Start Guide
   <u>http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide</u>





- James Jeffers, James Reinders, "High Performance Parallelism Pearls Volume One, 1st Edition. Multicore and Many-core Programming Approaches", Morgan Kaufmann, 2014.
- James Jeffers, James Reinders, "High Performance Parallelism Pearls Volume Two, 1st Edition. Multicore and Many-core Programming Approaches", Morgan Kaufmann, 2015.



