
Simultaneous multithreading

Each core of the Power10 processor supports multiple hardware threads that represent independent execution contexts. If only one hardware thread is used, the processor core runs in single-threaded (ST) mode.

If more than one hardware thread is active, the processor core runs in simultaneous multithreading (SMT) mode. In addition to the ST mode, the Power10 processor supports the following SMT modes:

• SMT2: Two hardware threads active

• SMT4: Four hardware threads active

• SMT8: Eight hardware threads active
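As a rough illustration of how the modes listed above map to what an operating system sees, the following Python sketch infers the mode of one core from the number of sibling hardware threads that Linux reports for it. The helper names (`parse_cpu_list`, `smt_mode_name`, `core_smt_mode`) are hypothetical, and the sketch assumes a Linux partition that exposes the sysfs file `topology/thread_siblings_list` and lists only online threads there. It is not an IBM tool.

```python
from pathlib import Path

# Mode names as used in the text; one active thread corresponds to ST mode.
SMT_MODE_NAMES = {1: "ST", 2: "SMT2", 4: "SMT4", 8: "SMT8"}


def parse_cpu_list(text):
    """Expand a Linux CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_mode_name(active_threads):
    """Map the number of active hardware threads on a core to a mode name."""
    return SMT_MODE_NAMES.get(active_threads, "unknown")


def core_smt_mode(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Report the mode of the core that owns logical CPU `cpu` (Linux only)."""
    # Assumption: this sysfs file exists and lists only online sibling threads.
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    try:
        siblings = parse_cpu_list(path.read_text())
    except OSError:
        return "unknown"  # Not Linux, or the topology files are not exposed.
    return smt_mode_name(len(siblings))


if __name__ == "__main__":
    print("Core of cpu0 runs in mode:", core_smt_mode(0))
```

The parsing and naming helpers are kept separate from the file access so that they can be checked without a Power system at hand.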

SMT enables a single physical processor core to simultaneously dispatch instructions from more than one hardware thread context. Computational workloads can use the processor core’s execution units with a higher degree of parallelism. This ability enhances the throughput and scalability of multi-threaded applications and optimizes the compute density for ST workloads.

SMT is primarily beneficial in commercial environments where the speed of an individual transaction is not as critical as the total number of transactions that are performed. SMT typically increases the throughput of most workloads, especially those workloads with large or frequently changing working sets, such as database servers and web servers.

42      IBM Power E1050: Technical Overview and Introduction

Table 2-2 provides a historical account of the SMT capabilities that are supported by each implementation of the IBM Power Architecture® since IBM Power4.

Table 2-2 SMT levels that are supported by Power processors

a. The Power Hypervisor supports a maximum of 240 cores in SMT8 mode, that is, 1920 hardware threads. AIX supports up to 1920 total threads (240 cores in SMT8 mode) in a single partition, starting with AIX 7.3 on Power10.

The Power E1050 server supports the ST, SMT2, SMT4, and SMT8 hardware threading modes. With the maximum number of 96 cores, a maximum of 768 hardware threads per partition can be reached.
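The thread counts quoted above follow from multiplying the number of cores by the number of threads per core in each mode. A minimal Python sketch of that arithmetic, with the function name `hardware_threads` chosen here purely for illustration:

```python
# Threads per core for each hardware threading mode named in the text.
SMT_THREADS_PER_CORE = {"ST": 1, "SMT2": 2, "SMT4": 4, "SMT8": 8}


def hardware_threads(cores, mode):
    """Return the number of hardware threads for `cores` cores in `mode`."""
    return cores * SMT_THREADS_PER_CORE[mode]


if __name__ == "__main__":
    # 96 is the maximum core count of the Power E1050 server stated in the text.
    for mode in SMT_THREADS_PER_CORE:
        print(f"{mode}: {hardware_threads(96, mode)} hardware threads")
```

With 96 cores in SMT8 mode, the function returns 768. With 240 cores in SMT8 mode, it returns the 1920-thread limit mentioned in the table footnote.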

2.1.5 Matrix Math Accelerator AI workload acceleration

The MMA facility was introduced by the Power ISA 3.1. The related instructions implement numerical linear algebra operations on small matrices and are meant to accelerate computation-intensive kernels, such as matrix multiplication, convolution, and discrete Fourier transform.

To efficiently accelerate MMA operations, the Power10 processor core implements a dense math engine (DME) microarchitecture that effectively provides an accelerator for cognitive computing, machine learning, and AI inferencing workloads.

The DME encapsulates compute-efficient pipelines, a physical register file, and an associated data flow that keeps the resulting accumulator data local to the compute units. Each MMA pipeline performs outer-product matrix operations, reading from and writing back to a 512-bit accumulator register.

Power10 implements the MMA accumulator architecture without adding an architected state. Each architected 512-bit accumulator register is backed by four 128-bit Vector Scalar eXtension (VSX) registers.
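The outer-product accumulation described above can be modeled in a few lines of plain Python. This is a conceptual sketch only, not MMA code. It assumes 32-bit floating-point elements, so that each 128-bit VSX register holds four values and a 512-bit accumulator holds a 4 x 4 tile. The function names are hypothetical.

```python
ACC_BITS = 512   # size of one accumulator register (from the text)
VSR_BITS = 128   # size of one VSX register (from the text)
FP32_BITS = 32   # assumption: single-precision elements

VSRS_PER_ACC = ACC_BITS // VSR_BITS      # 4 VSX registers back one accumulator
LANES_PER_VSR = VSR_BITS // FP32_BITS    # 4 fp32 values per VSX register


def new_accumulator():
    """Return a zeroed 4 x 4 tile: one row per backing VSX register."""
    return [[0.0] * LANES_PER_VSR for _ in range(VSRS_PER_ACC)]


def outer_product_accumulate(acc, x, y):
    """Rank-1 update in place: acc[i][j] += x[i] * y[j]."""
    for i in range(len(acc)):
        for j in range(len(acc[i])):
            acc[i][j] += x[i] * y[j]


def tile_matmul(a, b):
    """Multiply a 4 x K matrix by a K x 4 matrix as K outer-product updates."""
    acc = new_accumulator()
    for k in range(len(b)):
        column_of_a = [a[i][k] for i in range(VSRS_PER_ACC)]
        row_of_b = b[k]
        outer_product_accumulate(acc, column_of_a, row_of_b)
    return acc


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 x 2
    b = [[1, 0, 2, 0], [0, 1, 0, 2]]       # 2 x 4
    for row in tile_matmul(a, b):
        print(row)
```

Because the tile stays in the accumulator across all K updates, intermediate results do not need to leave the compute unit. This is the locality property that the text attributes to the DME.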

Code that uses the MMA instructions is included in the OpenBLAS and Eigen libraries. These libraries can be built by using the most recent versions of the GNU Compiler Collection (GCC) compiler. The latest version of OpenBLAS is available from the OpenBLAS project web page.

OpenBLAS is used by the Python-NumPy library, PyTorch, and other frameworks, which makes it easy to use the performance benefit of the Power10 MMA accelerator for AI workloads.

Chapter 2. Architecture and technical overview                                                                      43

The Power10 MMA accelerator technology is also used by the IBM Engineering and Scientific Subroutine Library for AIX on Power 7.1 (program number 5765-EAP).

Program code that is written in C/C++ or Fortran can benefit from the potential performance gains by using the MMA facility if the code is compiled by the following IBM compiler products:

• IBM Open XL C/C++ for AIX 17.1 (program numbers 5765-J18, 5765-J16, and 5725-C72)

• IBM Open XL Fortran for AIX 17.1 (program numbers 5765-J19, 5765-J17, and 5725-C74)

For more information about the implementation of the Power10 processor’s high throughput math engine, see A matrix math facility for Power ISA processors.

For more information about fundamental MMA architecture principles with detailed instruction set usage, register file management concepts, and various supporting facilities, see Matrix-Multiply Assist Best Practices Guide, REDP-5612.

2.1.6 Power10 compatibility modes

The Power10 core implements the Processor Compatibility Register (PCR) as described in the Power ISA 3.1, primarily to facilitate Live Partition Mobility (LPM) to and from previous generations of IBM Power hardware.

Depending on the specific settings of the PCR, the Power10 core runs in a compatibility mode that pertains to Power9 (Power ISA 3.0) or Power8 (Power ISA 2.07) processors. The support for processor compatibility modes also enables older operating system (OS) versions of AIX, IBM i, Linux, or Virtual I/O Server (VIOS) environments to run on Power10 processor-based systems.

Note: The Power E1050 server does not support IBM i.

The Power10 processor-based Power E1050 server supports the Power8, Power9 Base, Power9, and Power10 compatibility modes.
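The relationship between compatibility modes and Power ISA levels can be summarized as a small lookup, sketched below in Python. The ISA levels for Power8, Power9, and Power10 come from this section. The level for the Power9 Base mode is an assumption made here for illustration. The conclusion that MMA instructions are visible only in Power10 mode is an inference from the statement that MMA was introduced by Power ISA 3.1, not a documented interface. The names `COMPAT_MODE_ISA` and `mma_available` are hypothetical.

```python
# Compatibility mode -> Power ISA level.
COMPAT_MODE_ISA = {
    "Power8": "2.07",
    "Power9 Base": "3.0",  # assumption: same ISA level as the Power9 mode
    "Power9": "3.0",
    "Power10": "3.1",
}

# The text states that the MMA facility was introduced by Power ISA 3.1.
MMA_MIN_ISA = (3, 1)


def isa_tuple(version):
    """Turn an ISA level such as "2.07" into a comparable tuple (2, 7)."""
    return tuple(int(part) for part in version.split("."))


def mma_available(mode):
    """Infer whether a partition in `mode` would expose the MMA facility."""
    return isa_tuple(COMPAT_MODE_ISA[mode]) >= MMA_MIN_ISA


if __name__ == "__main__":
    for mode, isa in COMPAT_MODE_ISA.items():
        print(f"{mode}: Power ISA {isa}, MMA available: {mma_available(mode)}")
```

A practical consequence of this sketch is that a partition that must remain mobile to Power9 or Power8 systems, and therefore runs in an older mode, would not be expected to benefit from MMA-enabled libraries.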
