
Dual-chip modules for Power E1050 server - IBM Power E1050

Power E1050 can configure a maximum of four DCMs when the 4-socket (4S) configuration is requested. Power E1050 also offers 2-socket and 3-socket configurations. Eight out of 16 memory OMI busses per Power10 chip are brought out to the module pins for a total of 16 OMI busses per DCM.

Eight OP (SMP) busses from each chip are brought to DCM module pins. Each chip has two x32 PCIe busses brought to DCM module pins.

The details of all busses that are brought out to DCM module pins are shown in Figure 2-3.
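The per-module and per-system totals follow from the per-chip figures above. The following Python sketch works through that arithmetic. It assumes two Power10 chips per DCM, as the name dual-chip module implies. The names `PER_CHIP`, `CHIPS_PER_DCM`, and `pins_for` are illustrative and are not taken from IBM documentation.

```python
# Per-chip bus counts brought out to DCM module pins (figures from the text).
PER_CHIP = {"OMI": 8, "OP (SMP)": 8, "PCIe x32": 2}

# Assumption: a dual-chip module (DCM) carries two Power10 chips.
CHIPS_PER_DCM = 2


def pins_for(dcms: int) -> dict[str, int]:
    """Return the bus counts brought to module pins for `dcms` DCMs."""
    return {bus: count * CHIPS_PER_DCM * dcms for bus, count in PER_CHIP.items()}


# The Power E1050 is offered in 2-, 3-, and 4-socket configurations.
for sockets in (2, 3, 4):
    print(sockets, "sockets:", pins_for(sockets))
```

These are totals at the module pins only. How many of those busses a given system configuration actually wires up is not covered by this sketch.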

Chapter 2. Architecture and technical overview                                                                      39

2.1.3 Power10 processor core

The Power10 processor core inherits the modular architecture of the Power9 processor core, but the redesigned and enhanced micro-architecture increases processor core performance and processing efficiency. The peak computational throughput is markedly improved by new execution capabilities and optimized cache bandwidth characteristics. Extra Matrix Math Accelerator (MMA) engines can deliver performance gains for machine learning, particularly for AI inferencing workloads.

The Power E1050 server uses the Power10 enterprise-class processor variant in which each core can run with up to eight independent hardware threads. If all threads are active, the mode of operation is referred to as SMT8 mode. A Power10 core with SMT8 capability is named a Power10 SMT8 core or SMT8 core for short. The Power10 core also supports modes with four active threads (SMT4), two active threads (SMT2), and one single active thread (single-threaded (ST)).

The SMT8 core includes two execution resource domains. Each domain provides the functional units to service up to four hardware threads.
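As a minimal sketch of the thread modes described above, the following Python fragment models the four SMT modes and the thread count each one implies. The names `SmtMode`, `hardware_threads`, and `mode_for` are hypothetical helpers for illustration, not part of any IBM tool or API.

```python
from enum import IntEnum


class SmtMode(IntEnum):
    """SMT modes named in the text; the value is the active thread count."""
    ST = 1
    SMT2 = 2
    SMT4 = 4
    SMT8 = 8


def hardware_threads(cores: int, mode: SmtMode) -> int:
    """Total active hardware threads for `cores` SMT8-capable cores."""
    return cores * int(mode)


def mode_for(active_threads: int) -> SmtMode:
    """Smallest SMT mode that can hold `active_threads` threads on one core."""
    if active_threads < 1:
        raise ValueError("at least one thread must be active")
    for mode in SmtMode:  # iterates in definition order: ST, SMT2, SMT4, SMT8
        if active_threads <= mode:
            return mode
    raise ValueError("an SMT8 core runs at most eight hardware threads")
```

For example, `mode_for(3)` returns `SmtMode.SMT4`, because three active threads do not fit in SMT2 mode.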

Figure 2-4 shows the functional units of an SMT8 core where all eight threads are active. The two execution resource domains are highlighted with colored backgrounds in two different shades of blue.

Figure 2-4 Power10 SMT8 core

Each of the two execution resource domains supports 1 – 4 threads and includes four vector scalar units (VSUs) of 128-bit width, two MMAs, and one quad-precision floating-point (QP) and decimal floating-point (DF) unit.
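The per-domain figures can be rolled up into per-core totals. The sketch below does that with a small data structure. The class and field names are illustrative assumptions; only the counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDomain:
    """One of the two execution resource domains of an SMT8 core."""
    max_threads: int = 4
    vsus: int = 4              # vector scalar units
    vsu_width_bits: int = 128  # width of each VSU
    mmas: int = 2              # Matrix Math Accelerator engines
    qp_df_units: int = 1       # quad-precision / decimal floating-point unit


DOMAINS_PER_CORE = 2
domain = ResourceDomain()

# Per-core totals are the per-domain counts multiplied by the two domains.
core_totals = {
    "threads": DOMAINS_PER_CORE * domain.max_threads,
    "vsus": DOMAINS_PER_CORE * domain.vsus,
    "mmas": DOMAINS_PER_CORE * domain.mmas,
    "qp_df_units": DOMAINS_PER_CORE * domain.qp_df_units,
    "total_vsu_width_bits": DOMAINS_PER_CORE * domain.vsus * domain.vsu_width_bits,
}
print(core_totals)
```

The totals (eight threads, eight VSUs, four MMAs, and two QP and DF units) agree with the per-core figures that the text lists under the enhanced computation resources.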

One VSU and the directly associated logic are called an execution slice. Two neighboring slices can also be used as a combined execution resource, which is then called a super-slice. When operating in SMT8 mode, the eight simultaneous multithreading (SMT) threads are subdivided into pairs, and each pair runs on two adjacent slices, as indicated through colored backgrounds in different shades of green.

40      IBM Power E1050: Technical Overview and Introduction

In SMT4 or lower thread modes, 1 – 2 threads each share a four-slice resource domain. Figure 2-4 on page 40 also indicates other essential resources that are shared among the SMT threads, such as an instruction cache, an instruction buffer, and an L1 data cache.
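The relationship between threads, slices, and resource domains can be sketched as a simple mapping function. This is an illustration under stated assumptions, not the hardware's actual assignment logic: the thread and slice numbering is hypothetical, the alternation of threads between domains in SMT4 and lower modes is assumed, and thread counts of five to seven are treated as SMT8 mode.

```python
SLICES_PER_DOMAIN = 4


def slices_for_thread(thread: int, active_threads: int) -> list[int]:
    """Return the slice indices (0-7) that a hardware thread can run on.

    Illustrative mapping only: the numbering scheme is hypothetical.
    """
    if not 1 <= active_threads <= 8 or not 0 <= thread < active_threads:
        raise ValueError("thread or thread count out of range")
    if active_threads > 4:
        # SMT8 mode: thread pairs (0,1), (2,3), ... each run on one
        # super-slice, that is, on two adjacent slices.
        first = (thread // 2) * 2
        return [first, first + 1]
    # SMT4 or lower: one or two threads share a four-slice resource domain.
    # Assumption: threads alternate between the two domains.
    domain = thread % 2
    first = domain * SLICES_PER_DOMAIN
    return list(range(first, first + SLICES_PER_DOMAIN))
```

In this sketch, a thread in SMT8 mode sees two slices, while a thread in SMT4 mode sees four, which reflects why a thread can gain performance when fewer threads are active.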

The SMT8 core supports automatic workload balancing to change the operational SMT thread level. Depending on the workload characteristics, the number of threads that are running on one chiplet can be reduced from four to two and even further to only one active thread. An individual thread can benefit in terms of performance if fewer threads run against the core’s execution resources.

Micro-architecture performance and efficiency optimizations lead to an improvement of the performance per watt signature compared with the previous Power9 core implementation. The overall energy efficiency is better by a factor of approximately 2.6, which demonstrates the advancement in processor design that is manifested by Power10.
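To make the efficiency factor concrete, the following sketch shows the two limiting readings of a 2.6 times gain in performance per watt: the same performance at lower power, or more performance at the same power. The function names are illustrative, and the sketch assumes that the approximate factor applies uniformly, which real workloads might not do.

```python
EFFICIENCY_GAIN = 2.6  # approximate performance-per-watt factor from the text


def relative_power_at_equal_performance(gain: float = EFFICIENCY_GAIN) -> float:
    """Power needed, relative to the older core, for the same performance."""
    return 1.0 / gain


def relative_performance_at_equal_power(gain: float = EFFICIENCY_GAIN) -> float:
    """Performance delivered, relative to the older core, for the same power."""
    return gain


print(f"{relative_power_at_equal_performance():.0%} of the power")
```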

The Power10 processor core includes the following key features and improvements that affect performance:

- Enhanced load and store bandwidth

- Deeper and wider instruction windows

- Enhanced data prefetch

- Branch execution and prediction enhancements

- Instruction fusion

Enhancements in the area of computation resources, working set size, and data access latency are described next. The change in relation to the Power9 processor core implementation is provided in parentheses.

Enhanced computation resources

Here are the major computational resource enhancements:

- Eight VSU execution slices, each supporting 64-bit scalar or 128-bit single instruction, multiple data (SIMD) operations (+100% for permute, fixed-point, and floating-point operations; +400% for crypto (Advanced Encryption Standard (AES) and Secure Hash Algorithm (SHA)) operations).

- Four units for MMA, each capable of producing a 512-bit result per cycle (new): +400% single-precision and double-precision FLOPS, plus support for reduced-precision AI acceleration.

- Two units for QP and DF operations for more instruction types.
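The MMA figures above can be turned into result-element counts with a little arithmetic. The sketch below counts how many result elements of a given width fit into the 512-bit results of the four MMA units. It counts result elements only and makes no claim about FLOPS, because the text does not state how many operations contribute to each result. The names are illustrative.

```python
MMA_UNITS_PER_CORE = 4
MMA_RESULT_BITS = 512  # result width per unit per cycle, from the text


def results_per_cycle(element_bits: int, units: int = MMA_UNITS_PER_CORE) -> int:
    """Result elements of `element_bits` width produced per core per cycle."""
    if element_bits <= 0 or MMA_RESULT_BITS % element_bits:
        raise ValueError("element width must evenly divide the 512-bit result")
    return units * (MMA_RESULT_BITS // element_bits)


for name, bits in (("double precision", 64), ("single precision", 32)):
    print(f"{name}: {results_per_cycle(bits)} result elements per cycle per core")
```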

Larger working sets

The following major changes were implemented in working set sizes:

- L1 instruction cache: Two 48 KB 6-way (96 KB total) (+50%)

- L2 cache: 2 MB 8-way (+400%)

- L2 translation lookaside buffer (TLB): Two 4 K entries (8 K total) (+400%)
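The size and associativity figures above determine how much capacity each way holds and, given a cache line size, how many sets each cache has. The following sketch shows that calculation. The 128-byte line size is an assumption that the text does not state, and the function names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumption: cache line size in bytes; not stated in the text.
LINE_BYTES = 128


def bytes_per_way(size_bytes: int, ways: int) -> int:
    """Capacity of a single way of a set-associative cache."""
    return size_bytes // ways


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets for the given size, associativity, and line size."""
    return size_bytes // (ways * line_bytes)


caches = {
    "L1 instruction cache (one of two)": (48 * KIB, 6),
    "L2 cache": (2 * MIB, 8),
}
for name, (size, ways) in caches.items():
    print(name, bytes_per_way(size, ways) // KIB, "KiB per way,",
          num_sets(size, ways), "sets")
```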

Data access with reduced latencies

The following major changes reduce latency for load data:

- L1 data cache access at four cycles nominal with zero penalty for store forwarding (-2 cycles for store forwarding)

- L2 data access at 13.5 cycles nominal (-2 cycles)

- L3 data access at 27.5 cycles nominal (-8 cycles)

- TLB access at 8.5 cycles nominal on an effective-to-real address translation (ERAT) miss, including for nested translation (-7 cycles)
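Because the values in parentheses are changes relative to Power9, the implied Power9 latency is the Power10 latency minus the (negative) change. The sketch below works through that arithmetic and converts cycles to nanoseconds. The clock frequency that is passed to `to_ns` is a hypothetical example value, not an E1050 specification. The L1 entry is left out because its change applies to store forwarding rather than to the nominal access.

```python
# (Power10 latency in cycles, change versus Power9 in cycles), from the text.
LATENCIES = {
    "L2 data access": (13.5, -2.0),
    "L3 data access": (27.5, -8.0),
    "TLB access on ERAT miss": (8.5, -7.0),
}


def implied_previous(current: float, change: float) -> float:
    """Latency implied for Power9: the current value minus the change."""
    return current - change


def to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a cycle count to nanoseconds at `clock_ghz` gigahertz."""
    return cycles / clock_ghz


EXAMPLE_CLOCK_GHZ = 3.0  # hypothetical frequency, for illustration only

for name, (current, change) in LATENCIES.items():
    print(f"{name}: {current} cycles now, "
          f"{implied_previous(current, change)} cycles implied before, "
          f"{to_ns(current, EXAMPLE_CLOCK_GHZ):.2f} ns at {EXAMPLE_CLOCK_GHZ} GHz")
```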


Micro-architectural innovations that complement physical and logic design techniques and specifically address energy efficiency include the following examples:

- Improved clock-gating.

- Reduced flush rates with improved branch prediction accuracy.

- Fusion and gather operation merging.

- Reduced number of ports and reduced access to selected structures.

- Effective address (EA)-tagged L1 data and instruction caches, which require ERAT access only on a cache miss.

In addition to improvements in performance and energy efficiency, security represents a major architectural focus area. The Power10 processor core supports the following security features:

- Enhanced hardware support that provides improved performance while mitigating speculation-based attacks.

- Dynamic Execution Control Register (DEXCR) support.

- Return-oriented programming (ROP) protection.
