The Power E1050 server can mirror the Power Hypervisor code across multiple memory DDIMMs. If a DDIMM that contains the hypervisor code develops an uncorrectable error, its mirrored partner enables the system to continue to operate uninterrupted.
Active Memory Mirroring (AMM) is an optional feature (#EM81).
The hypervisor code logical memory blocks (LMBs) are mirrored on distinct DDIMMs. No specific DDIMM hosts the hypervisor memory blocks, so the mirroring is done at the LMB level, not at the DDIMM level, which leaves more memory usable. To enable the AMM feature, the server must have enough free memory to accommodate the mirrored memory blocks.
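The following Python sketch illustrates what mirroring at the logical memory block level implies: each hypervisor block needs two free blocks on two different DDIMMs, and placement fails when the server does not have enough free memory. It is a toy model only. The `Ddimm` class, the `place_mirrored_lmbs` function, and the "most free blocks first" selection rule are assumptions made for illustration and do not describe the actual hypervisor placement algorithm.

```python
from dataclasses import dataclass


@dataclass
class Ddimm:
    """Hypothetical model of a DDIMM: a name and a count of free logical memory blocks."""
    name: str
    free_lmbs: int


def place_mirrored_lmbs(ddimms, lmbs_needed):
    """Return one (primary, mirror) pair of DDIMM names per hypervisor block.

    The two copies of a block always land on distinct DDIMMs. Raises
    MemoryError when fewer than two DDIMMs still have a free block.
    The selection rule (most free blocks first) is an assumption.
    """
    free = {d.name: d.free_lmbs for d in ddimms}  # work on a copy
    placement = []
    for _ in range(lmbs_needed):
        # DDIMMs that still have room, ordered by free blocks, then by name.
        candidates = sorted(
            (name for name, count in free.items() if count > 0),
            key=lambda name: (-free[name], name),
        )
        if len(candidates) < 2:
            raise MemoryError("not enough free memory on distinct DDIMMs to mirror")
        primary, mirror = candidates[0], candidates[1]
        free[primary] -= 1
        free[mirror] -= 1
        placement.append((primary, mirror))
    return placement


if __name__ == "__main__":
    ddimms = [Ddimm("A", 2), Ddimm("B", 1), Ddimm("C", 1)]
    for primary, mirror in place_mirrored_lmbs(ddimms, 2):
        print(f"block copy on {primary}, mirror on {mirror}")
```

Because no DDIMM is reserved for the hypervisor in this model, any DDIMM with a free block can hold one copy, and the only constraint is that the two copies never share a DDIMM.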
Besides the hypervisor code itself, other components that are vital to the server operation are also mirrored:
- Hardware page tables (HPTs), which are responsible for tracking the state of the memory pages that are assigned to partitions
- Translation control entities (TCEs), which are responsible for providing I/O buffers for the partition's communications
- Memory that is used by the hypervisor to maintain partition configuration, I/O states, virtual I/O information, and partition state
You can check whether the AMM option is enabled and change its status by using the HMC. The relevant information and controls are in the Memory Mirroring section of the General Settings window of the selected Power E1050 server (Figure 2-12).
Figure 2-12 Memory Mirroring section in the General Settings window on the HMC enhanced GUI
After a failure occurs on one of the DDIMMs that contains hypervisor data, all the server operations remain active and the enterprise Baseboard Management Controller (eBMC) service processor isolates the failing DDIMMs. The system stays in the partially mirrored state until the failing DDIMM is replaced.
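The failover behavior can be pictured with a minimal Python sketch of a single mirrored block. It assumes a simplified model in which a read is served from any healthy copy and the block returns to the fully mirrored state after the failed DDIMM is replaced. The `MirroredBlock` class, its method names, and the state labels other than "partially mirrored" are hypothetical and are not part of any firmware or HMC interface.

```python
class UncorrectableError(Exception):
    """Raised in this model when no healthy copy of a block remains."""


class MirroredBlock:
    """Hypothetical model of one hypervisor block stored on two DDIMMs."""

    def __init__(self, data, primary, mirror):
        if primary == mirror:
            raise ValueError("copies must be on distinct DDIMMs")
        self.copies = {primary: data, mirror: data}
        self.failed = set()

    def mark_failed(self, ddimm):
        """Record that a DDIMM holding one copy has been isolated."""
        if ddimm not in self.copies:
            raise KeyError(ddimm)
        self.failed.add(ddimm)

    def read(self):
        """Serve the read from the first healthy copy."""
        for ddimm, data in self.copies.items():
            if ddimm not in self.failed:
                return data
        raise UncorrectableError("both copies are unavailable")

    def replace(self, ddimm):
        """Model a DDIMM replacement: resynchronize from the surviving copy."""
        if ddimm in self.failed:
            self.failed.discard(ddimm)
            try:
                self.copies[ddimm] = self.read_from_other(ddimm)
            except UncorrectableError:
                self.failed.add(ddimm)
                raise

    def read_from_other(self, ddimm):
        """Return the data held by a healthy copy other than `ddimm`."""
        for name, data in self.copies.items():
            if name != ddimm and name not in self.failed:
                return data
        raise UncorrectableError("no surviving copy to resynchronize from")

    @property
    def state(self):
        healthy = sum(1 for name in self.copies if name not in self.failed)
        return {2: "mirrored", 1: "partially mirrored", 0: "lost"}[healthy]


if __name__ == "__main__":
    block = MirroredBlock(b"hypervisor data", "DDIMM0", "DDIMM1")
    block.mark_failed("DDIMM0")
    print(block.read(), block.state)
    block.replace("DDIMM0")
    print(block.state)
```

In this model, a single DDIMM failure never interrupts reads. Only the loss of both copies before a replacement makes the block unavailable.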
Memory that is used to hold the contents of platform dumps is not mirrored, and AMM does not mirror partition data. AMM mirrors only the hypervisor code and its components to protect this data against a DDIMM failure. With AMM, uncorrectable errors in data that are owned by a partition or application are handled by the existing Special Uncorrectable Error (SUE) handling methods in the hardware, firmware, and OS.
SUE handling prevents an uncorrectable error in memory or cache from immediately causing the system to stop. Instead, the system tags the data and determines whether it will be used again. If the data is never used again, the error is irrelevant and does not force a checkstop. If the data is used, termination can be limited to the program, kernel, or hypervisor that owns the data. If the data must be transferred to an I/O device, the I/O adapters that are controlled by an I/O hub controller can be frozen instead.
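The SUE decision flow can be summarized in a short Python sketch. It is an illustration of the outcomes that are described in the text, not the hardware or firmware logic. The `Owner` and `Outcome` enumerations and the `handle_sue` function are hypothetical names, and checking the I/O case before the owner case is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class Owner(Enum):
    PROGRAM = "program"
    KERNEL = "kernel"
    HYPERVISOR = "hypervisor"


class Outcome(Enum):
    NO_ACTION = "no action"
    TERMINATE_OWNER = "terminate owner"
    FREEZE_IO_ADAPTERS = "freeze I/O adapters"


def handle_sue(consumed: bool, owner: Owner,
               bound_for_io: bool = False) -> Tuple[Outcome, Optional[Owner]]:
    """Return the outcome for data that was tagged with an uncorrectable error.

    consumed      -- True if the tagged data is used again
    owner         -- who owns the tagged data
    bound_for_io  -- True if the data must be transferred to an I/O device
    """
    if not consumed:
        # Tagged data that is never used again does not force a checkstop.
        return (Outcome.NO_ACTION, None)
    if bound_for_io:
        # Assumption: the I/O case is evaluated before the owner case.
        return (Outcome.FREEZE_IO_ADAPTERS, None)
    # Termination is limited to whoever owns the data.
    return (Outcome.TERMINATE_OWNER, owner)


if __name__ == "__main__":
    print(handle_sue(False, Owner.PROGRAM))
    print(handle_sue(True, Owner.KERNEL))
    print(handle_sue(True, Owner.PROGRAM, bound_for_io=True))
```

The point of the sketch is that none of the branches stops the whole system: the impact is scoped to unused data, to a single owner, or to the affected I/O adapters.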