IBM revealed details of its 5.5GHz System z microprocessor chip in 32nm high-k CMOS at ISSCC this year.
Each chip has six processor cores, compared with four in its 45nm predecessor. To form a plug-in processing element, six chips are mounted on a 96x96mm ceramic tile, along with two memory chips.
Each core has a 64kbyte L1 instruction cache and a 96kbyte L1 data cache, as well as a pair of 1Mbyte L2 SRAMs, one for data and one for instructions, and the cores share a 48Mbyte DRAM L3 cache on the chip.
"The L1 cache was an area of intense design focus to optimise performance, functionality, and efficiency. Both caches are built with a 0.291µm2 6T SRAM cell with single-phase read and dual-phase write to attain optimum area and power efficiency," said IBM.
Each processor die has 15 metal layers, occupies 598mm2, contains around 2.75bn transistors, and has 1071 signal IOs.
On top of this, there are two 32nm 192Mbyte L4 cache chips (3.3bn transistors covering 526mm2) per tile, achieving a total 36-core bandwidth of 530Gbyte/s.
The glass ceramic tile itself is complex, with 102 layers, including 39 layers for signals, some carrying 2.75Gbit/s single-ended data.
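The per-tile totals implied by the figures above can be worked out with a few lines of arithmetic. This is only a sketch: it assumes the 3.3bn transistor count applies to each L4 chip, and that the 530Gbyte/s figure is an aggregate shared evenly by the 36 cores, neither of which is stated explicitly here. The names are illustrative.

```python
# Figures quoted in the article
CHIPS_PER_TILE = 6            # processor chips per ceramic tile
CORES_PER_CHIP = 6
L3_PER_CHIP_MBYTE = 48        # shared DRAM L3 on each processor chip
L4_CHIPS_PER_TILE = 2
L4_PER_CHIP_MBYTE = 192
PROC_TRANSISTORS_BN = 2.75    # per processor chip
L4_TRANSISTORS_BN = 3.3       # ASSUMPTION: per L4 chip, not for the pair
TILE_BANDWIDTH_GBYTE_S = 530  # ASSUMPTION: aggregate across all cores


def tile_totals():
    """Return per-tile totals derived from the quoted per-chip figures."""
    cores = CHIPS_PER_TILE * CORES_PER_CHIP
    return {
        "cores": cores,
        "l3_mbyte": CHIPS_PER_TILE * L3_PER_CHIP_MBYTE,
        "l4_mbyte": L4_CHIPS_PER_TILE * L4_PER_CHIP_MBYTE,
        "transistors_bn": (CHIPS_PER_TILE * PROC_TRANSISTORS_BN
                           + L4_CHIPS_PER_TILE * L4_TRANSISTORS_BN),
        # Even split is a simplification; real traffic is not uniform.
        "bandwidth_per_core_gbyte_s": TILE_BANDWIDTH_GBYTE_S / cores,
    }


if __name__ == "__main__":
    for name, value in tile_totals().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions a tile carries 36 cores, 288Mbyte of L3 and 384Mbyte of L4.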
EDA changes
Complexity in the cores and associated logic brought productivity challenges that were tackled with a design methodology called 'large block structured synthesis' (LBSS).
"This methodology incorporated novel algorithms to create structure in dataflow regions within the context of larger synthesised blocks," said IBM. "The success of this approach led to a fundamental shift in digital logic microprocessor design from a highly custom-oriented transistor-level approach to an automated gate-level synthesis-based LBSS methodology."
Global clock design also became more efficient: a dedicated automatic buffering and routing tool was used for distribution from the phase-locked loop (PLL).
So that unused cores can be turned off, each core has its own synchronous grid. For debug and tuning purposes, programmable delay and duty cycle control circuits are included at each core global clock input.
With such high frequencies, conductor heat could have affected reliability.
"To avoid Joule heating of wires, special buffers were designed with a custom via stack and output wiring to allow repeatable and reliable connections to the high-level wiring planes for driving large loads," said IBM. "In addition, the design tools were configured to select wire widths based on total load, with a detailed analysis carried out at the block level and the chip level."
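The idea of choosing a wire width from the total load can be illustrated with a simple current-density check. The sketch below is not IBM's tool or method: the function names, the current-density limit, the metal thickness and the menu of widths are all hypothetical, and it uses the average charging current (C x V x f) as a stand-in for the RMS current that actually governs Joule heating.

```python
def min_wire_width_um(load_ff, vdd_v, freq_ghz,
                      j_max_ma_per_um2, thickness_um, activity=1.0):
    """Smallest width (in um) that keeps current density under j_max.

    All parameter values passed in are hypothetical illustrations.
    """
    # Average current to charge the load once per cycle: I = a * C * V * f
    current_a = activity * (load_ff * 1e-15) * vdd_v * (freq_ghz * 1e9)
    current_ma = current_a * 1e3
    # Required cross-section, then divide by metal thickness to get width.
    area_um2 = current_ma / j_max_ma_per_um2
    return area_um2 / thickness_um


def pick_width_um(load_ff, allowed_widths_um, **limits):
    """Pick the narrowest allowed width that satisfies the limit."""
    needed = min_wire_width_um(load_ff, **limits)
    for width in sorted(allowed_widths_um):
        if width >= needed:
            return width
    raise ValueError("load too large for the available wire widths")


if __name__ == "__main__":
    # Hypothetical 1pF load switching every cycle at 5.5GHz from 1.0V.
    width = pick_width_um(
        1000.0, [0.2, 0.4, 0.8, 1.6],
        vdd_v=1.0, freq_ghz=5.5, j_max_ma_per_um2=10.0, thickness_um=1.0,
    )
    print(width)
```

A production flow would work from extracted parasitics and foundry electromigration rules rather than a single limit, but the principle of sizing the wire to the load it drives is the same.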
Bias temperature instability (BTI) differs between SiO2 and high-k gate dielectrics, leading the firm to study how clock pulse-widths and latch transmission gates were affected by prolonged voltage stress.
"Using an in-house tool to analyse BTI effects on a transistor-by-transistor basis given the specific waveforms applied, and a BTI model developed for this technology, it was found that the write margins appear to actually increase over time," said IBM.
This was confirmed by statistical analysis using the derived end-of-life device parameter shifts.
Deep-trench decoupling capacitors on the processor chips (over 24µF on the main rail and almost 6µF total on other rails) removed the need for separate capacitors on the ceramic tile.
"Simulations and correlating measurements on the previous system have shown that any benefit of this additional capacitance, so far removed from the processor chips, is negligible," said IBM.
ISSCC 2013 paper 3.1: 5.5GHz System z Microprocessor and Multichip Module