Microprocessors & Servers

SERVER 101

Server system technology
Platform = SoC processor + logic chipsets (North Bridge/South Bridge) + DIMMs + baseboard management controller + PCB + graphics cards

North Bridge = Memory controller + PCIe controller
South Bridge = SATA (HDD & SSD)/USB/Ethernet MAC/Flash/SPI (serial peripheral interface)/LPC (low pin count)
SoC = CPU [+ GPU] + memory (cache) + memory controller + I/O controller (PCIe, SATA, USB, Gbe) + system bus
The baseboard management controller (BMC) provides remote monitoring/management of server boards by system administrators
RAID: Redundant Array of Inexpensive Disks -> applied to HDDs/SSDs -> connects to the processor through the SATA interface
JBOD: Just a Bunch of Disks
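
As a rough sketch of the capacity tradeoffs across these schemes (assuming N equal-size drives and textbook RAID definitions; real arrays reserve extra metadata space):

```python
# Usable capacity for common RAID levels vs JBOD, assuming equal-size drives.
def usable_tb(level: str, n_drives: int, drive_tb: float) -> float:
    if level in ("JBOD", "RAID0"):      # concatenation / striping: no redundancy
        return n_drives * drive_tb
    if level == "RAID1":                # mirrored pairs (RAID1/10 style)
        return (n_drives // 2) * drive_tb
    if level == "RAID5":                # striping + 1 drive's worth of parity
        return (n_drives - 1) * drive_tb
    if level == "RAID6":                # striping + 2 drives' worth of parity
        return (n_drives - 2) * drive_tb
    raise ValueError(f"unknown level: {level}")

for lvl in ("JBOD", "RAID0", "RAID1", "RAID5", "RAID6"):
    print(f"{lvl}: {usable_tb(lvl, 8, 4.0):.0f} TB usable from 8 x 4TB drives")
```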





Server processor performance attributes: multicore (parallel processing), multithreading (virtualization), clock speed, cache, memory bandwidth & latency, logic chipset, memory subsystem, system bus, technology node (14nm FinFET vs 16nm), ISA (instruction set architecture: x86 vs ARM), ECC (error correcting code), RAS (reliability, availability & serviceability)

Rack & blade servers
Racks house chassis, with each rack supporting: servers, switches, networking & storage, peripherals, adapters & cards, power supplies, fans & cooling equipment
1RU or 1U is 1.75" high, 19" wide.

Typical racks can be 42U high and consume 12.5-30 kW of power, requiring ~5,000 CFM of airflow for cooling. Each server typically needs 160 CFM of airflow.
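
A quick sanity check on those figures (a minimal sketch; the per-rack and per-server numbers above are taken as given):

```python
# Back-of-the-envelope rack math from the figures above.
rack_u, server_u = 42, 1            # 42U rack filled with 1U servers
rack_cfm, server_cfm = 5000, 160    # rack airflow budget vs per-server need

max_by_space = rack_u // server_u           # 42 servers by space
max_by_airflow = rack_cfm // server_cfm     # 31 servers by airflow
print(f"space allows {max_by_space}, airflow allows {max_by_airflow}")
# Airflow can become the binding constraint before rack space does,
# which is why dense racks need careful thermal planning.
```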

A blade server chassis fits into a 19" rack and may be 4U to 10U (or larger) in height - and can support multiple blades, with each blade being processor + logic chipsets + memory + storage + network + I/O controllers/interfaces

Blade servers also include fabric modules + switch modules + power management modules, sometimes on separate cards in the rack-mount chassis





Software
Windows Server supports x86 only, not ARM
Commercially available Linux platforms:
Ubuntu (Canonical) -> supports x86, ARM
SUSE Linux Enterprise Server / SLES (Novell) & Red Hat -> support x86, not validated for ARM
Linaro: industry consortium developing open source Linux software for ARM-based SoCs


Server uP Tech Specs
Clock speed, L3 cache, DDR/PCI gen, # of cores, TDP, Tjmax, Technology Node

Datacenter Anatomy
CPU + components => NODES => BLADES/SLEDS + CHASSIS => RACKS => CLUSTERS



Board Components
SoC/Processor, memory modules (DIMM, SODIMM, etc.), logic chipsets (South Bridge) with interfaces for network, storage, graphics & other peripherals [PCI, SATA/IDE, AGP, LPC & SPI], power supplies & PMIC, HDD/SSD, GPU, heat sink, fans, wind tunnel / flow channel, cooling solution, switches, BMC

SoC = CPU [+ GPU] + memory (cache) + memory & I/O controller (PCIe, SATA, USB, Gbe) + system bus

Server uP IP / Macro Blocks
DDR, PCIE, L3 cache, SATA, USB, GPIO, PLL, SGMII, SERDES, APC/CBF, SRAM, UCI, PHY




ARMv8 64-bit
ARMv8 is a new ISA (instruction set architecture) that enables 64-bit computing, improving CPU performance and paving the path for ARM-based SoCs to compete with x86 in servers. Cortex-A57 and Cortex-A72 are standard ARMv8 cores, but several custom ARMv8 cores are being developed across the industry

Microservers
A low-power, high-density, modular rack-mount approach, in which each module ("sled" or "cartridge") is a full server that may be single- or dual-socketed
Better suited to "scale-out" applications that deploy more nodes & modules, as opposed to "scale-up" designs that boost performance by increasing clock speed or adding more CPU cores. Cloud-based solutions rely on distributed applications/computing and therefore tend to be served well by microservers - which improve power efficiency, redundancy, availability & cost management.

Intel
New u-architecture = TOCK; process node shrink = TICK
Haswell (22nm TOCK), Broadwell (14nm TICK)
Atom (Silvermont 22nm) - mobile-SoC-based core, optimized for low-power, energy-efficient, scale-out applications
Xeon (Haswell 22nm or Broadwell 14nm) -> D (low end/high density servers/microservers), E3 (desktop-client class/1S), E5 (mainstream or enhanced performance-EP/2S or 4S), E7 (mission critical or EX-expandability, 4S+)
Itanium (65nm & 32nm), implementing the IA-64 instruction set, which is not x86-compatible
uP specs: Clock speed GHz, cache MB, DRAM bandwidth GB/s, TDP W, # cores & threads, bitwidth
SKYLAKE: 14nm TOCK (uarchitecture change), CANNONLAKE: 10nm TICK (node shrink)
Denverton (Goldmont 14nm) is the follow-on to Atom (Silvermont 22nm)

AMD
Opteron 3000: low power (25-65W), low cost, 1S server, Piledriver core
Delhi, Orochi AM3


Opteron 4000: mainstream (35-95W), 1S or 2S, Piledriver core
Seoul, Orochi C32

Opteron 6000: high performance (100-150W), 2S or 4S, Piledriver core
Abu Dhabi, Orochi G34 -> Warsaw refresh

Opteron X: Kyoto, low power (11-22W), microserver (Jaguar core / mobile class)
Opteron A: Seattle, ARM based server(25W), ARMv8 64-bit Cortex A57
Roadmap:
Berlin (Steamroller core) & Toronto (Excavator core)
Next-gen microarchitectures: Zen (14/16nm x86 core) and K12 (ARMv8)


Qualcomm




48 single-threaded cores organized as 24 duplexes, ARMv8 64-bit, 12 x 5MB = 60MB L3 cache (1.25MB/core), 12MB L2 (512KB shared per duplex), bidirectional segmented ring-bus interconnect at 250GB/s aggregate bandwidth, 6 DDR4-2667 channels supporting 768GB of RAM at 128GB/s bandwidth, 32 PCIe Gen3 lanes, 2.2-2.6GHz clock speed, 120W TDP, 398mm², 10nm


https://www.anandtech.com/show/12025/qualcomm-launches-48core-centriq-for-1995-arm-servers-for-cloud-native-applications

https://hothardware.com/news/qualcomm-ships-48-core-centriq-2400-series-server-cpu

https://www.theregister.co.uk/2017/11/08/qualcomm_centriq_2400/

https://www.forbes.com/sites/davealtavilla/2017/11/08/qualcomm-launches-disruptive-48-core-centriq-server-processors-targeting-intels-bread-and-butter/#74fcec984e15



ASIC / SoC Functional Blocks (Interfaces, interconnects & controllers)
APC Application Processor Core
CBF Coherent Bus Fabric
L1/L2/L3 Cache
DDR Double Data Rate Memory Controller
PCIe Peripheral Component Interconnect Express 
CCIX Cache Coherent Interconnect for Accelerators
PLL Phase Lock Loop
UCI Unified Configuration Interface
SGMII Serial Gigabit Media Independent Interface
GbE Gigabit Ethernet 
SERDES Serializer Deserializer
USB Universal Serial Bus
GPIO General Purpose IO
SATA Serial Advanced Technology Attachment
IDE Integrated Drive Electronics
LPC Low Pin Count
SPI Serial Peripheral Interface
AGP Accelerated Graphics Port


Qualcomm Centriq 2400

Half-width board designed to fit 1U bent-metal sleds. A full-width chassis may be configured for dual-node or storage-rich options.








Reference motherboard with OCP Type 1 PCI-Express risers (red) and a network card in the nearest riser (front right, with black heat sink)



Risers give the OCP-compatible motherboard flexibility among configuration options in a 1U chassis, supporting a wide variety of PCI-Express add-in cards in several physical configurations - including "MegaCard" NVM-Express storage mezzanines, networking cards, and GPU/FPGA accelerators.


Qualcomm’s MegaCard hosts twenty NVM-Express storage cards, ten on each side. The MegaCard takes the place of a second Centriq 2400 motherboard in a full-width 1U chassis, supporting a total of 80TB of PCIe 3.0 NVM-Express storage.




Qualcomm storage MegaCard close-up (top) and mounted in reference design chassis (bottom)



Risers give OCP system customers access to a wide range of third-party add-in boards, permitting systems to host compute, storage, and network expansion capability and offload accelerators.

The system supports a high-memory-bandwidth architecture and is geared for highly threaded, scale-out workloads - targeting applications such as search, content delivery networks, and memory-intensive data analytics.







Low speed IOs (LSIO)

Boards need lower-speed interfaces to connect with peripherals, microcontrollers, LPC (low pin count) devices and EEPROM/flash memories.

UART, I2C and SPI are serial communication interface protocols that support such LSIO requirements. 

UART is the simplest but slowest.
I2C is faster than UART but slower than SPI, and allows multiple devices to share the same two-wire bus.
SPI offers the fastest speeds, but wiring complexity grows with the number of devices (each slave typically needs its own chip-select line).

UART: Universal Asynchronous Receiver & Transmitter
I2C: Inter Integrated Circuit
SPI: Serial Peripheral Interface
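
As a concrete illustration, reading a byte over I2C on a Linux board might look like the sketch below (the bus number, device address 0x50 - typical for EEPROMs - and register are assumptions; requires the smbus2 package):

```python
# Minimal I2C read on Linux (/dev/i2c-1) using the smbus2 package.
from smbus2 import SMBus

with SMBus(1) as bus:                        # bus 1 is an assumption
    value = bus.read_byte_data(0x50, 0x00)   # device 0x50, register 0x00
    print(f"register 0x00 = {value:#04x}")
```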


Media Independent Interfaces (MII)



The Media Independent Interface (MII) was originally developed as an interface between an Ethernet MAC and PHY, and was later standardized by IEEE as a protocol connecting a wide range of MAC (media access control) and PHY devices, independent of media and network. Several variants reduce signal counts & increase speeds:

  • Reduced media-independent interface (RMII)
  • Gigabit media-independent interface (GMII)
  • Reduced gigabit media-independent interface (RGMII)
  • Serial gigabit media-independent interface (SGMII)
  • High serial gigabit media-independent interface (HSGMII)
  • Quad serial gigabit media-independent interface (QSGMII)
  • 10-gigabit media-independent interface (XGMII)


The Intelligent Platform Management Interface (IPMI) is a remote hardware health monitoring and management system that defines interfaces for use in monitoring the physical health of servers, such as temperature, voltage, fans, power supplies and chassis. It was developed by Dell, HP, Intel and NEC, but has many more industry promoters, adopters and contributors.

From <https://www.webopedia.com/TERM/I/Intelligent_Platform_Management_Interface_IPMI.html>
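
In practice, a BMC that implements IPMI can be queried remotely with the standard ipmitool CLI; a minimal sketch (the hostname and credentials below are placeholders):

```python
# Query temperature sensors from a BMC over IPMI-over-LAN via ipmitool.
import subprocess

result = subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", "bmc.example.com",
     "-U", "admin", "-P", "password", "sdr", "type", "Temperature"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # one line per sensor: name | id | status | reading
```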


Microprocessor Trends

In addition to shrinking process nodes, uP history has been characterized in terms of instruction sets (RISC vs CISC), clock speed improvements, increasing core counts, hyperthreading & virtualization, the move from 32-bit to 64-bit processing, SoC integration (e.g., APU = CPU + GPU cores) - and more recently, high-bandwidth memory adoption and the power/efficiency improvements of ARMv8 microarchitectures over performance-optimized but power-hungry x86 (shifting the metric from perf/$ to perf/$/W).

Other trends include PCIe evolution, cache-coherent interfaces (CCIX) & GPU/FPGA accelerator integration.

Modern CPUs use high-speed memory interconnects (DDR), cache-coherent chip-to-chip interconnects (typically governed by proprietary standards), low-speed links (USB/SATA/LPC/SPI) for storage and low-level management, and PCIe (over SERDES lanes) to support most other critical interfaces.

Memory advancements are trending toward technologies supporting ever-higher bandwidths, including DDR4/5, HBM2/3, GDDR6 & HMC type memories. CPU memory is typically characterized in terms of generation, speed (MT/s), bandwidth (GB/s), total capacity (GB), # of channels, and # of DIMMs per channel (DPC).
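
These attributes combine arithmetically; a minimal sketch of the peak-bandwidth and capacity math (the DDR4-3200, 8-channel, 2-DPC example numbers are illustrative):

```python
# Peak theoretical DDR bandwidth = transfers/s x bus width (bytes) x channels.
def ddr_peak_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

def capacity_gb(channels: int, dpc: int, dimm_gb: int) -> int:
    return channels * dpc * dimm_gb          # DIMMs per channel x DIMM size

print(f"{ddr_peak_gbs(3200, 8):.1f} GB/s peak")    # 204.8 GB/s
print(f"{capacity_gb(8, 2, 64)} GB max capacity")  # 1024 GB
```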

The need for higher performance at lower latency has steadily pushed PCIe bandwidth upward. PCIe Gen 3 runs at 8GT/s per lane, while Gen 4 doubles that to 16GT/s. The higher bandwidth of PCIe Gen 4 is expected to directly benefit NVMe (SSD) storage and GPU/FPGA accelerator integration in datacenters. PCIe is characterized in terms of generation, per-lane bandwidth and # of supported lanes.
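
The per-direction throughput follows directly from the line rate, the 128b/130b encoding used by Gen 3 and Gen 4, and the lane count - a minimal sketch:

```python
# Effective PCIe bandwidth per direction: GT/s x encoding efficiency x lanes.
def pcie_gb_per_s(gt_per_s: float, lanes: int) -> float:
    efficiency = 128 / 130                    # 128b/130b line encoding (Gen3/4)
    return gt_per_s * efficiency / 8 * lanes  # bits -> bytes

for gen, rate in (("Gen3", 8.0), ("Gen4", 16.0)):
    print(f"{gen} x16: {pcie_gb_per_s(rate, 16):.1f} GB/s per direction")
# Gen3 x16 ~ 15.8 GB/s; Gen4 x16 ~ 31.5 GB/s
```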

Accelerator (GPU/FPGA) hardware integration already takes place today through chip-to-chip interfaces based on proprietary standards. A common standard (CCIX) that carries coherent accelerator traffic over PCIe can ease bandwidth bottlenecks while allowing seamless integration across a multitude of platforms.


https://www.nextplatform.com/2019/09/18/eating-the-interconnect-alphabet-soup-with-intels-cxl/

These include the Compute Express Link (CXL) from Intel, the Coherent Accelerator Processor Interface (CAPI) from IBM, the Cache Coherent Interconnect for Accelerators (CCIX) from Xilinx, and the Infinity Fabric from AMD. Other interconnects try to get around some of the limitations of bandwidth or latency inherent in the PCI-Express bus, such as the NVLink interconnect from Nvidia and the OpenCAPI interconnect from IBM. The Gen-Z interconnect from Hewlett Packard Enterprise links out from PCI-Express on servers to silicon photonics bridges and switches that hold out the promise of a memory-centric - rather than compute-centric - architecture for systems. It can be used to hook anything from DRAM to flash to accelerators in meshes with any manner of CPU.

https://www.nextplatform.com/2019/03/15/intel-offers-up-yet-another-accelerator-interconnect-technology/

https://www.nextplatform.com/2020/04/03/cxl-and-gen-z-iron-out-a-coherent-interconnect-strategy/



CPUs & FPGAs both integrate memory & logic, yet differ from each other. A CPU is good at performing a wide variety of tasks, while an FPGA is best at running specific workloads that repeat many times, particularly if they are parallelizable and change only occasionally. This makes FPGAs useful for offloading and accelerating tasks in conjunction with the CPU. The industry is trending toward heterogeneous CPU/FPGA integration at the SoC level, using FPGA fabrics as IP blocks in the silicon - expected to improve latency & relieve bottlenecks while reducing power consumption by eliminating intermediate interfaces and interconnects such as SERDES/PCIe/CCIX.

Storage has traditionally been HDD-based, working off data transfer protocols such as SAS & SATA
SAS: Serial Attached SCSI (speeds up to 12Gbps)
SATA: Serial ATA (speeds up to 6Gbps)

With the move to SSD/flash storage and the drive for faster speeds to support datacenter processing needs, some legacy systems still use SAS/SATA for SSDs, but modern systems use NVMe - a newer transfer protocol offering much higher data transfer speeds - to access SSDs over PCIe lanes. NVMe connections to SSDs are typically made through PCIe riser/expansion cards, a 2.5-inch U.2 connector or an M.2 small-form-factor slot.
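
However they are attached (riser card, U.2 or M.2), NVMe devices enumerate the same way to the OS; a sketch using the standard nvme-cli tool on Linux (requires the nvme CLI and usually root privileges):

```python
# List NVMe controllers/namespaces visible to the OS via nvme-cli.
import subprocess

out = subprocess.run(["nvme", "list"], capture_output=True, text=True, check=True)
print(out.stdout)   # device node, model, capacity, firmware per namespace
```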

 
Though M.2 and U.2 use the same type of flash memory storage, they come in entirely different form factors. M.2 is a small, flat board, while U.2 is the 2.5” form factor you’re familiar with from most SATA SSDs.
Capacity – Because of the larger form factor, U.2 drives offer higher storage capacity – around 4TB+ compared to the 2TB max from M.2.


 <https://www.velocitymicro.com/blog/m-2-vs-u-2/>  


U.2 vs M.2


Nvidia's CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model that lets CPUs offload work to groups of GPUs through appropriate interfaces, networks, switches and fabric. CUDA was originally developed for x86 hosts, but has more recently been extended to support ARM-based microarchitectures as well.
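
A minimal sketch of what GPU offload looks like in practice, using Numba's CUDA bindings (assumes an NVIDIA GPU plus the numba and numpy packages; a toy vector add, not a production kernel):

```python
# Offload a vector add to the GPU with Numba's CUDA JIT.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)            # global thread index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads = 256
blocks = (n + threads - 1) // threads    # enough blocks to cover n elements
vector_add[blocks, threads](a, b, out)   # Numba handles host<->device copies
assert np.allclose(out, a + b)
```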

Another trend relates to server virtualization, which can be done either with VMs and hypervisors - allowing multiple VMs to run multiple applications on potentially different operating systems (OS) - or with containers, which run multiple applications on the same OS.


Virtualization creates a simulated abstraction of physical (compute, storage and networking) resources, generating images of the resource (virtual machines) that can be used in a way identical to the physical resource. For instance, a single computer can be imaged into several virtual machines, partitioned from each other, running multiple apps and operating systems different from one another, all managed by a hypervisor. Because VMs are partitioned & sandboxed from the main physical system, they are ideal for testing use cases like beta versions of software releases, virus-infected data, or apps not intended for the base OS. Virtualization allows more efficient & effective use of resources, providing scalability and flexibility as needed, helping reduce infrastructure cost/capex as well as lowering maintenance overheads/opex by reducing thermal & power management needs.


<https://www.educba.com/what-is-kubernetes/>  


Containers, unlike virtual machines, do not bundle an operating system; they contain only the app code, runtime, system tools, libraries, and settings. This makes containers lighter, more portable and more efficient than virtual machines. Kubernetes is a container management tool whose main goals are deploying containers, scaling and descaling them, and balancing container load.


https://azure.microsoft.com/en-us/topic/what-is-kubernetes/

Kubernetes is open-source orchestration software for deploying, managing, and scaling containers

Modern applications are increasingly built using containers—microservices packaged with their dependencies and configurations. Kubernetes, or k8s for short, is open-source software for deploying and managing those containers at scale. With Kubernetes, you can build, deliver, and scale containerized apps faster.
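
As a small illustration, listing the pods a cluster is running via the official Kubernetes Python client (assumes a reachable cluster and a local kubeconfig; pip install kubernetes):

```python
# List running pods across all namespaces with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()        # uses credentials from ~/.kube/config
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")
```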



A newer trend allows abstraction of server resources (compute/acceleration, storage, networking/connectivity), giving flexibility in hardware configurations that can be disaggregated and then recomposed in a wide variety of ways depending on dynamic workload requirements - typically achieved through peripheral transport (PCIe) buses & proprietary or industry-standard switching fabrics (e.g., Liqid). This creates a reconfigurable infrastructure pool in which hardware resources (such as CPU/GPU/NVMe/SAS) are not limited to individual physical nodes but can be shared as needed, allowing scalability, availability & flexibility of resources on demand. The technology extends across clusters or networks over Ethernet or InfiniBand, and applies to bare-metal servers, virtual machines and containers.





Rise of the ARM Servers

Scale-out needs for cloud & edge computing are met through highly modular, low-power and densely packed (blade & micro) server nodes that push the price-performance-per-watt envelope, while providing options for scalability, improved COO/TCO through energy efficiency / lower costs, and improved RAS (reliability, availability & serviceability). This is the market where ARM servers offer an attractive value proposition - and several industry players (such as Cavium/Marvell and Applied Micro/Ampere) have been gaining traction, slowly but surely.

Value proposition for ARM servers in Cloud & Edge Computing:
  • Improved price per performance per watt, lowers total cost of ownership
  • Versatility to adapt to changing workloads
  • Adaptability to customize design to specific tasks or applications
  • An alternative to the Intel hegemony
  • Flexibility in supply chain
ARM server applications
  • Android based cloud gaming
  • Image processing / computer  vision neural networks for AI/ML in the cloud (training)
  • Real time object detection / pattern recognition for AI/ML at the edge (inference)
  • Web applications / hosting
  • Search & content delivery
  • Database, Storage
  • Analytics (?)
  • Media transcoding
  • Virtualization & Containers (?)

ARM-based cloud-native processors are specifically designed/optimized for cloud computing and are well suited for containerized applications and microservices. Using single-threaded cores at high core counts improves data isolation and allows running multiple applications with higher predictability, improved performance & better energy efficiency, as evidenced by metrics such as SIR/TDP, watts per core, SIR/TCO, and core or performance density per rack.

An instantiation of ARM-based cloud computing is the Android cloud (gaming, app development, enterprise mobile). This uses ARM servers (CPUs) working in conjunction with GPUs and encoders to run Android-based containers/VMs (applications) through virtualization and device emulation.



https://www.nextplatform.com/2021/05/05/soc-driven-inference-datacenters-becoming-new-reality/
Existing architecture of a CPU-centric approach







https://www.nextplatform.com/2020/08/06/the-tech-tricks-that-make-pci-express-6-0-and-beyond-possible/




