
Glossary

Comprehensive dictionary of alternative computing terminology

175 terms across 17 categories

Architectures

Dataflow Architecture
Dataflow architecture represents a fundamental departure from traditional sequential computing. Instead of executing instructions in program order, dataflow systems execute operations as soon as their input operands become available. This data-driven approach naturally exposes parallelism and eliminates many synchronization overheads. Modern AI accelerators often incorporate dataflow principles, where tensors flow through networks of processing elements. Companies like SambaNova and Cerebras use dataflow concepts to achieve high utilization when executing neural network computations.
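To make the firing rule concrete, here is a minimal Python sketch (the graph, node names, and token names are illustrative, not any vendor's toolchain): operations execute as soon as all of their input tokens are present, regardless of textual order.

```python
# Toy data-driven scheduler: fire any operation whose operands are available.
# The graph computes (a + b) * (c - d).
import operator

graph = {
    "add": (operator.add, ["a", "b"], "s"),
    "sub": (operator.sub, ["c", "d"], "t"),
    "mul": (operator.mul, ["s", "t"], "out"),
}

tokens = {"a": 2, "b": 3, "c": 10, "d": 4}   # initial operand tokens
pending = dict(graph)

while pending:
    ready = [n for n, (_, ins, _) in pending.items()
             if all(i in tokens for i in ins)]
    if not ready:
        break                                # nothing can fire
    for name in ready:                       # ready operations could run in parallel
        fn, ins, out = pending.pop(name)
        tokens[out] = fn(*(tokens[i] for i in ins))

print(tokens["out"])                         # (2 + 3) * (10 - 4) = 30
```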
Harvard Architecture
The Harvard architecture separates storage and signal pathways for instructions and data, allowing simultaneous access to both. Named for the Harvard Mark I computer, this design avoids the instruction-data contention behind the Von Neumann bottleneck by enabling the CPU to fetch the next instruction while reading or writing data. Modern processors often use a modified Harvard architecture with separate L1 caches for instructions and data while maintaining a unified main memory. This approach is particularly common in DSPs and microcontrollers where deterministic performance is critical.
Heterogeneous Computing
Heterogeneous computing combines different processor types—CPUs, GPUs, FPGAs, and specialized accelerators—in a single system to optimize for diverse workloads. Each processor type excels at different tasks: CPUs for sequential logic, GPUs for parallel computation, FPGAs for custom datapaths. Modern SoCs like Apple's M-series chips exemplify this approach, integrating CPU cores, GPU, Neural Engine, and media accelerators. The challenge lies in efficiently partitioning work and managing data movement between processors.
MIMD
Multiple Instruction, Multiple Data (MIMD) describes systems where multiple processors independently execute different instructions on different data. This is the most flexible parallel architecture, encompassing everything from multi-core CPUs to distributed computing clusters. MIMD systems can run completely independent programs or cooperate on shared problems through message passing or shared memory. The challenge lies in synchronization and communication overhead, which becomes increasingly significant as system scale grows.
Near-Memory Computing
Near-memory computing addresses the growing gap between processor speed and memory bandwidth by placing compute units close to or within memory. As AI models grow larger, data movement dominates energy consumption and limits performance. Solutions range from 3D-stacked memory with logic layers (like HBM-PIM) to processing-in-memory architectures using analog computation in memory arrays. This approach can reduce data movement energy by orders of magnitude for memory-bound workloads like neural network inference.
SIMD
Single Instruction, Multiple Data (SIMD) is a parallel processing paradigm where one instruction operates on multiple data points simultaneously. Modern CPUs include SIMD units like Intel's AVX-512, which can process 512 bits of data per instruction. SIMD is essential for multimedia processing, scientific computing, and increasingly for AI workloads. The programming model requires data to be organized in vectors, and performance depends heavily on alignment and avoiding branches that cause divergence across lanes.
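As a rough illustration (NumPy arrays standing in for hardware vector registers), the same computation can be written as an element-by-element loop or as a single vectorized expression that the library maps onto SIMD instructions where available:

```python
# Scalar loop vs. vectorized (SIMD-style) form of y = a*x + y.
import numpy as np

x = np.random.rand(1 << 16).astype(np.float32)
y = np.random.rand(1 << 16).astype(np.float32)

def saxpy_scalar(a, x, y):
    out = np.empty_like(x)
    for i in range(len(x)):          # one multiply-add per iteration
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vector(a, x, y):
    return a * x + y                  # same operation applied across all elements

assert np.allclose(saxpy_scalar(2.0, x[:512], y[:512]),
                   saxpy_vector(2.0, x[:512], y[:512]))
```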
SIMT
Single Instruction, Multiple Threads (SIMT) is NVIDIA's execution model for GPUs, extending SIMD concepts to thousands of threads. In SIMT, groups of threads (warps of 32 threads in NVIDIA GPUs) execute the same instruction but can follow different execution paths through predication. When threads diverge at branches, all paths must be executed serially, reducing efficiency. Understanding SIMT is crucial for GPU programming, as thread divergence is a primary source of performance loss in poorly optimized GPU code.
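The cost of divergence can be sketched with a toy mask-based model in Python (an analogy, not GPU code): when lanes of a 32-thread warp take different branches, each path runs with the other lanes inactive, so both passes consume issue slots.

```python
# Toy model of warp divergence: one instruction stream, per-lane active masks.
import numpy as np

x = np.random.randn(32)              # one value per lane of a 32-thread "warp"
result = np.empty_like(x)

taken = x > 0                        # per-lane branch condition

# Pass 1: the "if" path executes with only the taken lanes active.
result[taken] = np.sqrt(x[taken])
# Pass 2: the "else" path executes with the remaining lanes active.
result[~taken] = -x[~taken]

# Both passes were issued to all 32 lanes; average active-lane fraction is 50%.
utilization = (taken.mean() + (~taken).mean()) / 2
print(f"average lane utilization under divergence: {utilization:.0%}")
```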
Spatial Computing
Spatial computing architectures map computations onto a physical network of processing elements, with data flowing through the network rather than being fetched from memory. Unlike temporal architectures that reuse a few processing units over time, spatial architectures achieve parallelism through physical replication. Modern AI accelerators often use spatial designs where neural network layers are mapped onto arrays of processing elements. This approach maximizes data reuse and minimizes memory bandwidth requirements when the computation pattern is regular and predictable.
Systolic Array
A systolic array is a network of processing elements that rhythmically compute and pass data through the system, similar to blood pulsing through the circulatory system. Each element performs a simple operation and passes results to neighbors in a regular pattern. Google's TPU uses massive systolic arrays for matrix multiplication, the core operation in neural networks. The regular structure makes systolic arrays highly efficient for specific computations, achieving near-peak utilization of compute resources while minimizing memory bandwidth requirements through data reuse.
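The data choreography is easier to see in a small simulation. The sketch below (illustrative, not the TPU's actual design) models an output-stationary N-by-N array: A values stream in from the left, B values from the top, each PE multiplies whatever passes through it and accumulates locally, and the skewed injection ensures matching operands meet at the right cell.

```python
# Cycle-by-cycle simulation of an output-stationary systolic array.
import numpy as np

def systolic_matmul(A, B):
    N = A.shape[0]
    C = np.zeros((N, N))
    a_reg = np.zeros((N, N))           # A value currently held at each PE
    b_reg = np.zeros((N, N))           # B value currently held at each PE

    for t in range(3 * N - 2):         # enough cycles for the skewed data to drain
        new_a = np.zeros((N, N))
        new_b = np.zeros((N, N))
        new_a[:, 1:] = a_reg[:, :-1]   # A values march one PE to the right
        new_b[1:, :] = b_reg[:-1, :]   # B values march one PE downward
        for i in range(N):             # skewed injection at the left edge
            k = t - i
            new_a[i, 0] = A[i, k] if 0 <= k < N else 0.0
        for j in range(N):             # skewed injection at the top edge
            k = t - j
            new_b[0, j] = B[k, j] if 0 <= k < N else 0.0
        a_reg, b_reg = new_a, new_b
        C += a_reg * b_reg             # every PE does one multiply-accumulate per cycle
    return C

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```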
VLIW
Very Long Instruction Word (VLIW) architectures pack multiple operations into each instruction, relying on the compiler to identify parallelism and schedule operations statically. Unlike superscalar processors that dynamically find parallelism at runtime, VLIW shifts this complexity to compile time, simplifying hardware. Intel's Itanium was a notable VLIW attempt for general computing. Today, VLIW concepts live on in DSPs and some AI accelerators where predictable workloads allow effective compile-time scheduling.
Von Neumann Architecture
The Von Neumann architecture is the foundational design for most modern computers, proposed by mathematician John von Neumann in 1945. In this architecture, a single memory stores both program instructions and data, connected to the CPU via a shared bus. While elegant and flexible, this design creates the "Von Neumann bottleneck" where the processor must wait for data transfers between memory and CPU. This limitation has driven the development of caches, prefetching, and alternative architectures to improve performance.

GPUs

Compute Unit
Compute Units (CUs) are AMD's equivalent to NVIDIA's Streaming Multiprocessors, serving as the primary building blocks of AMD GPUs. Each CU contains stream processors, local data share memory, and scheduling hardware. AMD's RDNA architecture for gaming and CDNA architecture for compute differ in CU design optimizations. The MI300X accelerator contains 304 CUs, delivering massive parallel computing capability for AI workloads. Understanding CU architecture is crucial for optimizing HIP and OpenCL code on AMD hardware.
CUDA
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that transformed GPUs into general-purpose computing devices. Launched in 2007, CUDA provides C/C++ extensions for writing GPU programs and a rich ecosystem of libraries for linear algebra, FFT, deep learning, and more. CUDA's decade-long head start created a powerful moat for NVIDIA: most AI frameworks, research code, and enterprise applications are built on CUDA. This software ecosystem, as much as hardware performance, drives NVIDIA's dominance in AI computing.
GDDR
Graphics DDR (GDDR) is high-speed memory designed for graphics cards and gaming consoles. GDDR6 and GDDR6X are current generations, offering bandwidth up to 1 TB/s at lower cost than HBM. While HBM dominates high-end AI accelerators, GDDR remains important for consumer GPUs and some AI inference chips where cost matters more than peak bandwidth. GDDR's wider interface and lower stacking complexity make it more economical for applications that don't require HBM's extreme bandwidth.
GPU
Graphics Processing Units evolved from specialized graphics hardware into massively parallel processors that now dominate AI computing. Modern GPUs contain thousands of cores optimized for throughput rather than latency, making them ideal for the matrix operations central to deep learning. NVIDIA's data center GPUs like the H100 and H200 have become the standard for AI training, while AMD and Intel compete with alternatives. The GPU's success in AI stems from its high memory bandwidth, massive parallelism, and mature software ecosystem including CUDA and cuDNN.
HBM
High Bandwidth Memory (HBM) is a 3D-stacked DRAM technology that provides exceptional memory bandwidth for GPUs and AI accelerators. By stacking multiple DRAM dies vertically and connecting them with through-silicon vias (TSVs), HBM3e delivers over 1 TB/s per stack, and accelerators combining multiple stacks exceed 3 TB/s of total bandwidth. The technology trades capacity for bandwidth and sits on an interposer adjacent to the processor die. HBM's high cost and complex manufacturing make it exclusive to high-end accelerators, but its bandwidth is essential for training large AI models.
Memory Bandwidth
Memory bandwidth measures how fast data can be transferred between processor and memory, typically expressed in GB/s or TB/s. For AI workloads, memory bandwidth often determines performance more than raw compute capability, especially for inference and memory-bound training operations. The H100 GPU delivers 3.35 TB/s of HBM3 bandwidth, while consumer GPUs offer 500-1000 GB/s. A workload's arithmetic intensity (the number of operations it performs per byte of data moved), compared against the hardware's ratio of compute throughput to memory bandwidth, determines whether it is compute-bound or memory-bound.
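A quick roofline-style check makes the compute-bound versus memory-bound distinction concrete (the constants below are the H100 figures quoted above plus an approximate dense FP16 Tensor Core rate; the kernel intensities are illustrative):

```python
# Roofline sketch: attainable performance = min(peak compute, bandwidth * intensity).
PEAK_FLOPS = 989e12        # ~dense FP16 Tensor Core rate on an H100, FLOP/s
PEAK_BW    = 3.35e12       # HBM3 bandwidth, bytes/s

machine_balance = PEAK_FLOPS / PEAK_BW          # FLOPs the chip can do per byte moved

def attainable(intensity_flops_per_byte):
    return min(PEAK_FLOPS, PEAK_BW * intensity_flops_per_byte)

# An FP16 matrix-vector product does ~2 FLOPs per 2-byte weight read
# (intensity ~1 FLOP/byte), so it sits far below the machine balance:
print(f"machine balance: {machine_balance:.0f} FLOP/byte")
print(f"attainable at intensity 1:   {attainable(1) / 1e12:.2f} TFLOP/s (memory-bound)")
print(f"attainable at intensity 500: {attainable(500) / 1e12:.0f} TFLOP/s (compute-bound)")
```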
Multi-Instance GPU
Multi-Instance GPU (MIG) is NVIDIA's technology for partitioning a single GPU into multiple isolated instances, each with dedicated compute, memory, and bandwidth resources. Introduced with the A100, MIG enables multiple users or workloads to share expensive GPU hardware with guaranteed quality of service. An A100 can be split into up to seven instances, each appearing as a separate GPU to software. MIG is particularly valuable in cloud environments where GPU resources must be efficiently shared among multiple tenants or diverse inference workloads.
OpenCL
OpenCL (Open Computing Language) is an open standard for parallel programming across CPUs, GPUs, and other accelerators, maintained by the Khronos Group. Unlike CUDA, OpenCL works across vendors including AMD, Intel, and ARM. However, OpenCL's vendor-neutral design often results in lower performance than vendor-specific solutions, and its ecosystem lacks CUDA's depth of optimized libraries. OpenCL remains relevant for cross-platform applications and in environments where vendor lock-in is unacceptable.
Ray Tracing
Ray tracing simulates the physical behavior of light to create photorealistic graphics by tracing rays from the camera through each pixel into the scene. While computationally expensive, dedicated RT cores in modern GPUs (NVIDIA RTX, AMD RDNA2+) accelerate the critical ray-scene intersection tests. Beyond gaming, ray tracing has applications in scientific visualization, architectural rendering, and training autonomous vehicles. The same spatial acceleration structures used for ray tracing can accelerate certain AI and physics simulations.
ROCm
ROCm (Radeon Open Compute) is AMD's open-source alternative to CUDA for GPU computing. It provides a CUDA-compatible programming interface (HIP) that allows many CUDA applications to be ported with minimal changes. While ROCm has matured significantly, supporting major frameworks like PyTorch and TensorFlow, it still trails CUDA in library completeness and optimization. AMD's MI300 series accelerators target data center AI workloads, and ROCm's continued development is crucial for providing competition in the AI accelerator market.
Shader
Shaders are programmable processing units in GPUs that originated for graphics rendering but now serve general-purpose computing. In graphics, vertex shaders transform geometry while pixel (fragment) shaders determine final colors. Compute shaders extend this model for non-graphics workloads. The term reflects GPUs' graphics heritage, though modern GPU computing rarely involves traditional shading. Understanding shader architecture helps explain GPU design choices like wide SIMD units and the importance of occupancy for hiding memory latency.
Streaming Multiprocessor
The Streaming Multiprocessor (SM) is the fundamental compute building block in NVIDIA GPUs. Each SM contains CUDA cores, Tensor Cores, shared memory, and schedulers that manage thread execution. A high-end GPU like the H100 contains over 100 SMs. Programming efficiency depends on keeping SMs occupied with enough warps to hide memory latency. Understanding SM architecture—register files, shared memory limits, warp scheduling—is essential for writing high-performance CUDA code.
TDP
Thermal Design Power (TDP) specifies the maximum heat output a chip is designed to produce under sustained load, guiding cooling system requirements. Modern AI GPUs have TDPs of 350-700W, requiring sophisticated cooling solutions in data centers. TDP relates to but doesn't equal actual power consumption, which varies with workload. As AI accelerators push power limits, TDP has become a key constraint driving interest in liquid cooling and more efficient accelerator architectures.
Tensor Core
Tensor Cores are specialized processing units in NVIDIA GPUs designed for the matrix multiply-accumulate operations fundamental to deep learning. First introduced in the Volta architecture, Tensor Cores perform mixed-precision matrix operations much faster than standard CUDA cores. The H100's fourth-generation Tensor Cores support FP8 precision and can deliver up to 4 petaflops of AI performance. Tensor Cores have been key to NVIDIA's AI performance leadership, enabling efficient training and inference of increasingly large models.
VRAM
Video RAM (VRAM) is the dedicated high-bandwidth memory on graphics cards, critical for both graphics rendering and AI workloads. Modern AI GPUs feature 24-192GB of VRAM, and capacity often limits the size of models that can be trained or run. Large language models can require hundreds of gigabytes of memory, driving demand for multi-GPU systems and memory optimization techniques. VRAM bandwidth, typically measured in TB/s for modern GPUs, directly impacts AI performance for memory-bound operations common in inference.
Vulkan
Vulkan is a modern, low-overhead graphics and compute API that gives developers more direct control over GPU hardware than older APIs like OpenGL. Developed by Khronos Group, Vulkan reduces driver overhead and exposes features like explicit memory management and multi-threaded command recording. While primarily used for graphics, Vulkan's compute shaders support general-purpose GPU computing. The API's complexity makes it challenging to use but enables higher performance for applications that can manage the additional responsibility.

AI Accelerators

AI Inference Chip
AI inference chips are accelerators optimized for running trained neural networks in production, as opposed to training. Inference has different requirements than training: lower precision suffices (INT8 vs FP16/FP32), batch sizes are often smaller, and latency may matter more than throughput. Companies like Qualcomm, Intel, and numerous startups offer inference-focused chips for data center, edge, and mobile deployment. The inference market is more fragmented than training, where NVIDIA dominates, because diverse deployment environments favor specialized solutions.
AI Training Chip
AI training chips are accelerators designed for the computationally intensive process of training neural networks, requiring high memory bandwidth, large memory capacity, and strong floating-point performance. Training large language models can take weeks on thousands of accelerators, making efficiency crucial. NVIDIA's H100 and AMD's MI300X dominate this market, with Google's TPUs and custom chips from hyperscalers providing alternatives. Training chips must support high-precision formats (FP32, FP16, BF16) for gradient calculations and backpropagation.
IPU
Graphcore's Intelligence Processing Unit (IPU) is designed for machine intelligence workloads using a unique architecture with thousands of independent processor cores and distributed on-chip memory. Unlike GPUs that rely on external memory bandwidth, IPUs keep model parameters in distributed SRAM close to compute. The IPU's Bulk Synchronous Parallel (BSP) execution model differs fundamentally from GPU programming. While Graphcore has faced business challenges, the IPU architecture demonstrated innovative approaches to AI acceleration that influenced the broader industry.
LPU
Groq's Language Processing Unit (LPU) is a novel accelerator architecture designed specifically for large language model inference. Unlike GPUs that use complex scheduling and caching, Groq's LPU uses a deterministic, software-scheduled approach where computation timing is known at compile time. This eliminates memory bandwidth bottlenecks that limit LLM inference speed on GPUs. Groq has demonstrated industry-leading tokens-per-second performance, making LPUs particularly compelling for latency-sensitive LLM applications where inference speed directly impacts user experience.
Matrix Engine
Matrix engines are dedicated hardware units optimized for matrix multiplication, the dominant operation in neural networks. Found in TPUs, AI accelerators, and modern GPUs (as Tensor Cores), matrix engines achieve much higher efficiency than general-purpose compute units for this specific operation. These engines typically support specialized data types (FP16, BF16, INT8, FP8) and perform fused multiply-accumulate operations. The efficiency gains come from optimized dataflow, reduced instruction overhead, and specialized number formats.
NPU
Neural Processing Units (NPUs) are specialized AI accelerators optimized for neural network inference, typically integrated into mobile SoCs, laptops, and edge devices. Unlike power-hungry data center accelerators, NPUs emphasize efficiency within tight power budgets of 1-15W. Apple's Neural Engine, Qualcomm's Hexagon, and Intel's NPU enable on-device AI for features like computational photography, voice recognition, and real-time translation. NPUs are becoming standard in modern processors as AI capabilities move from cloud to edge.
TFLOPS
Tera Floating-Point Operations Per Second (TFLOPS) measures computational throughput for floating-point calculations, critical for AI training. Different precisions yield different TFLOPS: an H100 delivers 67 TFLOPS at FP32, 989 TFLOPS of dense FP16 Tensor Core throughput, and 3958 TFLOPS at FP8 with sparsity. Higher TFLOPS at lower precision reflects the hardware trend toward specialized AI formats. Like TOPS, TFLOPS indicates peak capability; actual performance depends on memory bandwidth, utilization, and workload characteristics.
Tokens Per Second
Tokens per second measures LLM inference throughput—how quickly a model generates output text. For interactive applications like chatbots, higher tokens/second means lower latency and better user experience. Throughput depends on model size, hardware capability, batch size, and sequence length. Groq's LPU has achieved over 500 tokens/second on LLaMA models, while typical GPU inference delivers 30-100 tokens/second per user. This metric has become increasingly important as LLM applications demand real-time responsiveness.
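A common rule of thumb (an upper bound, not a benchmark) explains why these numbers land where they do: at batch size 1, generating each token requires streaming every model weight through the processor once, so throughput is roughly memory bandwidth divided by model size in bytes.

```python
# Bandwidth-bound estimate of batch-1 decode throughput (rule of thumb only).
def max_tokens_per_second(params_billions, bytes_per_param, bandwidth_tb_s):
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# A 70B-parameter model in FP16 on a 3.35 TB/s accelerator:
print(f"{max_tokens_per_second(70, 2, 3.35):.0f} tokens/s ceiling")   # ~24

# The same model quantized to 8-bit weights doubles the ceiling:
print(f"{max_tokens_per_second(70, 1, 3.35):.0f} tokens/s ceiling")   # ~48
```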
TOPS
Tera Operations Per Second (TOPS) measures AI accelerator performance, typically for integer operations (INT8) used in inference. One TOPS equals one trillion operations per second. Marketing claims often cite peak TOPS without context about achievable performance on real workloads. Useful TOPS depends on memory bandwidth, software efficiency, and workload characteristics. When comparing accelerators, consider TOPS alongside memory bandwidth, supported precisions, power consumption, and benchmark performance on representative workloads.
TPU
Google's Tensor Processing Units (TPUs) are custom ASICs designed specifically for machine learning workloads. First deployed in 2015, TPUs use systolic array architecture optimized for the matrix operations central to neural networks. TPUs are available through Google Cloud and power many Google services including Search and Translate. The latest TPU v5p delivers exceptional performance for large language model training. Unlike general-purpose GPUs, TPUs sacrifice flexibility for efficiency on neural network computations.
Wafer-Scale Engine
Cerebras's Wafer-Scale Engine (WSE) takes an unprecedented approach: building an entire AI processor on a single silicon wafer rather than dicing it into individual chips. The WSE-3 contains 4 trillion transistors, 900,000 cores, and 44GB of on-chip SRAM across a wafer-sized die. This eliminates chip-to-chip communication bottlenecks that limit multi-GPU systems. The approach requires solving immense engineering challenges in yield, power delivery, and cooling. Cerebras has demonstrated strong performance on large AI models where their architecture's advantages shine.

Photonics

CPO
Co-Packaged Optics (CPO) integrates optical transceivers directly into switch and accelerator packages, dramatically reducing power consumption and latency compared to pluggable optics. By shortening the electrical path between chip and optics, CPO eliminates the need for power-hungry retimers and enables higher bandwidth density. CPO is becoming essential for AI clusters where thousands of accelerators must communicate with minimal overhead. Major players including Broadcom, NVIDIA, and hyperscalers are developing CPO solutions for next-generation AI infrastructure.
Electro-Optic Modulator
Electro-optic modulators convert electrical signals to optical signals by varying light properties (intensity, phase, or polarization) in response to applied voltage. In silicon photonics, modulators typically work by carrier injection or depletion in waveguides, changing refractive index. Modulator bandwidth directly determines achievable data rates—modern devices exceed 100 GHz. Improving modulator efficiency, linearity, and bandwidth while reducing power consumption remains an active area of research crucial for scaling optical interconnects.
LPO
Linear-drive Pluggable Optics (LPO) simplifies optical module architecture by eliminating the DSP chip inside transceivers, driving the optical components directly from the switch ASIC. This reduces power consumption by 30-50% compared to traditional pluggables while maintaining compatibility with existing form factors. LPO represents an intermediate step toward CPO, offering efficiency gains without requiring new packaging technology. The approach works best over shorter distances where signal integrity can be maintained without DSP processing.
Mach-Zehnder Interferometer
The Mach-Zehnder Interferometer (MZI) is a fundamental building block for photonic computing and optical communications. It splits light into two paths, applies phase shifts, and recombines them, with output intensity depending on the phase difference. In photonic neural networks, cascaded MZIs can perform arbitrary unitary transformations, enabling matrix multiplication at the speed of light. The MZI's ability to continuously vary its transmission makes it suitable for implementing the weighted connections in neural networks.
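A short transfer-matrix sketch shows the tunable-weight behavior (an idealized lossless model; the phase values are illustrative): two 50/50 couplers around an internal phase shift steer power between the two output ports as the phase varies.

```python
# Idealized Mach-Zehnder interferometer as a 2x2 transfer matrix.
import numpy as np

def mzi(theta, phi=0.0):
    coupler = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)    # 50/50 directional coupler
    internal = np.diag([np.exp(1j * theta), 1])            # phase shift in one arm
    external = np.diag([np.exp(1j * phi), 1])              # output phase shifter
    return external @ coupler @ internal @ coupler

light_in = np.array([1.0, 0.0])                            # all power enters port 1
for theta in (0.0, np.pi / 2, np.pi):
    power_out = np.abs(mzi(theta) @ light_in) ** 2
    print(f"theta={theta:.2f}  port powers={np.round(power_out, 3)}")
# theta=0 sends all power to port 2, theta=pi returns it to port 1, and
# intermediate phases give a continuously tunable split (a programmable "weight").
```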
Optical I/O
Optical I/O refers to chip-to-chip or die-to-die communication using light rather than electrical signals. As AI accelerators demand ever-higher bandwidth, electrical I/O faces fundamental limits from power consumption and signal integrity. Optical I/O promises 10-100x better energy efficiency at high bandwidths. Companies like Ayar Labs and Intel are developing optical I/O chiplets that integrate with processors. This technology could transform computer architecture by enabling disaggregated systems where memory, compute, and accelerators connect optically.
Optical Interconnect
Optical interconnects use light instead of electrical signals for data transmission, offering higher bandwidth over longer distances with lower power consumption than copper. In data centers, optics connect servers, switches, and storage across racks and buildings. As AI clusters scale to thousands of accelerators, optical interconnects become essential—electrical signaling cannot economically reach the required bandwidth over data center distances. Optical links are moving steadily closer to the processor, first through co-packaged optics and eventually through optical I/O.
Optical Neural Network
Optical Neural Networks (ONNs) implement neural network computations using light, potentially achieving orders of magnitude improvement in speed and energy efficiency for matrix operations. By encoding data as light intensities and using optical interference for computation, ONNs can perform multiply-accumulate operations passively, without active power consumption per operation. Companies like Lightmatter and research groups worldwide are developing ONNs for AI inference. Challenges include precision limitations, optical-electronic conversion overhead, and training networks to tolerate analog noise.
Photodetector
Photodetectors convert light signals back to electrical signals, complementing modulators in optical communication systems. In silicon photonics, germanium-on-silicon photodetectors detect telecom wavelengths that silicon cannot absorb. Key specifications include bandwidth (how fast signals can be detected), responsivity (electrical output per optical input), and dark current (noise). High-speed photodetectors enabling 100+ Gbaud reception are essential for modern data center optics and are a critical component in photonic computing systems.
Photonic Computing
Photonic computing uses photons (light particles) instead of electrons for computation, promising massive parallelism and energy efficiency. Optical signals propagate with minimal heat generation, and independent computations can ride on different wavelengths in parallel. Photonic processors can perform matrix multiplication—the core operation in neural networks—by encoding values as light intensities passing through programmable optical elements. Companies like Lightmatter, Luminous Computing, and Lightelligence are developing photonic AI accelerators. Challenges include interfacing with electronic systems and achieving sufficient precision for complex computations.
Photonic Integrated Circuit
Photonic Integrated Circuits (PICs) combine multiple optical components—lasers, modulators, waveguides, detectors—on a single chip, analogous to electronic integrated circuits. PICs enable complex optical systems in compact, mass-producible form factors. Applications range from telecommunications transceivers to LiDAR sensors to quantum computing components. Different material platforms (silicon, indium phosphide, silicon nitride) offer various tradeoffs between integration density, optical properties, and manufacturing cost.
Silicon Photonics
Silicon photonics leverages standard CMOS fabrication techniques to create optical components on silicon substrates, enabling mass production of photonic devices. The technology integrates waveguides, modulators, and detectors on chips, though light sources typically remain separate due to silicon's indirect bandgap. Major applications include high-speed data center interconnects, where silicon photonics enables 400G and 800G optical transceivers. Companies like Intel, Cisco, and Marvell produce silicon photonics products, while startups explore applications in computing and sensing.
Wavelength Division Multiplexing
Wavelength Division Multiplexing (WDM) transmits multiple independent data channels over a single optical fiber using different wavelengths (colors) of light. Dense WDM (DWDM) packs channels closely together for maximum capacity, while Coarse WDM (CWDM) uses wider spacing for lower cost. WDM is the foundation of long-haul telecommunications and increasingly important in data centers. Modern systems achieve capacities exceeding 100 Tbps per fiber pair by combining hundreds of wavelengths with advanced modulation formats.
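The capacity arithmetic is simple multiplication; the sketch below uses illustrative channel counts and modulation parameters rather than any specific product:

```python
# Aggregate WDM link capacity = channels * symbol rate * bits/symbol * polarizations.
channels        = 96       # DWDM wavelengths on one fiber
symbol_rate     = 64e9     # baud per channel
bits_per_symbol = 4        # e.g. 16-QAM
polarizations   = 2        # dual-polarization transmission

capacity = channels * symbol_rate * bits_per_symbol * polarizations
print(f"{capacity / 1e12:.1f} Tb/s aggregate")   # ~49 Tb/s before overheads
```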

Neuromorphic

Akida
BrainChip's Akida is a commercial neuromorphic processor targeting edge AI applications. Unlike research-focused neuromorphic chips, Akida is designed for practical deployment in consumer electronics, industrial sensors, and automotive applications. The chip supports both conventional convolutional neural networks and spiking neural networks, with on-chip learning capabilities. Akida's event-based processing enables efficient operation on sparse, temporal data while fitting within edge power budgets. BrainChip offers both chips and IP licensing for integration into other SoCs.
Dynamic Vision Sensor
Dynamic Vision Sensors (DVS), also called event cameras, output asynchronous events only when individual pixels detect brightness changes, unlike conventional cameras that capture full frames at fixed intervals. This approach reduces data volume by 10-100x for typical scenes while achieving microsecond temporal resolution. DVS pairs naturally with neuromorphic processors for efficient event-driven vision processing. Applications include high-speed robotics, autonomous vehicles, and surveillance where traditional cameras produce excessive redundant data. Companies like Prophesee and Samsung produce commercial DVS sensors.
Event-Driven Processing
Event-driven processing computes only when inputs change, in contrast to conventional synchronous systems that operate on fixed clock cycles regardless of activity. This approach, inspired by biological neural systems, can reduce power consumption by orders of magnitude for sparse, irregular data like sensor streams. Neuromorphic chips naturally support event-driven operation through asynchronous circuits and spiking networks. Event-driven systems excel at always-on monitoring tasks where inputs are mostly quiet but must respond quickly to changes.
Leaky Integrate-and-Fire
The Leaky Integrate-and-Fire (LIF) neuron is the most common computational model for spiking neural networks, balancing biological plausibility with computational efficiency. The model accumulates weighted input spikes on a "membrane potential" that leaks (decays) over time. When potential exceeds a threshold, the neuron fires a spike and resets. LIF neurons capture essential neural dynamics while remaining simple enough for efficient hardware implementation. Variations include adaptive thresholds and multi-compartment models for additional biological realism.
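A discrete-time version of the model fits in a few lines (the constants below are illustrative, not tuned to any particular chip): the potential decays each step, integrates weighted input spikes, and fires and resets when it crosses threshold.

```python
# Minimal discrete-time leaky integrate-and-fire neuron.
import numpy as np

def lif(input_spikes, weights, leak=0.9, threshold=1.0):
    v = 0.0                                # membrane potential
    output = []
    for x in input_spikes:                 # x: binary spike vector for one timestep
        v = leak * v + np.dot(weights, x)  # leak, then integrate weighted inputs
        if v >= threshold:
            output.append(1)               # fire a spike...
            v = 0.0                        # ...and reset the potential
        else:
            output.append(0)
    return output

rng = np.random.default_rng(0)
spikes = rng.random((100, 8)) < 0.2        # 100 timesteps, 8 input channels
weights = rng.random(8) * 0.3
print(sum(lif(spikes, weights)), "output spikes over 100 timesteps")
```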
Loihi
Intel's Loihi is a neuromorphic research chip implementing spiking neural networks with on-chip learning capabilities. The second generation, Loihi 2, contains up to 1 million neurons with programmable learning rules. Unlike conventional AI accelerators, Loihi supports various neuron models and synaptic plasticity mechanisms, enabling research into brain-inspired algorithms. Applications demonstrated include adaptive robotics, optimization problems, and sparse coding. Intel positions Loihi as a research platform rather than a product, collaborating with academic and industry partners through the Intel Neuromorphic Research Community.
Memristor
Memristors are electronic components whose resistance depends on the history of current flow, effectively "remembering" past states. Theorized by Leon Chua in 1971 and first fabricated in 2008, memristors enable analog memory and in-memory computing by storing and processing data in the same physical location. In AI applications, crossbar arrays of memristors can perform matrix-vector multiplication in a single step, potentially achieving orders of magnitude improvement in energy efficiency over digital approaches. Challenges include device variability, endurance, and integration with digital systems.
Neuromorphic Computing
Neuromorphic computing mimics the brain's architecture and operating principles to achieve efficient, adaptive computation. Unlike conventional processors that separate memory and processing, neuromorphic chips integrate them like biological neurons and synapses. These systems use event-driven, asynchronous processing—computing only when inputs change—achieving remarkable energy efficiency for certain tasks. Applications include always-on sensing, robotics, and edge AI where power budgets preclude conventional approaches. Major players include Intel (Loihi), IBM (TrueNorth), and BrainChip (Akida).
SNN
Spiking Neural Networks (SNNs) process information using discrete spikes over time, mimicking biological neurons more closely than conventional artificial neural networks. Instead of continuous activations, SNN neurons accumulate inputs and fire only when reaching a threshold. This temporal coding enables efficient event-driven processing—neurons that don't spike consume minimal power. SNNs excel at processing temporal data like audio and sensor streams. Training SNNs remains challenging since gradient descent doesn't directly apply to discrete spikes, though surrogate gradient methods have improved results.
STDP
Spike-Timing-Dependent Plasticity (STDP) is a biological learning rule where synapse strength changes based on the relative timing of pre and post-synaptic spikes. If a presynaptic spike consistently arrives just before a postsynaptic spike, the connection strengthens (potentiation); reverse timing causes weakening (depression). STDP provides a local, unsupervised learning rule requiring no external supervision signal. Neuromorphic systems implement STDP to enable on-chip learning, though the simple rule often needs augmentation for complex tasks.
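The classic pair-based form of the rule is a single exponential in each direction (time constants and learning rates below are illustrative):

```python
# Pair-based STDP: potentiate when pre precedes post, depress otherwise.
import numpy as np

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    dt = t_post - t_pre                        # positive when pre fires first
    if dt > 0:
        return a_plus * np.exp(-dt / tau_ms)   # potentiation, decays with |dt|
    return -a_minus * np.exp(dt / tau_ms)      # depression, decays with |dt|

w = 0.5
for t_pre, t_post in [(10.0, 15.0), (40.0, 38.0), (70.0, 71.0)]:
    w += stdp_delta_w(t_pre, t_post)
    print(f"pre={t_pre}  post={t_post}  w={w:.4f}")
```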
Synaptic Plasticity
Synaptic plasticity is the ability of connections between neurons to strengthen or weaken over time, forming the basis of learning and memory in biological and neuromorphic systems. The classic formulation "neurons that fire together wire together" (Hebbian learning) captures the basic principle. In neuromorphic chips, synaptic plasticity is implemented through programmable weights that adjust based on neural activity patterns. Understanding and implementing appropriate plasticity rules is key to enabling on-chip learning without requiring external training.
TrueNorth
IBM's TrueNorth, launched in 2014, was a landmark neuromorphic chip containing 1 million neurons and 256 million synapses while consuming only 70 milliwatts. The chip demonstrated that brain-inspired architectures could achieve remarkable efficiency for pattern recognition tasks. TrueNorth used a simplified neuron model without on-chip learning, requiring pre-training on conventional systems. While IBM has moved on to other architectures, TrueNorth proved neuromorphic computing's potential and influenced subsequent research into brain-inspired computing systems.

FPGAs

Adaptive SoC
Adaptive SoCs combine FPGA fabric with hard processor cores (typically ARM) and other fixed-function blocks in a single device. Examples include Xilinx Zynq and Intel Agilex SoC FPGAs. This integration enables efficient systems where processors handle control and I/O while FPGA fabric accelerates specific computations. Communication between processors and fabric uses high-bandwidth internal interfaces. Adaptive SoCs are popular in embedded applications, automotive, and edge AI where the combination of software flexibility and hardware acceleration provides optimal system solutions.
ASIC
Application-Specific Integrated Circuits are custom chips designed for particular applications, offering optimal performance and efficiency at the cost of lengthy development and high upfront investment. Unlike FPGAs that can be reconfigured, ASICs are fixed at manufacturing. AI ASICs like Google's TPU and specialized inference chips achieve 10-100x better efficiency than general-purpose processors for their target workloads. The decision between ASIC and FPGA involves tradeoffs between performance, flexibility, development cost, and time-to-market that shift as production volumes and requirements change.
Bitstream
A bitstream is the configuration file that programs an FPGA's logic cells, routing, and memory initialization. Generated by place-and-route tools from a synthesized design, bitstreams are loaded into the FPGA at power-up or runtime. Bitstream formats are typically proprietary and encrypted for security, though efforts like Project Icestorm have reverse-engineered formats for some devices. Bitstream size ranges from kilobytes for small FPGAs to hundreds of megabytes for the largest devices, with corresponding configuration times from milliseconds to seconds.
Block RAM
Block RAM (BRAM) refers to dedicated memory blocks within FPGAs, as opposed to distributed RAM implemented in LUT resources. BRAMs provide efficient, dual-ported memory for buffers, FIFOs, and data storage, with typical sizes of 18-36 kilobits per block. Total BRAM capacity ranges from hundreds of kilobits in small FPGAs to over 100 megabits in large devices. FPGA designs must carefully balance on-chip memory usage between BRAM and distributed RAM, considering access patterns, ports needed, and total capacity requirements.
CGRA
Coarse-Grained Reconfigurable Arrays use larger, more complex processing elements than FPGAs, enabling faster reconfiguration at the cost of flexibility. While FPGA reconfiguration takes milliseconds to seconds, CGRAs can switch configurations in nanoseconds to microseconds. This enables time-multiplexing of hardware resources across different operations. CGRAs suit applications with regular computational patterns like neural networks and signal processing. Research architectures like CGRA-ME and commercial implementations explore this space between fixed ASICs and fully flexible FPGAs.
DSP Block
DSP blocks are hard-wired multiply-accumulate units in FPGAs, optimized for signal processing and machine learning operations. A single DSP block can perform a multiply-add per clock cycle at much lower area and power than equivalent soft logic. Modern FPGAs contain thousands of DSP blocks supporting various precisions from INT8 to floating-point. AI accelerator designs heavily utilize DSP blocks for matrix operations, making DSP count a key specification. The fixed precision options in DSP blocks influence the data types used in FPGA-based neural network implementations.
eFPGA
Embedded FPGAs are FPGA intellectual property blocks that can be integrated into ASICs or SoCs, providing reconfigurable logic within otherwise fixed designs. eFPGAs enable post-silicon flexibility for protocol updates, algorithm customization, and bug fixes without full chip redesign. The technology appeals to communications, aerospace, and automotive applications requiring adaptability over long product lifetimes. Companies like Achronix, Flex Logix, and Menta offer eFPGA IP. Integration requires careful attention to interface design and may impact surrounding logic timing.
FPGA
Field-Programmable Gate Arrays are integrated circuits that can be configured after manufacturing to implement custom digital logic. FPGAs contain arrays of programmable logic blocks, memory, and DSP units connected by configurable routing. This flexibility enables hardware customization without the cost and time of custom chip development. In AI, FPGAs accelerate inference with custom precision formats and dataflows optimized for specific models. Xilinx (now AMD) and Intel (Altera) dominate the FPGA market, with chips ranging from small IoT devices to massive data center accelerators.
HDL
Hardware Description Languages like Verilog and VHDL describe digital circuit behavior and structure for simulation and synthesis. Unlike software programming languages that describe sequential execution, HDLs describe concurrent hardware operations. Verilog dominates in North America and ASIC design, while VHDL is common in Europe and aerospace/defense. Learning HDL requires shifting from software thinking to hardware thinking—understanding that all statements effectively execute simultaneously. Modern tools support SystemVerilog, which adds software-like verification features to Verilog.
High-Level Synthesis
High-Level Synthesis (HLS) tools convert C, C++, or other high-level language descriptions into hardware designs, dramatically reducing development time compared to manual RTL coding. HLS enables software developers to create hardware accelerators without deep hardware expertise. Major tools include Xilinx Vitis HLS, Intel HLS Compiler, and Catapult HLS. While HLS has improved significantly, results still typically trail hand-optimized RTL in area and performance. Effective HLS use requires understanding how code constructs map to hardware and applying appropriate pragmas and directives.
LUT
Look-Up Tables are the fundamental building blocks of FPGA logic, implementing arbitrary boolean functions by storing output values for all input combinations. A k-input LUT can implement any boolean function of k variables using 2^k memory bits. Modern FPGAs use 6-input LUTs with additional features like fractured mode for multiple smaller functions. LUT count is a primary measure of FPGA size, though actual design capacity depends on how efficiently synthesis tools can map logic. Additional hard blocks (DSPs, memory) handle functions that would consume excessive LUTs.
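The stored-truth-table idea can be shown directly in Python (a conceptual sketch, not a vendor bitstream format): enumerate all 2^k input combinations of a Boolean function once, then evaluation is a single table lookup.

```python
# Build and evaluate a k-input LUT for an arbitrary Boolean function.
def make_lut(func, k):
    # Entry i holds the output for the input combination encoded by i's bits.
    return [func(*(((i >> b) & 1) for b in range(k))) for i in range(2 ** k)]

def lut_eval(lut, *inputs):
    index = sum(bit << position for position, bit in enumerate(inputs))
    return lut[index]

def majority3(a, b, c):                  # example function: 3-input majority
    return int(a + b + c >= 2)

lut = make_lut(majority3, 3)             # 2**3 = 8 stored output bits
assert lut_eval(lut, 1, 1, 0) == 1
assert lut_eval(lut, 1, 0, 0) == 0
```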
RTL
Register-Transfer Level is an abstraction for describing digital circuits in terms of data flow between registers and the combinational logic that transforms data. RTL designs are expressed in hardware description languages like Verilog or VHDL and form the input to synthesis tools that map designs to FPGA or ASIC implementations. Writing efficient RTL requires understanding both the algorithm and target hardware constraints. The gap between software programming and RTL design is a major barrier to custom hardware acceleration that high-level synthesis tools aim to bridge.

Memory

Crossbar Array
Crossbar arrays arrange memory cells at intersections of horizontal and vertical wires, enabling parallel access and efficient analog matrix operations. For AI applications, crossbar arrays can perform matrix-vector multiplication in O(1) time by applying input voltages to rows and measuring output currents from columns. The current through each cell is proportional to its conductance (stored weight) times applied voltage, and column currents sum contributions from all rows. This analog computation potentially offers massive efficiency gains but faces challenges from noise, variability, and limited precision.
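The column-current arithmetic, plus the analog non-idealities mentioned above, can be modeled in a few lines (array size, conductance range, and noise levels are illustrative):

```python
# Ideal vs. noisy analog matrix-vector product in a resistive crossbar.
import numpy as np

rng = np.random.default_rng(1)
G = rng.uniform(1e-6, 1e-4, size=(128, 64))   # cell conductances (siemens) = weights
V = rng.uniform(0.0, 0.2, size=128)           # row input voltages (volts)

ideal_I = V @ G                               # column currents: I_j = sum_i V_i * G_ij

# Device-to-device variation and read noise corrupt the analog result.
G_actual = G * rng.normal(1.0, 0.05, G.shape)       # 5% conductance variation
noisy_I = V @ G_actual + rng.normal(0.0, 1e-7, 64)  # additive read noise

error = np.abs(noisy_I - ideal_I) / np.abs(ideal_I)
print(f"median column-current error: {np.median(error):.1%}")
```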
CXL
Compute Express Link is an open interconnect standard built on the PCIe physical layer, enabling coherent memory sharing between CPUs, accelerators, and memory expansion devices. CXL addresses the growing memory capacity and bandwidth demands of AI and analytics workloads by allowing memory pooling and disaggregation. Three protocol types support different use cases: CXL.io for I/O devices, CXL.cache for accelerator caching, and CXL.mem for memory expansion. Major players including Intel, AMD, and memory vendors are driving CXL adoption in next-generation data centers.
DRAM
Dynamic RAM stores bits in capacitors that require periodic refresh to maintain data, achieving higher density than SRAM at the cost of complexity and latency. DRAM forms the main memory in virtually all computing systems. DDR5 is the current generation for CPUs, while GDDR6 and HBM serve graphics and AI accelerators. DRAM scaling faces increasing challenges from capacitor physics at smaller geometries, driving interest in emerging memory technologies. The DRAM market is dominated by Samsung, SK Hynix, and Micron.
FeRAM
Ferroelectric RAM stores data using polarization of ferroelectric materials, offering non-volatility with fast writes and high endurance. FeRAM has found niche applications in smart cards, RFID, and microcontrollers where its unique properties justify higher cost than Flash. Recent advances using hafnium oxide enable FeRAM integration in standard CMOS processes, potentially enabling embedded ferroelectric memory in advanced logic chips. The technology competes with MRAM and ReRAM for non-volatile embedded memory applications.
In-Memory Computing
In-memory computing processes data directly within memory arrays rather than moving it to separate processors, eliminating the memory bottleneck that limits conventional architectures. For AI workloads dominated by matrix operations, in-memory computing using analog crossbar arrays can achieve orders of magnitude improvement in energy efficiency. Data is stored as conductance values in resistive memory cells, and matrix-vector multiplication happens physically as currents sum according to Ohm's and Kirchhoff's laws. Challenges include noise, limited precision, and device variability.
Memory Disaggregation
Memory disaggregation separates memory resources from compute nodes, connecting them over a fabric like CXL or specialized networks. This architecture enables flexible resource allocation—scaling memory independently of compute—and higher utilization through memory pooling. Disaggregation particularly benefits workloads with variable memory requirements and multi-tenant environments. Challenges include increased latency compared to local memory and the need for software that can tolerate or hide remote memory access costs. The approach is gaining traction as CXL matures and memory costs drive efficiency optimization.
Memory Wall
The memory wall describes the growing disparity between processor speed and memory access time that limits system performance. While processor performance has improved exponentially, memory latency has improved much more slowly, making memory access increasingly expensive relative to computation. The memory wall drives architectural innovations including larger caches, HBM for bandwidth, near-memory and in-memory computing, and AI accelerator designs that maximize data reuse. For large AI models, memory bandwidth often determines performance more than raw compute capability.
MRAM
Magnetoresistive RAM uses magnetic tunnel junctions to store data, offering non-volatility, high speed, and practically unlimited endurance. Unlike Flash, which wears out after a limited number of write cycles, MRAM tolerates essentially unlimited writes at near-SRAM speeds. STT-MRAM (Spin-Transfer Torque MRAM) is the dominant commercial variant, used in embedded applications and positioned as a universal memory combining the best features of SRAM, DRAM, and Flash. Everspin, Samsung, and GlobalFoundries produce MRAM, with applications ranging from automotive to aerospace to industrial IoT.
PCM
Phase-Change Memory stores data by switching chalcogenide materials between crystalline (low resistance) and amorphous (high resistance) phases using heat from electrical current. Intel's Optane products (now discontinued) used 3D XPoint, a PCM variant, for persistent memory bridging the gap between DRAM and storage. PCM offers better density than MRAM with true non-volatility, but faces endurance and write speed challenges. The technology remains relevant for in-memory computing where multi-level cells can store analog weights for neural networks.
PIM
Processing-In-Memory places computational logic within or adjacent to memory arrays, reducing the energy and latency of data movement. Unlike in-memory computing that uses analog physics, PIM typically uses digital logic near memory. Samsung's HBM-PIM adds compute units to HBM memory stacks, while UPMEM offers PIM-enabled DRAM modules. PIM excels at memory-bound workloads like database operations and sparse neural networks. The approach faces challenges in programming models, software support, and balancing compute capability with memory-optimized processes.
ReRAM
Resistive RAM stores data by switching a dielectric material between high and low resistance states through applied voltage. ReRAM offers non-volatility, high density, and fast switching suitable for both storage and in-memory computing applications. In AI accelerators, ReRAM crossbar arrays can perform analog matrix multiplication, with resistance values encoding network weights. Samsung, TSMC, and startups are developing ReRAM, though variability and endurance challenges have slowed commercial adoption. The technology competes with other emerging memories like MRAM and PCM.
SRAM
Static RAM uses flip-flop circuits to store bits, providing fast, random access without refresh requirements. SRAM is faster but more expensive and less dense than DRAM, making it ideal for caches close to processors. Modern CPUs contain tens of megabytes of SRAM cache organized in multiple levels. In FPGAs, SRAM cells hold configuration bits and user data. SRAM's six-transistor cell limits density compared to DRAM's one-transistor-one-capacitor design, but eliminates refresh overhead and enables simpler, faster access.
STT-MRAM
Spin-Transfer Torque MRAM uses spin-polarized current to switch the magnetic orientation of storage layers, enabling smaller, more efficient cells than field-switched MRAM. STT-MRAM has emerged as the leading commercial MRAM technology, offering a compelling combination of non-volatility, speed, and endurance. Applications include last-level cache replacement, persistent memory, and embedded non-volatile storage. The technology scales well with process shrinks, and foundries including TSMC, Samsung, and GlobalFoundries offer STT-MRAM options in advanced nodes.
Unified Memory
Unified memory provides a single address space shared between CPU and GPU, eliminating explicit data copies between separate memory pools. NVIDIA's CUDA Unified Memory and Apple's unified memory architecture exemplify this approach. Unified memory simplifies programming by allowing both processors to access the same data, with the system handling data movement automatically. Performance depends on access patterns and how well the runtime predicts data needs. For optimal performance, programmers may still need to provide hints or explicitly manage data placement.

Quantum

Decoherence
Decoherence is the loss of quantum properties due to unwanted interaction with the environment, limiting the useful computation time of quantum computers. Environmental noise causes qubits to lose their quantum superposition and entanglement, eventually behaving classically. Coherence times range from microseconds for superconducting qubits to seconds for trapped ions, setting the window for quantum operations. Quantum error correction aims to overcome decoherence by continuously detecting and correcting errors faster than they accumulate.
Dilution Refrigerator
Dilution refrigerators cool superconducting quantum computers to temperatures around 15 millikelvin—colder than outer space—using the mixing of helium-3 and helium-4 isotopes. These sophisticated cryogenic systems cost hundreds of thousands to millions of dollars and require significant infrastructure for operation. The extremely low temperatures are necessary to suppress thermal noise that would otherwise destroy quantum coherence in superconducting circuits. Companies like Bluefors and Oxford Instruments supply dilution refrigerators to the quantum computing industry.
Entanglement
Quantum entanglement correlates qubits such that measuring one instantly affects the state of others, regardless of distance. Einstein famously called this "spooky action at a distance." Entanglement is essential for quantum computing, enabling quantum algorithms to outperform classical ones. Creating and maintaining entanglement between many qubits is technically challenging and distinguishes true quantum computers from classical simulators. Entanglement also enables quantum communication applications like quantum key distribution that are impossible classically.
NISQ
Noisy Intermediate-Scale Quantum (NISQ) describes the current era of quantum computers, characterized by 50-1000+ qubits without full error correction. NISQ devices can demonstrate quantum phenomena but are limited by noise that accumulates over computations. Research focuses on variational algorithms and applications that might tolerate NISQ limitations, such as quantum chemistry simulations and optimization. The NISQ era bridges the gap between small, research-scale devices and future fault-tolerant quantum computers capable of running complex algorithms like Shor's factoring.
Quantum Error Correction
Quantum error correction protects fragile quantum information from decoherence and gate errors by encoding logical qubits in multiple physical qubits. Unlike classical error correction, QEC must handle both bit flips and phase errors without measuring (and destroying) the quantum state. Surface codes are the leading approach, requiring roughly 1000 physical qubits per logical qubit with current error rates. Achieving fault-tolerant quantum computing with error-corrected logical qubits is the major milestone needed for practical quantum computers to tackle useful problems.
Quantum Gate
Quantum gates are basic operations on qubits, analogous to logic gates in classical computing but operating on quantum states. Single-qubit gates rotate qubit states on the Bloch sphere, while two-qubit gates like CNOT create entanglement. Any quantum computation can be decomposed into a universal gate set, typically single-qubit rotations plus one entangling gate. Physical implementation depends on qubit technology: microwave pulses for superconducting qubits, laser pulses for trapped ions. Gate fidelity—how accurately gates perform their intended operation—is a key metric for quantum processors.
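A tiny state-vector example shows the universal-gate idea in practice (a classical simulation, of course, not hardware): a Hadamard followed by a CNOT turns |00⟩ into the entangled Bell state.

```python
# Apply H on qubit 0, then CNOT (control 0, target 1), to the state |00>.
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])              # flips the target when the control is 1

state = np.array([1.0, 0.0, 0.0, 0.0])       # |00>
state = np.kron(H, I2) @ state               # Hadamard on the first qubit
state = CNOT @ state                         # entangling two-qubit gate
print(np.round(state, 3))                    # [0.707 0. 0. 0.707] = (|00> + |11>)/sqrt(2)
```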
Quantum Supremacy
Quantum supremacy (also called quantum advantage) marks the milestone where a quantum computer performs a calculation that is practically impossible for classical computers. Google claimed this achievement in 2019 with their Sycamore processor, completing a specific random circuit sampling task in 200 seconds that they estimated would take a classical supercomputer 10,000 years. Critics noted the narrow, artificial nature of the task. True practical quantum advantage for useful problems remains a goal, with applications in optimization, chemistry simulation, and cryptography being actively researched.
Qubit
A qubit (quantum bit) is the fundamental unit of quantum information, analogous to classical bits but with the remarkable property of superposition—existing in combinations of 0 and 1 simultaneously until measured. This property, combined with entanglement between qubits, enables quantum computers to explore exponentially many states in parallel. Physical implementations include superconducting circuits, trapped ions, photons, and neutral atoms, each with distinct advantages and challenges. Qubit quality is characterized by coherence time, gate fidelity, and connectivity to other qubits.
Superconducting Qubit
Superconducting qubits use circuits of superconducting materials cooled to near absolute zero (about 15 millikelvin), where electrical current flows without resistance and quantum effects become observable at macroscopic scales. The dominant approach for companies including IBM, Google, and Rigetti, superconducting qubits can be fabricated using modified semiconductor processes and achieve fast gate operations. Challenges include the need for expensive dilution refrigerators, limited coherence times, and scaling to many qubits while maintaining connectivity and control.
Superposition
Superposition is the quantum mechanical property allowing a qubit to exist in a combination of 0 and 1 states simultaneously, rather than being definitively one or the other until measured. Mathematically, a qubit state is described as α|0⟩ + β|1⟩, where α and β are complex amplitudes. Upon measurement, the superposition collapses to either 0 or 1 with probabilities |α|² and |β|². This property enables quantum algorithms to process exponentially many possibilities in parallel, though extracting useful results requires clever algorithm design.
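The measurement statistics follow directly from the amplitudes; the snippet below normalizes an arbitrary pair of complex amplitudes and samples outcomes (the amplitude values are illustrative):

```python
# Measurement probabilities for alpha|0> + beta|1>.
import numpy as np

alpha, beta = 1 + 1j, 2 - 0.5j                     # arbitrary complex amplitudes
norm = np.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
alpha, beta = alpha / norm, beta / norm            # enforce |alpha|^2 + |beta|^2 = 1

p0, p1 = abs(alpha) ** 2, abs(beta) ** 2
print(f"P(0) = {p0:.3f}, P(1) = {p1:.3f}")

rng = np.random.default_rng(0)                     # repeated measurements collapse
samples = rng.choice([0, 1], size=10_000, p=[p0, p1])
print(f"empirical P(1) over 10,000 shots: {samples.mean():.3f}")
```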
Trapped Ion
Trapped ion quantum computers use individual ions confined by electromagnetic fields as qubits, with quantum states encoded in electronic energy levels. Laser pulses manipulate qubits and create entanglement through their shared vibrational modes. IonQ and Quantinuum lead this approach, which offers excellent qubit quality with long coherence times and high gate fidelities. Challenges include slower gate speeds than superconducting systems and difficulties scaling to many ions in single traps, though modular architectures connecting multiple traps are being developed.

Manufacturing

Advanced Packaging
Advanced packaging technologies connect multiple dies with high-density interconnects, overcoming the limits of die scaling through integration at the package level. Techniques include 2.5D integration using silicon interposers, 3D stacking with through-silicon vias, and hybrid bonding for direct die-to-die connections. TSMC's CoWoS and InFO, Intel's EMIB and Foveros, and similar technologies enable the chiplet architectures increasingly used in AI accelerators. Advanced packaging has become as important as process technology for continued performance scaling.
Chiplet
Chiplets are modular chip designs connecting multiple smaller dies in a single package, rather than monolithic integration on one die. This approach improves manufacturing yield (smaller dies have fewer defects), enables mixing different process nodes and technologies, and allows flexible product configurations. AMD's Ryzen and EPYC CPUs and NVIDIA's latest multi-die accelerators use chiplet-style designs. Challenges include interconnect bandwidth and latency between chiplets, but advanced packaging technologies like those from TSMC are enabling chiplet designs to match or exceed monolithic performance.
Die
A die is an individual integrated circuit cut from a semiconductor wafer after fabrication. Die size affects manufacturing cost (larger dies have lower yield), power dissipation, and maximum transistor count. Modern AI accelerator dies approach the reticle limit (~800mm²), the largest pattern a lithography system can expose. The trend toward chiplets reflects the impracticality of further die size increases. Die-to-die connections, whether through interposers or advanced packaging, enable systems larger than single-die limits while maintaining yield and manufacturing economics.
EUV
Extreme Ultraviolet Lithography uses 13.5nm wavelength light (versus 193nm for previous generation) to pattern the finest chip features, enabling manufacturing at 7nm and below. EUV machines from ASML cost $150+ million and are essential for leading-edge production. The technology required decades of development to produce sufficient light intensity from tin droplets hit by lasers. Only TSMC, Samsung, and Intel operate EUV manufacturing at scale. The technology's complexity and cost concentrate advanced chipmaking among few players, with significant geopolitical implications.
FinFET
Fin Field-Effect Transistors revolutionized chip manufacturing at 22nm (Intel) and 14/16nm (foundries) by using a 3D fin-shaped channel wrapped by the gate on three sides. This structure provides better control over the channel than planar transistors, reducing leakage and enabling continued scaling. FinFET technology enabled another decade of Moore's Law progress. As scaling continues, the industry is transitioning to Gate-All-Around (GAA) transistors at 3nm and below, where the gate completely surrounds the channel for even better control.
Foundry
Semiconductor foundries manufacture chips designed by other companies, enabling the fabless business model that dominates the industry. TSMC, the world's largest foundry, produces chips for Apple, NVIDIA, AMD, and hundreds of other companies. Foundries invest billions annually to maintain leading-edge manufacturing capabilities. This specialization allows design companies to focus on innovation without the capital burden of fabs. The concentration of advanced manufacturing in Taiwan (TSMC) has become a significant geopolitical concern, driving investments in domestic production in the US, Europe, and Japan.
GAA
Gate-All-Around transistors represent the next evolution beyond FinFET, with the gate completely surrounding the channel for superior electrostatic control. Samsung began GAA production at 3nm, while TSMC transitions at 2nm. GAA implementations use stacked horizontal nanosheets or nanowires, enabling higher current drive and better scaling than FinFETs. The transition requires significant manufacturing process changes but is essential for continuing Moore's Law-style scaling. GAA transistors will power future generations of AI accelerators with improved performance and efficiency.
Interposer
An interposer is a silicon or organic substrate that sits between chips and package substrate, providing high-density wiring for chip-to-chip connections. Silicon interposers enable the fine-pitch connections needed for chiplet architectures and HBM integration. TSMC's CoWoS (Chip-on-Wafer-on-Substrate) uses silicon interposers to connect GPU dies with HBM stacks. Interposers add cost and complexity but enable system integration that would be impossible with direct package connections, making them essential for advanced AI accelerators.
Photomask
Photomasks are the master patterns used in lithography to transfer circuit designs onto silicon wafers. Modern chips require dozens of masks, each costing hundreds of thousands of dollars, with a complete set for advanced processes exceeding $10 million. Mask defects reproduce directly onto every wafer exposed, making mask quality critical. EUV lithography uses reflective masks because EUV light is strongly absorbed by the materials used in conventional transmissive masks. The high cost and lead time of mask sets is a major barrier to custom chip development.
Process Node
Process nodes (like 7nm, 5nm, 3nm) historically indicated minimum transistor feature size but now serve primarily as marketing generations with different definitions across foundries. Modern nodes involve complex 3D transistors where no single dimension equals the node name. Each generation brings higher transistor density, improved performance, and lower power consumption, enabling more powerful chips. Leading-edge manufacturing requires billions in investment and is limited to TSMC, Samsung, and Intel. Process node is critical for AI accelerators where transistor density directly impacts compute capability.
Tape-out
Tape-out is the final handoff of chip design data to the foundry for manufacturing, marking the transition from design to fabrication. The term originates from when designs were recorded on magnetic tape. A tape-out represents months to years of design work and commits millions of dollars in mask and manufacturing costs. Once taped out, design changes require a costly and time-consuming respin. The stakes make pre-tape-out verification critical—undetected bugs in silicon are orders of magnitude more expensive to fix than in simulation.
TSV
Through-Silicon Vias are vertical electrical connections that pass completely through silicon dies, enabling 3D stacking of chips with high-bandwidth connections between layers. TSVs are essential for HBM memory stacks and advanced packaging technologies like CoWoS that connect chiplets through silicon interposers. Creating TSVs requires etching deep holes through the silicon and filling them with conductive material, adding manufacturing complexity. The technology enables bandwidth densities impossible with traditional wire bonding or flip-chip connections.
Wafer
A wafer is a thin slice of semiconductor material, typically silicon, on which integrated circuits are fabricated. Modern wafers are 300mm (12 inches) in diameter, containing hundreds of processor dies that are later separated and packaged. Wafer processing involves hundreds of steps: deposition, lithography, etching, and doping, each requiring precise control. Wafer cost and yield (percentage of working dies) are major factors in chip economics. Cerebras's Wafer-Scale Engine uniquely uses an entire wafer as a single chip, keeping all communication on-wafer and tolerating fabrication defects through built-in redundancy rather than discarding bad dies.
Yield
Yield is the percentage of functional dies on a semiconductor wafer, directly impacting manufacturing economics. Defects from particles, process variations, and design issues can render dies non-functional. Larger dies have exponentially lower yields because they're more likely to contain defects. This yield relationship drives the chiplet trend: smaller dies have higher yields, and combining known-good chiplets achieves effective large-chip functionality without yield penalties. Leading-edge process yield improvements are closely guarded competitive advantages for foundries.
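The exponential relationship between die area and yield can be seen with the classic Poisson yield model, Y = e^(−A·D0). The sketch below uses an assumed defect density purely for illustration; real defect densities are closely guarded.
    import math

    defect_density = 0.1          # assumed defects per cm^2 (illustrative only)

    def poisson_yield(die_area_cm2: float, d0: float = defect_density) -> float:
        """Fraction of functional dies under a simple Poisson defect model."""
        return math.exp(-die_area_cm2 * d0)

    for area in (1.0, 4.0, 8.0):  # cm^2; ~8 cm^2 approaches the reticle limit
        print(f"{area:.0f} cm^2 die -> {poisson_yield(area):.1%} yield")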

AI/ML

Activation Function
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns beyond linear relationships. Without non-linearity, stacked layers would collapse to a single linear transformation. ReLU (Rectified Linear Unit) dominates due to its simplicity and effectiveness, though variants like GeLU and SiLU perform better in transformers. The choice of activation affects training dynamics, gradient flow, and computational requirements. Hardware accelerators often include optimized implementations of common activation functions.
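A quick NumPy sketch of two activations mentioned above; the GELU here uses the widely cited tanh approximation rather than the exact Gaussian CDF form.
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    x = np.linspace(-3, 3, 7)
    print(relu(x))
    print(gelu(x))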
Attention Mechanism
The attention mechanism allows neural networks to dynamically focus on relevant parts of input when producing output, weighting the importance of different elements based on learned relationships. Self-attention, the core of Transformers, computes attention between all positions in a sequence, enabling the model to capture long-range dependencies. The computational cost scales quadratically with sequence length, driving research into efficient attention variants like flash attention, sparse attention, and linear attention. Understanding attention is essential for optimizing Transformer models on AI accelerators.
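A compact NumPy sketch of single-head scaled dot-product self-attention; the quadratic cost appears in the (seq_len × seq_len) score matrix. Shapes and weights are illustrative.
    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Single-head scaled dot-product attention over a sequence X (seq_len x d_model)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (seq_len, seq_len): quadratic in length
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 16))                    # 8 tokens, model width 16
    Wq, Wk, Wv = (rng.standard_normal((16, 16)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)          # (8, 16)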
Backpropagation
Backpropagation efficiently computes gradients for neural network training by applying the chain rule backwards through the network. Starting from the loss, gradients flow back through layers, enabling each parameter to know its contribution to the error. This algorithm, combined with gradient descent, made deep learning practical. Backpropagation requires storing intermediate activations from the forward pass (memory-intensive for large models) and is inherently sequential through layers (limiting parallelization). Techniques like gradient checkpointing trade computation for memory in memory-constrained settings.
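A one-parameter example of the chain rule at work: for the loss L = (wx + b − t)², the gradient flows backwards from the loss through the prediction to w and b.
    # Forward pass
    w, b, x, t = 0.5, 0.1, 2.0, 1.0
    y = w * x + b                  # prediction
    loss = (y - t) ** 2

    # Backward pass (chain rule)
    dL_dy = 2 * (y - t)            # dL/dy
    dL_dw = dL_dy * x              # dy/dw = x
    dL_db = dL_dy * 1.0            # dy/db = 1
    print(loss, dL_dw, dL_db)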
Batch Size
Batch size determines how many samples are processed together in one forward/backward pass during training or inference. Larger batches improve hardware utilization and throughput but require more memory and can affect model convergence. Training large models uses small per-device batch sizes with gradient accumulation to achieve large effective batches across many accelerators. For inference, batching requests together amortizes fixed costs and improves throughput, though it increases latency for individual requests. Optimal batch size depends on hardware, model size, and latency requirements.
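A sketch of gradient accumulation in PyTorch: several micro-batches contribute gradients before a single optimizer step, simulating a larger effective batch. The model and data here are placeholders.
    import torch
    import torch.nn as nn

    model = nn.Linear(32, 1)                        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    accum_steps = 4                                 # effective batch = 4 x micro-batch

    optimizer.zero_grad()
    for step in range(16):
        x = torch.randn(8, 32)                      # micro-batch of 8 samples
        target = torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(x), target) / accum_steps
        loss.backward()                             # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()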
Dropout
Dropout is a regularization technique that randomly sets a fraction of activations to zero during training, preventing co-adaptation of features and reducing overfitting. During inference, dropout is disabled and activations are scaled to maintain expected values. While effective for smaller models, dropout is less commonly used in modern large language models which are typically undertrained relative to their capacity. The technique demonstrates how adding noise during training can improve generalization, a principle that appears in various forms across deep learning.
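A sketch of "inverted" dropout as most frameworks implement it: activations are zeroed with probability p during training and scaled by 1/(1−p), so no rescaling is needed at inference.
    import numpy as np

    def dropout(x, p=0.1, training=True):
        if not training or p == 0.0:
            return x                                   # inference: identity
        mask = (np.random.rand(*x.shape) >= p)         # keep each unit with prob 1-p
        return x * mask / (1.0 - p)                    # rescale to preserve expected value

    acts = np.ones((2, 8))
    print(dropout(acts, p=0.5))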
Embedding
Embeddings are dense vector representations that capture semantic meaning, converting discrete tokens, words, or entities into continuous numerical form that neural networks can process. Word embeddings like Word2Vec demonstrated that vector arithmetic captures semantic relationships (king - man + woman ≈ queen). Modern LLMs use learned embeddings for input tokens and produce embeddings that can be used for retrieval, classification, and similarity search. Embedding models optimized for semantic similarity power RAG systems and recommendation engines.
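The "vector arithmetic" intuition can be checked with cosine similarity; the vectors below are tiny made-up stand-ins, not real Word2Vec embeddings.
    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Toy 4-d "embeddings" chosen only to illustrate the analogy arithmetic
    king  = np.array([0.9, 0.8, 0.1, 0.2])
    man   = np.array([0.1, 0.8, 0.1, 0.1])
    woman = np.array([0.1, 0.1, 0.8, 0.1])
    queen = np.array([0.9, 0.1, 0.8, 0.2])

    analogy = king - man + woman
    print(cosine(analogy, queen))   # close to 1.0 -> nearest neighbor would be "queen"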
Fine-tuning
Fine-tuning adapts a pre-trained model to specific tasks or domains by continuing training on targeted data. Rather than training from scratch, fine-tuning leverages learned representations while adjusting them for new requirements. Techniques range from full fine-tuning (updating all parameters) to parameter-efficient methods like LoRA that update only small adapter modules. Fine-tuning has become the primary way organizations customize foundation models for their applications, requiring far less data and compute than pre-training while achieving strong task-specific performance.
Gradient Descent
Gradient descent is the fundamental optimization algorithm for training neural networks, iteratively adjusting parameters to minimize a loss function. The gradient indicates the direction of steepest increase; moving opposite to it reduces loss. Stochastic gradient descent (SGD) computes gradients on mini-batches rather than the full dataset, enabling tractable training on large datasets. Modern optimizers like Adam combine gradient information with momentum and adaptive learning rates for faster convergence. Understanding gradient behavior is essential for training stable, high-performing models.
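A minimal gradient-descent loop on a one-dimensional quadratic loss, showing the update rule θ ← θ − η·∇L(θ).
    def loss(theta):
        return (theta - 3.0) ** 2          # minimum at theta = 3

    def grad(theta):
        return 2.0 * (theta - 3.0)

    theta, lr = 0.0, 0.1
    for step in range(50):
        theta -= lr * grad(theta)          # step against the gradient
    print(theta, loss(theta))              # theta approaches 3.0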
Inference
Inference runs trained neural network models to make predictions on new data, in contrast to training which learns model parameters. Inference has different computational requirements than training: lower precision often suffices (INT8 vs FP16/FP32), batch sizes may be smaller, and latency often matters more than throughput. The inference market is larger and more diverse than training, spanning data centers, edge devices, and embedded systems. Specialized inference accelerators optimize for these characteristics, achieving better efficiency than training-focused hardware for production deployment.
KV Cache
The key-value cache stores computed attention keys and values from previous tokens during autoregressive LLM generation, avoiding redundant computation as the sequence grows. KV cache memory grows linearly with sequence length and batch size, often becoming the limiting factor for serving long-context models. Techniques like PagedAttention in vLLM optimize KV cache memory management, while methods like Multi-Query Attention and Grouped-Query Attention reduce KV cache size architecturally. Understanding KV cache behavior is essential for optimizing LLM inference cost and throughput.
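A NumPy sketch of why caching helps: at each generation step only the new token's key and value are computed and appended, while attention still looks over the whole cache. Shapes and projection matrices are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    Wk, Wv, Wq = (rng.standard_normal((d, d)) for _ in range(3))
    k_cache, v_cache = [], []

    def decode_step(new_token_hidden):
        """Append this token's K/V, then attend over everything cached so far."""
        k_cache.append(new_token_hidden @ Wk)
        v_cache.append(new_token_hidden @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)     # cache grows linearly with length
        q = new_token_hidden @ Wq
        scores = K @ q / np.sqrt(d)
        w = np.exp(scores - scores.max()); w /= w.sum()
        return w @ V

    for _ in range(5):                                   # five autoregressive steps
        out = decode_step(rng.standard_normal(d))
    print(len(k_cache), out.shape)                       # 5 cached entries, (16,)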
Latency
Latency measures the time from request to response, critical for interactive AI applications like chatbots and real-time processing. For LLM inference, time-to-first-token (TTFT) affects perceived responsiveness while time-per-output-token determines generation speed. Latency optimization often conflicts with throughput optimization: batching improves throughput but increases individual request latency. Low-latency inference requires fast accelerators, optimized serving software, and sometimes specialized hardware like Groq's LPU designed specifically for latency-sensitive LLM serving.
LLM
Large Language Models are neural networks trained on massive text datasets to understand and generate human language. Models like GPT-4, Claude, and LLaMA contain billions to trillions of parameters and demonstrate remarkable capabilities including reasoning, coding, and creative writing. Training LLMs requires enormous compute resources—thousands of GPUs for months—while inference can run on single high-end GPUs or be distributed across many devices. LLMs have driven unprecedented demand for AI accelerators and reshaped the computing industry around their requirements for memory capacity, bandwidth, and specialized compute.
LoRA
Low-Rank Adaptation (LoRA) enables efficient fine-tuning by freezing pre-trained weights and training small low-rank decomposition matrices. Instead of updating all parameters, LoRA adds trainable pairs of small matrices to attention layers, reducing trainable parameters by 10,000x while matching full fine-tuning performance. The technique enables fine-tuning large language models on consumer GPUs and allows multiple lightweight adapters to be switched at inference time. LoRA and variants like QLoRA (quantized base model with LoRA) have democratized LLM customization.
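A sketch of the core LoRA idea: the frozen weight W is augmented with a low-rank product B·A, and only A and B are trained, following the common W x + (α/r)·B A x formulation. Dimensions are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d_out, d_in, r, alpha = 512, 512, 8, 16

    W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
    A = rng.standard_normal((r, d_in)) * 0.01     # trainable, rank r
    B = np.zeros((d_out, r))                      # trainable, initialized to zero

    def lora_forward(x):
        return W @ x + (alpha / r) * (B @ (A @ x))

    # Per-layer ratio only; across a full multi-billion-parameter model the savings are far larger
    full_params, lora_params = W.size, A.size + B.size
    print(f"trainable: {lora_params} vs {full_params} ({full_params / lora_params:.0f}x fewer)")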
Mixed Precision
Mixed precision training uses multiple numerical formats—typically FP16 for forward/backward passes and FP32 for weight updates—to accelerate training while maintaining accuracy. The technique leverages faster low-precision operations in tensor cores while preventing numerical issues through loss scaling and maintaining master weights in FP32. Modern frameworks automatically handle mixed precision with a few lines of configuration. BF16 (bfloat16) has gained popularity for its wider dynamic range than FP16, often eliminating the need for loss scaling.
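A sketch of the standard PyTorch automatic mixed precision recipe: autocast runs the forward pass in low precision while GradScaler applies loss scaling. The model and data are placeholders, and a CUDA-capable GPU is assumed.
    import torch
    import torch.nn as nn

    device = "cuda"                                  # assumes a CUDA-capable GPU
    model = nn.Linear(1024, 1024).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(10):
        x = torch.randn(32, 1024, device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(x).pow(2).mean()            # forward pass runs in FP16
        scaler.scale(loss).backward()                # scaled loss prevents gradient underflow
        scaler.step(optimizer)                       # unscales, then updates FP32 master weights
        scaler.update()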
Normalization
Normalization techniques stabilize neural network training by controlling the distribution of activations and gradients. Batch normalization normalizes across the batch dimension, while layer normalization (standard in transformers) normalizes across features. RMSNorm, a simplified variant, is increasingly popular in modern architectures. Normalization enables training deeper networks with higher learning rates by preventing activation values from exploding or vanishing. The placement and type of normalization significantly impacts model behavior and hardware efficiency.
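A NumPy sketch of RMSNorm, which rescales each feature vector by its root-mean-square and a learned gain, skipping the mean subtraction that LayerNorm performs.
    import numpy as np

    def rms_norm(x, gain, eps=1e-6):
        # x: (..., features); normalize over the last dimension
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return x / rms * gain

    x = np.random.randn(4, 8)
    gain = np.ones(8)                      # learned per-feature scale
    print(rms_norm(x, gain).std(axis=-1))  # roughly unit scale per row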
Quantization
Quantization reduces neural network precision from 32-bit floats to lower-precision formats (FP16, INT8, INT4) to decrease memory usage and bandwidth requirements and to improve computational efficiency. Post-training quantization applies to trained models with some accuracy loss, while quantization-aware training maintains accuracy by modeling quantization during training. Modern accelerators include hardware support for low-precision formats, making quantization essential for efficient deployment. Aggressive 4-bit quantization enables running large language models on consumer hardware that couldn't otherwise fit them in memory.
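A sketch of simple symmetric post-training quantization to INT8: a per-tensor scale maps floats into [−127, 127], and dequantization multiplies the scale back. Real toolchains add per-channel scales and calibration.
    import numpy as np

    def quantize_int8(x):
        scale = np.max(np.abs(x)) / 127.0                     # per-tensor scale
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"memory: {w.nbytes} -> {q.nbytes} bytes, mean abs error {err:.5f}")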
RAG
Retrieval-Augmented Generation combines large language models with external knowledge retrieval to improve accuracy and reduce hallucinations. Instead of relying solely on knowledge encoded in model weights, RAG systems query vector databases or search engines for relevant information, then provide this context to the LLM for generating responses. This approach enables models to access current information, cite sources, and handle domain-specific knowledge without retraining. RAG has become a standard architecture for enterprise AI applications where accuracy and verifiability are essential.
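A minimal sketch of the retrieve-then-generate flow, using a toy in-memory "vector store" and placeholder embed() and generate() functions standing in for whatever embedding model and LLM endpoint a real system would use.
    import numpy as np

    # Toy corpus with made-up embeddings; a real system would embed with a trained model
    docs = ["HBM stacks DRAM dies vertically.", "PCIe 5.0 offers 64 GB/s per x16 slot."]
    doc_vecs = np.random.default_rng(0).standard_normal((len(docs), 32))

    def embed(text):                       # placeholder embedding function
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(32)

    def generate(prompt):                  # placeholder for a real LLM call
        return f"[LLM answer grounded in a prompt of {len(prompt)} chars]"

    def rag_answer(question, k=1):
        q = embed(question)
        scores = doc_vecs @ q              # similarity search over the store
        context = "\n".join(docs[i] for i in np.argsort(scores)[-k:])
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        return generate(prompt)

    print(rag_answer("How fast is PCIe 5.0?"))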
Throughput
Throughput measures how much work a system completes per unit time—tokens per second for LLMs, images per second for vision models. Maximizing throughput reduces cost per inference but may increase individual request latency. Batch processing, request batching, and continuous batching techniques maximize throughput in serving systems. Training throughput (samples per second across all accelerators) determines how quickly models can be trained. The throughput-latency tradeoff is fundamental to AI system design, with different applications prioritizing different points on this spectrum.
Training
Training a neural network iteratively adjusts millions to trillions of parameters to minimize prediction errors on training data. The process involves forward passes computing predictions, loss calculation, and backward passes computing gradients that guide parameter updates. Training large models like GPT-4 requires thousands of GPUs running for months at costs exceeding $100 million. Training demands high-precision arithmetic (FP16/FP32), massive memory bandwidth, and sophisticated distributed computing across many accelerators. The extreme resource requirements have concentrated frontier AI research among well-funded organizations.
Transformer
The Transformer architecture, introduced in 2017's "Attention Is All You Need" paper, has revolutionized AI through its attention mechanism that models relationships between all parts of input sequences. Unlike earlier sequential models, Transformers process entire sequences in parallel, enabling efficient training on modern accelerators. Transformers power large language models (GPT, Claude, LLaMA), vision models (ViT), and multimodal systems. The architecture's success has reshaped AI hardware requirements, with accelerators optimized for the matrix operations and memory access patterns characteristic of Transformer inference and training.

Interconnect

Collective Operations
Collective operations are communication patterns used in distributed computing where multiple processes participate in data exchange. All-reduce (combining values from all nodes and distributing the result back) is critical for synchronizing gradients in data-parallel training. All-gather collects data from all nodes, while reduce-scatter combines values across nodes and leaves each node holding one partition of the result. Efficient collective implementations using techniques like ring all-reduce or tree algorithms minimize communication overhead. Libraries like NCCL (NVIDIA) and RCCL (AMD) provide optimized collective operations for multi-GPU systems.
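A single-process simulation of what all-reduce computes: every rank ends up with the element-wise sum of all ranks' buffers. Real libraries such as NCCL implement this with ring or tree schedules over the interconnect rather than a central reduction.
    import numpy as np

    def all_reduce_sum(rank_buffers):
        """Return the buffer every rank would hold after an all-reduce (sum)."""
        total = np.sum(rank_buffers, axis=0)             # reduce step
        return [total.copy() for _ in rank_buffers]      # broadcast result back to all ranks

    grads = [np.full(4, fill_value=r, dtype=np.float32) for r in range(4)]  # 4 ranks
    synced = all_reduce_sum(grads)
    print(synced[0])   # [6. 6. 6. 6.] on every rank (0+1+2+3)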
Ethernet
Ethernet is the ubiquitous networking standard, evolving from 10 Mbps to 800 Gbps while maintaining backward compatibility. In AI data centers, high-speed Ethernet (100-400 GbE) competes with InfiniBand for cluster interconnect. Ethernet's advantages include broader vendor ecosystem, lower cost, and operational familiarity. Disadvantages include higher latency and less mature RDMA support compared to InfiniBand. Ultra Ethernet Consortium efforts aim to close the gap for AI workloads. The choice between Ethernet and InfiniBand involves tradeoffs between performance, cost, and operational complexity.
InfiniBand
InfiniBand is a high-throughput, low-latency networking standard that dominates AI cluster interconnects. NVIDIA's (formerly Mellanox) ConnectX adapters and Quantum switches provide up to 400 Gb/s per port with microsecond latencies. InfiniBand's RDMA capability enables direct memory access between nodes without CPU involvement, essential for efficient distributed training. While Ethernet is ubiquitous, InfiniBand's performance advantages make it the preferred choice for demanding AI workloads where communication overhead significantly impacts training time.
NVLink
NVLink is NVIDIA's proprietary high-bandwidth interconnect for GPU-to-GPU communication, offering 7-10x the bandwidth of PCIe. The latest NVLink 4.0 in H100 systems provides 900 GB/s bidirectional bandwidth per GPU. NVLink enables efficient scaling across multiple GPUs for large model training, keeping data movement off the slower PCIe bus. Combined with NVSwitch, NVLink creates fully-connected GPU clusters where any GPU can communicate directly with any other. This interconnect advantage is key to NVIDIA's dominance in multi-GPU AI training systems.
NVSwitch
NVSwitch is NVIDIA's high-bandwidth switch ASIC that enables all-to-all GPU connectivity in multi-GPU systems. A single NVSwitch connects up to 8 GPUs with full bisection bandwidth, and multiple switches create fully-connected topologies in larger systems. The DGX H100 uses NVSwitch to provide 900 GB/s connectivity between any pair of its 8 GPUs, critical for efficient distributed training where all-reduce operations dominate communication. NVSwitch is a key differentiator for NVIDIA's data center platforms, enabling scaling that would be impossible with PCIe alone.
PCIe
PCI Express is the standard interface connecting accelerators to CPUs in servers and workstations. PCIe 5.0 offers 64 GB/s per x16 slot, with PCIe 6.0 doubling this. While sufficient for many workloads, PCIe bandwidth limits multi-GPU scaling, motivating proprietary interconnects like NVLink. PCIe is also the foundation for CXL, which adds coherency protocols for memory sharing. Understanding PCIe bandwidth limitations is important for system design—many accelerators can saturate PCIe connections, making local memory capacity and inter-accelerator links crucial for performance.
RDMA
Remote Direct Memory Access enables direct memory transfer between computers without involving their CPUs or operating systems, achieving low latency and high throughput. RDMA is essential for distributed AI training where gradient synchronization between nodes would otherwise bottleneck on CPU-mediated networking. InfiniBand natively supports RDMA, while RoCE (RDMA over Converged Ethernet) brings similar capabilities to Ethernet networks. RDMA semantics allow one-sided operations where the initiator can read or write remote memory without involving the remote CPU.
UCIe
Universal Chiplet Interconnect Express (UCIe) is an open standard for die-to-die connections, enabling chiplets from different vendors to interoperate. Backed by major players including Intel, AMD, ARM, and TSMC, UCIe defines physical and protocol layers for chiplet integration. The standard aims to create a chiplet ecosystem similar to how PCIe standardized board-level connectivity. UCIe supports bandwidths exceeding 1 TB/s per mm of die edge, enabling high-performance multi-chiplet designs. Standardization could accelerate chiplet adoption by reducing integration complexity and allowing designers to mix chiplets from multiple vendors.

Edge

Edge Computing
Edge computing processes data near its source rather than in centralized cloud data centers, reducing latency, bandwidth costs, and privacy concerns. Edge AI deploys neural network inference on local devices from smartphones to industrial sensors to autonomous vehicles. The edge encompasses a vast range of devices with different power budgets (milliwatts to tens of watts), from microcontrollers running TinyML to edge servers with discrete accelerators. The diversity of edge requirements has spawned a rich ecosystem of specialized AI chips targeting different application domains and power envelopes.
SoC
System on Chip designs integrate CPU, GPU, memory controller, AI accelerator, and peripheral interfaces on a single die or package. SoCs dominate mobile devices and are increasingly important in automotive, IoT, and edge AI applications. Apple's M-series chips and Qualcomm's Snapdragon exemplify high-performance SoCs with integrated neural processing units. SoC design requires balancing diverse requirements across components while meeting power, thermal, and cost constraints. The trend toward AI everywhere is making neural processing capabilities a standard SoC feature alongside traditional components.
TinyML
TinyML brings machine learning to microcontrollers and ultra-low-power embedded devices with power budgets of milliwatts or even microwatts. Applications include wake word detection, anomaly detection, and predictive maintenance on battery-powered sensors. TinyML requires aggressive model optimization: quantization to INT8 or below, pruning, and architecture search for tiny models. Frameworks like TensorFlow Lite Micro and specialized hardware from companies like Syntiant enable AI on devices that previously couldn't support it. TinyML is expanding the reach of AI to billions of embedded devices.

Power

Performance Per Watt
Performance per watt measures computational efficiency—how much useful work a system delivers relative to its power consumption. For AI, this might be expressed as TOPS/W or tokens per second per watt. This metric has become critical as AI workloads scale to consume megawatts in data centers and must fit within milliwatts at the edge. Efficiency improvements come from architectural innovation, process technology advances, and specialization for AI workloads. The AI hardware competition increasingly focuses on efficiency alongside raw performance, especially for inference where operational costs dominate.
PUE
Power Usage Effectiveness measures data center efficiency as the ratio of total facility power to IT equipment power. A PUE of 1.0 would mean all power goes to computing with no overhead; typical data centers range from 1.2-2.0, with hyperscalers achieving 1.1-1.2. AI workloads with 700W GPUs stress cooling systems and drive facilities toward liquid cooling for better PUE. As AI clusters grow to hundreds of megawatts, PUE improvements translate to significant cost savings and environmental impact reduction.

Cloud

AI Cluster
AI clusters are interconnected groups of accelerators—typically thousands of GPUs—designed for large-scale model training. Building effective clusters requires not just powerful accelerators but high-bandwidth networking, efficient cooling, robust power infrastructure, and sophisticated software for distributed training. NVIDIA's DGX SuperPOD and similar offerings provide integrated solutions. The largest clusters contain tens of thousands of GPUs consuming tens of megawatts. Cluster design involves careful attention to network topology, failure handling, and efficiency at scale.
GPU Cloud
GPU cloud services provide on-demand access to GPU computing resources, enabling organizations to train and deploy AI models without owning hardware. Major providers (AWS, Azure, GCP) offer various GPU instances from single-GPU VMs to multi-node clusters with high-speed interconnects. Specialized providers like CoreWeave, Lambda, and RunPod focus specifically on GPU workloads, often with newer hardware and AI-focused features. The GPU cloud market faces supply constraints from NVIDIA chip shortages, driving up prices and wait times for premium instances.
Hyperscaler
Hyperscalers are the largest cloud providers—Amazon Web Services, Microsoft Azure, Google Cloud, and others—operating data centers at massive scale with millions of servers. These companies are major consumers and increasingly designers of AI accelerators, developing custom chips (Google TPU, Amazon Trainium, Microsoft Maia) to reduce dependence on NVIDIA. Hyperscaler infrastructure decisions shape the AI hardware market, as they purchase significant fractions of global GPU production. Their custom silicon efforts could eventually compete with NVIDIA's dominance in AI computing.
Inference Endpoint
Inference endpoints are API services that host and serve machine learning models, handling scaling, load balancing, and availability. Cloud providers offer managed inference services (AWS SageMaker, Azure ML, Vertex AI) that simplify deployment, while platforms like Replicate and Together AI provide model-specific endpoints. Self-hosted options using vLLM, TGI, or Triton Inference Server give more control. The choice involves tradeoffs between ease of use, cost, latency, and customization. Inference endpoint architecture significantly impacts serving cost and user experience.
Liquid Cooling
Liquid cooling has become necessary for high-power AI accelerators where air cooling cannot remove enough heat. Direct-to-chip liquid cooling circulates coolant through cold plates on GPUs and CPUs, while immersion cooling submerges entire servers in dielectric fluid. NVIDIA's highest-power SXM accelerators are increasingly deployed with direct liquid cooling, which supports higher sustained power limits than air-cooled configurations. The transition to liquid cooling requires significant data center infrastructure investment but enables higher power densities and more efficient operation for AI workloads.
Model Serving
Model serving encompasses the infrastructure and software for deploying trained models to handle inference requests in production. Challenges include optimizing latency and throughput, managing model versions, handling variable load through auto-scaling, and ensuring reliability. Serving systems like NVIDIA Triton, TensorFlow Serving, and TorchServe provide frameworks for deployment. For LLMs, specialized serving systems (vLLM, TGI, TensorRT-LLM) optimize for autoregressive generation. Model serving architecture significantly impacts both user experience through latency and operational costs through hardware efficiency.
Orchestration
Orchestration systems manage the deployment, scaling, and operation of distributed applications across compute clusters. Kubernetes has become the standard for container orchestration, with extensions like Kubeflow for ML workflows. AI training jobs require specialized orchestration for multi-node GPU scheduling, fault tolerance, and resource management. Tools like Ray, Slurm, and cloud-specific services coordinate distributed training across thousands of accelerators. Effective orchestration is essential for efficiently using expensive GPU resources and managing complex training pipelines.
Spot Instance
Spot instances provide cloud compute at discounts of 60-90% compared to on-demand pricing, using spare capacity that can be reclaimed with short notice. For fault-tolerant AI workloads like distributed training with checkpointing, spot instances dramatically reduce costs. The tradeoff is unpredictable interruptions requiring checkpoint/restart capability. Spot markets exist across major clouds with varying pricing models and interruption rates. Effective use of spot instances requires workload architecture that handles preemption gracefully, typically through frequent checkpointing and elastic scaling.

Other

ARM
ARM (originally Acorn RISC Machine) is the dominant processor architecture for mobile devices and increasingly for servers and AI edge deployment. Unlike Intel's x86, ARM licenses its architecture to other companies who design custom implementations, enabling the diverse ecosystem of mobile SoCs. ARM's power efficiency makes it ideal for battery-powered devices and has driven adoption in data centers where operational costs matter. NVIDIA's acquisition attempt failed on regulatory grounds, but ARM remains central to computing from smartphones to AWS's Graviton servers to Apple's M-series chips.
Benchmark
Benchmarks are standardized tests for measuring and comparing hardware or software performance. In AI, MLPerf has emerged as the industry standard, with separate benchmark suites for training and inference across various model types. Benchmarks enable apples-to-apples comparisons but can also be gamed through narrow optimizations that don't translate to real workloads. Understanding benchmark methodology and limitations is essential for interpreting performance claims. Real application performance often differs from benchmarks due to differences in batch sizes, precision requirements, and software optimization.
MLPerf
MLPerf is the industry-standard benchmark suite for AI hardware and software performance, run by the MLCommons organization. Separate benchmarks cover training (time to train standard models to target accuracy) and inference (throughput and latency for serving trained models). MLPerf results are submitted by hardware and cloud vendors, enabling comparisons across GPUs, TPUs, and specialized accelerators. The benchmarks evolve to reflect current AI workloads, adding large language model tests as LLMs dominate the industry. MLPerf results inform purchasing decisions worth billions of dollars annually.
RISC-V
RISC-V is an open-source instruction set architecture that enables custom processor designs without licensing fees or restrictions. Unlike proprietary architectures from ARM and Intel, RISC-V allows anyone to design compatible processors. This openness has sparked a wave of innovation in custom AI accelerators, where companies can add specialized instructions for neural network operations. Major players including Google, Alibaba, and Western Digital use RISC-V, and the architecture is gaining traction from microcontrollers to data center chips. RISC-V represents a fundamental shift toward open hardware in computing.

Data Formats

BF16
BFloat16 is a 16-bit floating-point format that preserves FP32's dynamic range (8 exponent bits) while reducing precision (7 mantissa bits vs 23 in FP32). Developed by Google for TPUs and now supported across NVIDIA, AMD, and Intel hardware, BF16 enables mixed-precision training without the loss scaling required for FP16. The format's dynamic range matches FP32, preventing the overflow issues that plague FP16 in gradient calculations. BF16 has become the preferred training format for large language models and is increasingly used for inference as well.
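The relationship between FP32 and BF16 can be seen with a little bit manipulation: dropping the low 16 bits of an FP32 value yields its BF16 counterpart, keeping the sign and all 8 exponent bits but only 7 mantissa bits. (Hardware converters round to nearest even; truncation is shown for simplicity.)
    import numpy as np

    def fp32_to_bf16_truncate(x: np.ndarray) -> np.ndarray:
        """Truncate FP32 values to BF16 precision (returned as FP32 for inspection)."""
        bits = x.astype(np.float32).view(np.uint32)
        bits &= np.uint32(0xFFFF0000)          # drop the low 16 mantissa bits
        return bits.view(np.float32)

    x = np.array([3.14159265, 1e-8, 65504.0], dtype=np.float32)
    print(fp32_to_bf16_truncate(x))            # same range as FP32, coarser precision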
FP16
FP16 (16-bit floating point, half precision) halves memory usage and doubles throughput compared to FP32 on hardware with native support. Tensor Cores and similar units accelerate FP16 operations by 2-8x over FP32. However, FP16's limited dynamic range (5 exponent bits vs 8 in FP32) can cause overflow or underflow in gradients during training, requiring loss scaling techniques. Despite these challenges, FP16 became the workhorse of mixed-precision training and remains widely used for inference where its precision is sufficient for most models.
FP32
FP32 (32-bit floating point) is the standard precision for scientific computing and neural network training. Following the IEEE 754 standard, FP32 provides approximately 7 decimal digits of precision with a wide dynamic range. While sufficient for most calculations, FP32 is increasingly replaced by lower-precision formats in AI to improve performance and reduce memory usage. Modern training often uses FP32 only for gradient accumulation and weight updates, with forward and backward passes in FP16 or BF16. FP32 remains important as the reference precision for accuracy validation.
FP8
FP8 (8-bit floating point) is an emerging format for AI that promises 2x memory and bandwidth efficiency over FP16. Two variants exist: E4M3 (4 exponent, 3 mantissa bits) for forward passes and E5M2 (5 exponent, 2 mantissa bits) for gradients. NVIDIA's H100 introduced hardware FP8 support, enabling training and inference with minimal accuracy loss for many models. FP8 is particularly valuable for large language model inference, where memory bandwidth typically limits performance. Adoption requires careful quantization and may need model-specific tuning for optimal results.
INT4
INT4 (4-bit integer) quantization aggressively compresses neural networks to enable large language models on consumer hardware. Techniques like GPTQ, AWQ, and GGML have made 4-bit quantization practical for LLM inference with acceptable quality trade-offs. At 4 bits, a 70-billion parameter model fits in 35GB instead of 140GB, enabling local deployment on high-end consumer GPUs. However, INT4 requires more sophisticated quantization algorithms than INT8, and some models are more sensitive to the precision reduction than others. The format is primarily used for inference, not training.
INT8
INT8 (8-bit integer) quantization has become standard for neural network inference, offering 4x memory reduction and significantly faster computation versus FP32. Most modern AI accelerators heavily optimize INT8 operations, achieving peak throughput on this format. Post-training quantization can convert FP32 models to INT8 with minimal accuracy loss for many architectures. Calibration using representative data helps map floating-point ranges to integer values. INT8 inference enables deployment on resource-constrained devices and reduces serving costs in data centers.
Sparsity
Sparsity exploits the observation that many neural network weights and activations are zero or near-zero, enabling computation and memory optimizations. Structured sparsity (e.g., 2:4 pattern where 2 of every 4 elements are zero) is supported by hardware like NVIDIA's Ampere Tensor Cores, potentially doubling throughput. Unstructured sparsity offers higher compression but is harder to accelerate. Pruning techniques create sparse networks during or after training, while sparse attention patterns reduce computational costs in Transformers. Sparsity-aware hardware and algorithms are increasingly important for efficient AI.
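A sketch of the 2:4 structured sparsity pattern: within every group of four weights, only the two largest-magnitude values are kept, which is the layout sparsity-capable tensor cores can exploit.
    import numpy as np

    def prune_2_of_4(w):
        """Zero the 2 smallest-magnitude weights in every group of 4 along the last dim."""
        groups = w.reshape(-1, 4).copy()
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # indices of the 2 smallest |w|
        np.put_along_axis(groups, drop, 0.0, axis=1)
        return groups.reshape(w.shape)

    w = np.random.randn(2, 8)
    print(prune_2_of_4(w))        # exactly half the entries in every 4-wide group are zero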

Software

DeepSpeed
DeepSpeed is Microsoft's deep learning optimization library focused on enabling efficient training of massive models. Its ZeRO (Zero Redundancy Optimizer) technique partitions optimizer states, gradients, and parameters across data-parallel workers, dramatically reducing memory requirements. DeepSpeed has enabled training of models with trillions of parameters and popularized techniques like ZeRO-Offload that utilize CPU memory and NVMe storage. The library integrates with PyTorch and has become essential infrastructure for organizations training large-scale models.
Flash Attention
Flash Attention is an algorithm that computes exact attention 2-5x faster and with much less memory by restructuring computation to maximize GPU memory hierarchy efficiency. Standard attention materializes large intermediate matrices that overflow fast SRAM into slow HBM; Flash Attention tiles the computation to keep data in SRAM. The algorithm has become essential for training and inference of long-context language models, enabling context lengths that would otherwise exhaust GPU memory. Flash Attention 2 and subsequent versions continue improving performance across different hardware.
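The core trick, an online softmax computed over tiles of keys and values so the full score matrix is never materialized, can be sketched in NumPy. This is only the math; real Flash Attention fuses it into GPU kernels that keep each tile in SRAM.
    import numpy as np

    def tiled_attention(Q, K, V, block=64):
        """Exact attention computed tile-by-tile with running max/sum per query row."""
        n, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        out = np.zeros_like(V, dtype=np.float64)
        row_max = np.full(n, -np.inf)            # running max of logits per query
        row_sum = np.zeros(n)                    # running softmax denominator
        for start in range(0, K.shape[0], block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            S = (Q @ Kb.T) * scale               # logits for this tile only
            new_max = np.maximum(row_max, S.max(axis=1))
            correction = np.exp(row_max - new_max)   # rescale previous accumulators
            P = np.exp(S - new_max[:, None])
            row_sum = row_sum * correction + P.sum(axis=1)
            out = out * correction[:, None] + P @ Vb
            row_max = new_max
        return out / row_sum[:, None]

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
    print(tiled_attention(Q, K, V).shape)        # (256, 64), identical to standard attention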
Hugging Face
Hugging Face has become the central hub for the machine learning community, providing the Transformers library, model hub, datasets, and collaborative tools. The Transformers library offers a unified API for thousands of pre-trained models across NLP, vision, and audio domains. The model hub hosts hundreds of thousands of models that can be downloaded and fine-tuned with a few lines of code. Hugging Face's influence on democratizing access to state-of-the-art AI cannot be overstated—it has become essential infrastructure for AI development and deployment.
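The "few lines of code" claim in practice, using the Transformers pipeline API; the checkpoint name is just an example of a small public model and is downloaded from the hub on first run.
    from transformers import pipeline

    # "distilgpt2" is a small example checkpoint; any hub text-generation model works
    generator = pipeline("text-generation", model="distilgpt2")
    result = generator("Dataflow architectures execute operations when", max_new_tokens=20)
    print(result[0]["generated_text"])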
JAX
JAX is Google's functional numerical computing library that combines NumPy's interface with automatic differentiation, JIT compilation, and seamless parallelization across accelerators. Unlike PyTorch's object-oriented style, JAX uses a functional programming paradigm where transformations compose cleanly. JAX's XLA compiler backend achieves excellent performance on TPUs and GPUs. The framework has gained popularity for large-scale research at Google DeepMind and other labs, particularly for work requiring sophisticated automatic differentiation or custom parallel algorithms. JAX represents a modern approach to differentiable programming.
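A small taste of JAX's composable transformations: grad differentiates a pure function and jit compiles the result with XLA, and the two compose cleanly.
    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)      # pure function of its inputs

    grad_fn = jax.jit(jax.grad(loss))          # compose: differentiate, then JIT-compile

    w = jnp.zeros(3)
    x = jnp.ones((8, 3))
    y = jnp.ones(8)
    print(grad_fn(w, x, y))                    # gradient with respect to the first argument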
Megatron
Megatron is NVIDIA's framework for training large transformer models by combining tensor parallelism, pipeline parallelism, and data parallelism. The framework demonstrated that scaling transformers to billions of parameters improves performance on downstream tasks, helping spark the large language model revolution. In Megatron-style training, tensor parallelism splits individual weight matrices within a layer across GPUs, pipeline parallelism assigns groups of layers to different GPUs and streams micro-batches through the stages, and data parallelism replicates the whole model across additional devices. These parallelization strategies have become standard approaches for training frontier AI models on thousands of accelerators.
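A single-process sketch of column-parallel tensor parallelism: the weight matrix is split across "GPUs" by columns, each shard computes its slice independently, and the outputs are concatenated (in real systems this gather happens over NVLink or InfiniBand).
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, n_gpus = 16, 32, 4

    X = rng.standard_normal((8, d_in))                 # activations (batch of 8)
    W = rng.standard_normal((d_in, d_out))             # full weight, conceptually

    shards = np.split(W, n_gpus, axis=1)               # each "GPU" holds d_out/n_gpus columns
    partials = [X @ Ws for Ws in shards]               # computed independently per GPU
    Y = np.concatenate(partials, axis=1)               # all-gather of the column slices

    print(np.allclose(Y, X @ W))                       # True: same result as the unsplit layer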
ONNX
Open Neural Network Exchange (ONNX) is an open format for representing machine learning models, enabling interoperability between frameworks. A model trained in PyTorch can be exported to ONNX and deployed using TensorRT, OpenVINO, or other inference runtimes. ONNX defines a common set of operators and a standard file format, though not all operations translate perfectly between frameworks. The ONNX Runtime from Microsoft provides cross-platform inference with optimizations for different hardware. ONNX has become essential infrastructure for production ML deployment across diverse hardware targets.
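A sketch of the usual PyTorch-to-ONNX path via torch.onnx.export; the model here is a placeholder and the output file name is arbitrary.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
    dummy_input = torch.randn(1, 32)                    # example input fixes shapes for tracing

    torch.onnx.export(
        model, dummy_input, "toy_classifier.onnx",      # arbitrary output path
        input_names=["features"], output_names=["logits"],
        dynamic_axes={"features": {0: "batch"}},        # allow a variable batch dimension
    )
    # The .onnx file can then be loaded by ONNX Runtime, TensorRT, OpenVINO, and others.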
PyTorch
PyTorch is the dominant deep learning framework for research and increasingly for production, known for its intuitive Python-first design and dynamic computation graphs. Developed by Meta AI, PyTorch allows defining and modifying neural networks on-the-fly, simplifying debugging and experimentation. The framework's eager execution mode runs operations immediately, unlike TensorFlow's original graph-based approach. PyTorch has become the default choice for academic research and startup AI development, with strong support across GPU vendors through backends like ROCm for AMD and Intel extensions.
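Eager execution in a nutshell: operations run immediately and the graph for backpropagation is built on the fly, so ordinary Python control flow and print debugging just work. The toy network below is illustrative.
    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(16, 32)
            self.fc2 = nn.Linear(32, 1)

        def forward(self, x):
            h = torch.relu(self.fc1(x))
            return self.fc2(h)

    net = TinyNet()
    x = torch.randn(4, 16)
    out = net(x)                 # runs immediately (eager mode)
    out.sum().backward()         # autograd builds and traverses the graph dynamically
    print(net.fc1.weight.grad.shape)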
TensorFlow
TensorFlow, developed by Google Brain, is a comprehensive machine learning platform supporting everything from mobile deployment to distributed training across thousands of TPUs. Originally using static computation graphs that were difficult to debug but efficient to execute, TensorFlow 2.0 added eager execution similar to PyTorch. The ecosystem includes TensorFlow Lite for mobile/edge, TensorFlow.js for browsers, and TensorFlow Serving for production deployment. While PyTorch dominates research, TensorFlow remains important in production environments, especially within Google's ecosystem.
TensorRT
TensorRT is NVIDIA's SDK for high-performance deep learning inference, optimizing trained models for deployment on NVIDIA GPUs. TensorRT applies graph optimizations (layer fusion, precision calibration, kernel auto-tuning) to maximize throughput and minimize latency. The optimizer can reduce models to INT8 or FP16 precision with calibration-based quantization. TensorRT achieves significant speedups over framework-native inference, making it essential for production deployment on NVIDIA hardware. Integration with TensorFlow and PyTorch through converters and the ONNX pathway enables broad model support.
Triton
Triton is an open-source language and compiler for writing highly efficient GPU kernels, developed by OpenAI. Unlike CUDA, which requires explicit management of memory hierarchy and thread synchronization, Triton uses a block-based programming model that abstracts these details while enabling performance approaching hand-tuned kernels. Triton has accelerated development of custom GPU operations for LLMs, including flash attention implementations. The language is becoming increasingly important as researchers need to implement novel algorithms that aren't available in standard libraries.
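The canonical flavor of a Triton kernel, following the well-known vector-add tutorial pattern: each program instance handles one block of elements, with masked loads and stores instead of manual thread and shared-memory management. A CUDA-capable GPU is assumed.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                      # which block this instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                      # guard the tail of the array
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)                   # one program per 1024-element block
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

    # a = torch.randn(4096, device="cuda"); assert torch.allclose(add(a, a), a + a)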
vLLM
vLLM is an open-source LLM serving library that achieves high throughput through PagedAttention, a technique that manages KV cache memory like virtual memory pages. By eliminating memory fragmentation in the key-value cache used during autoregressive generation, vLLM can serve more concurrent requests on the same GPU memory. The library supports continuous batching, quantization, and distributed inference across multiple GPUs. vLLM has become a popular choice for self-hosted LLM deployment, achieving throughput competitive with proprietary solutions.
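A sketch of offline batched generation with vLLM's Python API; the checkpoint name is just an example of a small hub model.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")              # small example model from the hub
    params = SamplingParams(temperature=0.8, max_tokens=64)

    prompts = ["Explain PagedAttention in one sentence.",
               "Why does batching improve GPU utilization?"]
    for output in llm.generate(prompts, params):      # continuous batching under the hood
        print(output.outputs[0].text)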

Explore the Companies

Now that you understand the terminology, discover the companies building these technologies.