The Datacenter Series Part 3: Networking Systems

Crucible Capital · Kelly Greer, Meltem Demirors

Networking has become the binding constraint on AI cluster performance, and the industry is now fighting over the optical switching standard — Arista's XPO versus co-packaged optics — that will define the next decade of datacenter buildout.

Interconnect is the second-largest AI spend after chips, which is why Nvidia's 2019 Mellanox deal turned it into a systems vendor and why the next optical standard matters so much. Co-packaged optics promise density but break the existing servicing model: failed parts mean swapping whole switches, thermals are hard, and the vendor ecosystem isn't ready. Arista's XPO, announced at OFC 2026 with backing from 100+ vendors and Microsoft, bets that a larger pluggable with built-in cooling — one that inherits today's supply chains and repair workflows — will scale faster than CPO, much like Arista's EOS came to define the last decade of network operating systems.


claim

Networking is the second-largest line item in AI spend after chips themselves, and as clusters scale into the hundreds of thousands of GPUs, interconnect speed and stability — not raw compute — set the ceiling on what those clusters can actually deliver.

central 1.00

claim

The 2019 $6.9B Mellanox acquisition gave Nvidia control of the InfiniBand fabric tying thousands of GPUs into one logical supercomputer, transforming it from a chip vendor into a full systems vendor.

central 0.90

claim

Co-packaged optics face multiple barriers to scale: failures require swapping entire switches, vendor ecosystems are immature, standards are unsettled, and thermals are hard. These compound to make CPO adoption slow.

central 0.90

claim

Because XPO inherits existing pluggable workflows, supply chains, and vendor servicing, the authors expect it to scale faster than CPO. They liken Arista's XPO play to how its open EOS defined the last decade of network operating systems.

central 0.90

claim

eXtended Pluggable Optics, announced at OFC 2026 and shipping in 2027, use a larger pluggable form factor with built-in cold plates to replace 8 legacy modules and cut switch racks by 75%. Over 100 vendors have endorsed it, with Microsoft calling it 'an important milestone.'

central 0.85

Open

  • Will CPO's operational problems get solved fast enough to compete with XPO, or will XPO lock in the standard before CPO matures?
  • Does Microsoft's endorsement translate into other hyperscaler commitments, or is XPO's vendor coalition broad but shallow?

Pipeline

source kind · url
generated by · anthropic
candidates · 58 (selected 5)
embeddings

Sections

Candidate pool grouped by section. Selected candidates are bolded.

Considered candidates (53)

Below top-k · 53

  • evidence · Even the best training runs leave most compute idle · c 0.80

    xAI's 550k-GPU cluster reportedly ran at 11% utilization, and even Meta's highly optimized Llama 3 run hit only ~40% model FLOPs utilization (MFU), with networking and memory latency, not compute, accounting for most of the loss.

  • claim · 2018 was the year networking stopped being passive · c 0.80

    From 2000 to roughly 2018, networking was dragged along behind compute and just used whatever Ethernet was available; the arrival of billion-parameter models flipped this and made networking — alongside memory and power — the active constraint on compute advancement.

  • claim · GPT-3 created the modern scale-out problem · c 0.80

    GPT-3 needed 800GB–1TB+ of memory while a single A100 capped at 80GB, forcing distributed training across dozens of servers and turning cluster-wide synchronization into the central networking challenge.

  • evidence · Ultra Ethernet Consortium and Llama 3 proved Ethernet could compete · c 0.80

    Intel, AMD, Broadcom, Microsoft, Meta, Google and others formed UEC to bolt InfiniBand-like features onto Ethernet, and Meta's Llama 3 training on parallel Ethernet and InfiniBand clusters served as proof that InfiniBand's dominance was breakable.

  • mechanism · Copper's reach collapses as link speeds rise · c 0.80

    At 56 Gbps in 2011, passive copper DAC cables could carry signals 3–5 meters; by 100 Gbps EDR in 2014 that shrank to 1–2 meters — only rack-internal hops. Every speed doubling further compresses copper's useful distance.

  • claim · Networking is a power-per-bit problem · c 0.80

    Pushing more data through copper requires repeated amplification and re-driving, so power consumption rises brutally with speed. Optics carry photons over meters or kilometers at roughly fixed power, making fiber the only viable medium at 800 Gbps+.

  • claim · East-west is where all the innovation is happening · c 0.75

    The vast majority of networking R&D targets the east-west fabric because connecting compute nodes to each other is what gates training and inference throughput — Marvell has called scale-up interconnect the most strategically important opportunity in AI infrastructure.

  • context · Networking eats nearly 30% of cluster power · c 0.70

    Interconnect consumes roughly 30% of total cluster power, which in a world of scarce terrestrial power makes network power efficiency — not just bandwidth — a first-order design target.

  • claim · Co-packaged optics promise 5x power efficiency · c 0.70

    The successor to today's top-of-rack switches will use liquid-cooled silicon photonics co-packaged with the switch ASIC, delivering roughly 5x the power efficiency of pluggable transceivers — the central reason public markets are bidding up anything touching CPO.

  • mechanism · Host-staging forced every GPU packet through the CPU · c 0.70

    Early HPC networking routed data GPU → PCIe → CPU → NIC → network and back, adding latency and burning CPU cycles on every hop. This bottleneck is what later GPU networking innovations were designed to eliminate.

  • evidence · AlexNet proved more GPUs and data make better models · c 0.70

    In 2012 AlexNet's 60M-parameter network cut ImageNet top-5 error from 26.2% to 15.3%, convincing researchers that the path forward was multi-GPU training — and immediately hitting PCIe bandwidth limits.

  • mechanism · NVLink replaced the shared PCIe bus with a private GPU highway · c 0.70

    Nvidia's 2016 NVLink offered a dedicated GPU-to-GPU interconnect — 20 GB/s per link in Gen 1, 50 GB/s by 2018 — sidestepping the PCIe bandwidth constraint that throttled multi-GPU training.

  • implication · Inference demands a different fabric than training · c 0.70

    ChatGPT-style workloads are spiky, user-specific, and latency-sensitive rather than batch-synchronous, forcing operators toward RDMA-capable fabrics with very high bisection bandwidth and careful L2/L3 design for live serving.

  • claim · Nvidia's vertical lock-in spans silicon to all-reduce · c 0.70

    By 2023 Nvidia owned roughly 80% of AI cluster networking via GPUs, NVLink, InfiniBand, and NCCL — buying into training meant buying an Nvidia stack end to end.

  • mechanism · Co-packaged optics eliminate the last copper trace · c 0.70

    CPO moves the optical engine onto the switch package itself, removing the few centimeters of PCB copper that still burn significant power at 800 Gbps+, and delivers 30–40%+ lower interconnect power plus lower latency.

  • mechanism · A CPO failure means replacing the whole switch · c 0.70

    Unlike pluggable transceivers that swap in minutes, a CPO fault takes down the entire switch and requires hours to replace. Broadcom's 2024 Bailly generation introduced detachable sub-assemblies to address this serviceability problem.

  • mechanism · Nvidia is milking pluggables while seeding CPO · c 0.70

    Networking, including optical transceivers, is now 18% of Nvidia's revenue and beat estimates while compute didn't. Nvidia has every incentive to keep the pluggable cash cow flowing while quietly planting CPO seeds.

  • claim · LPO captures most of CPO's power benefit without the pain · c 0.70

    Linear Pluggable Optics swap the DSP on a transceiver for a linear transimpedance amplifier, delivering 70-80% of CPO's power savings while keeping hot-swappability, serviceability, and multi-vendor interop.

  • implication · Networking is dispersing just like the chip market · c 0.70

    The proliferation of form factors (CPO, XPO, OCS, LPO, plus disaggregated memory, in-network compute, free-space optics) mirrors what's happening at the chip level. Investors have to pick horses across a fragmenting stack rather than bet on a single standard.

  • claim · Networking is increasingly about physics, not protocols · c 0.65

    Much of modern networking work is about optimizing the physics of information using novel geometries and materials — squeezing copper out in favor of fiber, then pushing optics as close to the chip as possible.

  • mechanism · Optical Circuit Switching sets up dedicated light paths · c 0.65

    OCS replaces packet-by-packet electrical routing with a direct optical path held open for the duration of a communication, like a dedicated lane. The tradeoff is that it can't split traffic across paths, and reconfiguration takes milliseconds versus microseconds for electronic switches.

  • claim · Collective communication libraries are an AMD/Nvidia duopoly that breaks on MoE · c 0.65

    NCCL and RCCL orchestrate data movement across thousands of GPUs but struggle with heterogeneous clusters and Mixture-of-Experts models where token routing is dynamic. Meta forked NCCL into NCCLX, and the Ultra Ethernet Consortium has room to ship a hardware-agnostic alternative.

  • implication · Hyperscalers are vertically integrating networking alongside silicon · c 0.60

    Because power is the binding constraint, every hyperscaler is now building custom networking systems in parallel with custom chips to wring more tokens out of a fixed power budget.

  • mechanism · Anatomy of an AI networking fabric · c 0.60

    An AI network is a hierarchy of spine switches (rack-to-rack), top-of-rack leaf switches (intra-rack), and NICs on the chips themselves, arranged as a redundant fabric so isolated FLOP machines can talk fast enough that their output compounds.

  • mechanism · RDMA let memory talk directly to memory · c 0.60

    InfiniBand 1.0 in 2000 introduced Remote Direct Memory Access, which lets data move from one machine's memory to another's without involving the CPU — dramatically lowering latency and raising effective I/O bandwidth.

  • implication · AlexNet kicked off the AI interconnect arms race · c 0.60

    After AlexNet, clusters had to scale from a few GPUs per box to large accelerator pods, demanding high-bandwidth low-latency fabrics inside datacenters and driving latency-sensitive traffic between them.

  • evidence · Jensen bet a third of Nvidia's cash on networking · c 0.60

    Mellanox was 10x larger than any prior Nvidia acquisition and consumed a third of cash on hand, signaling Jensen's conviction that owning the network — not just the GPU — was essential to compete in future compute.

  • context · Inference is now the majority of compute · c 0.60

    Inference grew from a third of compute in 2023 to two-thirds today, and co-design of network, software, models, and hardware now yields meaningful gains — pushing AI systems toward fracturing into specialized shapes.

  • caveat · CPO adoption has lagged despite the obvious wins · c 0.60

    Tencent deployed Broadcom CPO at hyperscale back in 2021, but only Meta has tested it heavily, and Amazon plans CPO in Trainium 4. Google and Microsoft are pursuing optical circuit switching and XPO instead, suggesting CPO is harder to deploy than the spec sheet implies.

  • evidence · CPO prototypes still cost 2-3x traditional switches · c 0.60

    The vendor ecosystem for CPO is far behind pluggable optics, and current prototypes run two to three times the cost of traditional switches. Manufacturing scale-up remains expensive.

  • evidence · Jensen publicly backs CPO but won't say it on earnings · c 0.60

    At GTC 2025 and 2026 Jensen positioned silicon photonics and CPO as essential to stadium-scale datacenters, and Nvidia invested in Coherent. Yet he didn't mention CPO once on the Q1 2026 earnings call.

  • claim · Hardware-agnostic network observability is the open pain point · c 0.60

    SONiC is hard to deploy and used mostly by hyperscalers; Arista's EOS and CloudVision only work on Arista hardware. There's no AI-native, vendor-neutral telemetry tooling, which the authors call an acute gap.

  • implication · Smarter NICs could eliminate the top-of-rack switch · c 0.55

    Next-generation NICs are absorbing routing intelligence directly onto the card, which would let designers drop the top-of-rack switch entirely and pack more GPUs into each rack.

  • context · InfiniBand was purpose-built to close the CPU-interconnect gap · c 0.55

    The InfiniBand Trade Association was formed in 1999 by Intel, Microsoft, IBM, and others to merge competing fabric efforts and address the widening gap between fast CPUs and slow PCI-bus I/O — explicitly as an alternative to Ethernet for high-performance systems.

  • evidence · Only Google runs OCS at scale today · c 0.55

    Google deploys OCS in its Jupiter fabric and sources the hardware from Lumentum and Coherent. Startups like iPronics, Omnitron Sensors, and Salience Labs are entering, but the industry is waiting on Google's results before broader adoption.

  • evidence · RoCE congestion costs 30% of Ethernet performance · c 0.55

    Rigid RDMA ordering in RoCE clusters costs 30% of performance. The Ultra Ethernet Consortium's June 2025 spec begins to address this, but better programmable congestion management is still needed.

  • evidence · Nvidia claims 1,000,000x inference efficiency gain in six generations · c 0.55

    Tokens per watt for inference has risen a millionfold across six Nvidia architecture generations. Even so, compute supply remains constrained and hardware advances only partially close the gap.

  • context · Shannon's bit-and-packet abstraction still defines the stack · c 0.50

    Shannon's 1948 framing — encode anything as bits, move them over any medium, treat meaning as irrelevant to the transport — remains the unchanged foundation of modern networking even as the underlying physics has gone from relays to nanoscale CMOS.

  • example · The Quantum-X800 marks the end of an era · c 0.50

    Nvidia's Quantum-X800 — 800 Gbps, Quantum-3 ASICs, air-cooled, with in-network compute for reductions — represents the state of the art in pluggable-transceiver InfiniBand and likely the last generation before co-packaged optics take over.

  • context · Two networks: north-south and east-west · c 0.50

    A compute cluster has a north-south network connecting it to storage and the outside world, and an east-west network connecting GPUs to GPUs (scale-up) and servers to servers (scale-out) inside the building.

  • context · Ethernet was never designed for tight compute coupling · c 0.50

    Ethernet was born at Xerox PARC in 1973 to connect computers to printers — cheap, ubiquitous, and built for bursty internet traffic, not the synchronized communication that GPU clusters require.

  • context · HPC scientific workloads were the first GPU adopters · c 0.50

    Highly parallel scientific and engineering workloads — Monte Carlo, FFTs, matrix multiplication — were already CPU- and budget-constrained, making them the natural first customers for GPU acceleration via CUDA.

  • mechanism · GPUDirect cut the CPU out of GPU-to-NIC traffic · c 0.50

    Nvidia's 2011 GPUDirect let GPUs talk directly to InfiniBand NICs, removing the CPU from the path but still bounded by PCIe bandwidth inside the server.

  • caveat · InfiniBand HDR at 200 Gbps was already the bottleneck · c 0.50

    The scale-out answer in 2021 was InfiniBand HDR at 200 Gbps per port, but it was not fast enough for thousands of GPUs that needed to stay in lockstep, foreshadowing the next networking wave.

  • mechanism · Heat-sensitive lasers force awkward external designs · c 0.50

    CPO lasers degrade under heat, so newer designs move them outside the package. This solves the thermal problem but adds assembly and supply chain complexity.

  • caveat · LPOs still lag on bandwidth · c 0.50

    LPOs top out at 100-200G per lane while legacy DSP infrastructure already runs at 800G, which has limited Arista's adoption. In today's market bandwidth trumps every other consideration.

  • context · The Optical Scale Up Consortium is hedging across form factors · c 0.50

    In March 2026, AMD, Broadcom, Meta, Microsoft, Nvidia, and OpenAI formed a consortium that pledges support for pluggable, on-board, and co-packaged optics simultaneously. The industry isn't picking a single winner yet.

  • implication · Model architecture alternatives matter because hardware can't fully catch up · c 0.50

    Because hardware progress alone won't satisfy insatiable compute demand against physical limits, alternatives to autoregressive architectures that use compute more efficiently are an active area of interest.

  • claim · Some abstractions transcend implementation · c 0.40

    Crucible's infrastructure thesis rests on the observation that bits-packets-applications has survived every material substrate change since the 1940s, and bets on layers fundamental enough to outlast their current physical implementation.

  • context · CUDA turned GPUs into general-purpose compute · c 0.40

    Before CUDA, using a GPU for non-graphics work meant expressing math through textures and shaders; CUDA's 2006 release made GPUs C-programmable and opened the door to the scientific and financial workloads that eventually demanded AI-scale networks.

  • mechanism · AOCs slipped fiber in without changing the chips · c 0.40

    Active Optical Cables presented the same electrical interface to switches and servers while hiding fiber inside, letting operators swap out copper for optics without redesigning networking silicon.

  • example · Modern clusters layer copper, AOCs, and long-reach optics · c 0.40

    A typical GPU cluster uses copper DAC for the 1–3m server-to-ToR hop, AOC fiber to spine/leaf switches tens of meters away, and long-reach optical transceivers between buildings — each tier dictated by physics, not preference.

  • context · CPO standards are still being written in real time · c 0.40

    Thermal specs, mechanical interfaces, and other baseline standards for CPO are not yet stable. The Optical Internetworking Forum is establishing them as deployments happen.
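
Several candidates above quote power figures that can be chained into a rough estimate: interconnect at roughly 30% of cluster power, CPO cutting interconnect power by 30–40%, and LPO capturing 70–80% of CPO's savings. The Python sketch below multiplies them through. It assumes the three figures share one baseline and compose multiplicatively, which the source does not state; the function and constant names are illustrative only.

```python
# Rough cluster-level power savings, chained from figures quoted in the candidates.
# Assumption (not stated in the source): the percentages share one baseline
# and compose multiplicatively.

INTERCONNECT_SHARE = 0.30            # interconnect share of total cluster power
CPO_INTERCONNECT_CUT = (0.30, 0.40)  # low/high CPO reduction in interconnect power
LPO_CAPTURE = (0.70, 0.80)           # low/high share of CPO's savings that LPO delivers


def cluster_saving(interconnect_share: float, interconnect_cut: float) -> float:
    """Fraction of total cluster power saved by cutting interconnect power."""
    return interconnect_share * interconnect_cut


# (low, high) savings as a fraction of total cluster power
cpo_saving = tuple(cluster_saving(INTERCONNECT_SHARE, cut) for cut in CPO_INTERCONNECT_CUT)
lpo_saving = tuple(saving * capture for saving, capture in zip(cpo_saving, LPO_CAPTURE))

print(f"CPO: {cpo_saving[0]:.1%} to {cpo_saving[1]:.1%} of cluster power")
print(f"LPO: {lpo_saving[0]:.1%} to {lpo_saving[1]:.1%} of cluster power")
```

Under these assumptions CPO would trim about 9–12% of total cluster power and LPO about 6–10%. The separate "5x power efficiency" claim is a different metric and is not used here.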

Janitor

Non-content spans (acknowledgements, references, footnotes, headers, boilerplate) are dropped before the decomposition runs.

total spans · 10
kept · 129
dropped · 1
  • content · 9
  • noise · 1