Weekly Digest Week 21
- id: f712d176-06d0-46b6-9a66-afaf9742d2be
%%f712d176-06d0-46b6-9a66-afaf9742d2be_start%%
Networks Are Under AI Pressure: Can Cilium Provide Relief? - Isovalent
In this blog post, we explore why some of the largest AI companies and services are using Cilium and Isovalent.
Highlights
OpenAI’s Kubernetes infrastructure already exceeded 7,500 nodes. ⤴️ ^ec37bed5
7,500 nodes at a minimum of 8 vCPUs each would still add up to a huge TDP wattage! Underscores how chip advancement and green energy will be instrumental to our AI future.
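A quick sketch of that wattage intuition. The 7,500-node count is from the post; the per-node draw and the PUE below are my own assumptions for illustration, not figures from the article, and GPU power is deliberately left out:

```python
# Back-of-envelope: power draw of a 7,500-node fleet.
# Assumptions (not from the post): ~200 W average per 8-vCPU node,
# datacenter PUE of 1.2, and no GPU power counted at all.

NODES = 7_500
WATTS_PER_NODE = 200   # assumed average per-node draw
PUE = 1.2              # assumed power usage effectiveness

it_load_mw = NODES * WATTS_PER_NODE / 1e6
facility_mw = it_load_mw * PUE

print(f"IT load:        {it_load_mw:.2f} MW")
print(f"With PUE {PUE}:   {facility_mw:.2f} MW")
# ~1.5 MW of IT load, ~1.8 MW at the wall -- before a single GPU is added.
```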
highly reliable networks ⤴️ ^1731b227
The Linux networking stack would soon become the bottleneck when it comes to training models across GPUs.
Single CPU + single GPU -> bottleneck is PCIe x16 bandwidth + memory lanes.
Single CPU + multi GPU -> bottleneck is PCIe bandwidth, interrupts and DMA, soft IRQs.
Multi-node CPU+GPU -> bottleneck is all of the above + NIC + kernel + interconnect. Even with InfiniBand links across nodes, high-throughput networking in the kernel is needed.
Imagine: CUDA/PyTorch/Keras optimized in userland, InfiniBand/NICs offering the highest-Gbps interconnect, but the kernel conking out at moving tensors across machines (see the back-of-envelope sketch below).
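The sketch below puts rough numbers on those links. The bandwidths are nominal theoretical peaks and the payload size is an assumption for illustration; real-world throughput is lower, which only strengthens the point that the kernel path has little headroom:

```python
# Back-of-envelope: time to move a tensor/gradient payload over the links
# named above. Bandwidths are nominal peaks (assumed for illustration).

PAYLOAD_GB = 10  # assumed per-step payload exchanged between machines

links_gbytes_per_s = {
    "PCIe 3.0 x16":        15.75,  # ~985 MB/s per lane * 16
    "PCIe 4.0 x16":        31.5,
    "100 Gbps NIC":        12.5,   # 100 Gb/s / 8
    "InfiniBand NDR 400G": 50.0,   # 400 Gb/s / 8
}

for link, bw in links_gbytes_per_s.items():
    seconds = PAYLOAD_GB / bw
    print(f"{link:<22} {seconds * 1000:7.1f} ms per {PAYLOAD_GB} GB")
# Once the wire is this fast, per-packet kernel overhead (copies, softirqs,
# iptables traversal) becomes the limiting factor -- the gap that eBPF-based
# datapaths like Cilium's aim to close.
```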
protect their intellectual property ⤴️ ^1fef791e
Only really applies in PaaS mode. I would argue training is done on a private VPC with bastion nodes,
not over the public, eavesdroppable internet.
So model theft is a difficult attack vector.
%%f712d176-06d0-46b6-9a66-afaf9742d2be_end%%