Weekly Digest Week 21
- id: f712d176-06d0-46b6-9a66-afaf9742d2be
%%f712d176-06d0-46b6-9a66-afaf9742d2be_start%%
Networks Are Under AI Pressure: Can Cilium Provide Relief? - Isovalent
In this blog post, we explore why some of the largest AI companies and services are using Cilium and Isovalent.
Highlights
OpenAI’s Kubernetes infrastructure already exceeded 7,500 nodes. ⤴️ ^ec37bed5
7,500 nodes at a minimum of 8 vCPUs each would still add up to a huge TDP wattage! Underscores how chip advancement and green energy will be instrumental to our AI future.
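A quick sketch of that wattage intuition. The 7,500-node count is from the post; the per-node draw and the PUE below are my own assumptions for illustration, not figures from the article, and GPU power is deliberately left out:

```python
# Back-of-envelope: power draw of a 7,500-node fleet.
# Assumptions (not from the post): ~200 W average per 8-vCPU node,
# datacenter PUE of 1.2, and no GPU power counted at all.

NODES = 7_500
WATTS_PER_NODE = 200   # assumed average per-node draw
PUE = 1.2              # assumed power usage effectiveness

it_load_mw = NODES * WATTS_PER_NODE / 1e6
facility_mw = it_load_mw * PUE

print(f"IT load:        {it_load_mw:.2f} MW")
print(f"With PUE {PUE}:   {facility_mw:.2f} MW")
# ~1.5 MW of IT load, ~1.8 MW at the wall -- before a single GPU is added.
```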
highly reliable networks ⤴️ ^1731b227
The Linux networking stack would soon become the bottleneck when it comes to training models across GPUs.
Single CPU + single GPU -> bottleneck is PCIe x16 bandwidth + memory lanes.
Single CPU + multi GPU -> bottleneck is PCIe bandwidth, interrupts and DMA, soft IRQs.
Multi-node CPU+GPU -> bottleneck is all of the above + NIC + kernel + interconnect. Even with InfiniBand links across nodes, high-throughput networking in the kernel is needed.
Imagine: CUDA/PyTorch/Keras optimized in userland, InfiniBand/NICs offering the highest-Gbps interconnect, but the kernel conking out at moving tensors across machines (see the back-of-envelope sketch below).
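The sketch below puts rough numbers on those links. The bandwidths are nominal theoretical peaks and the payload size is an assumption for illustration; real-world throughput is lower, which only strengthens the point that the kernel path has little headroom:

```python
# Back-of-envelope: time to move a tensor/gradient payload over the links
# named above. Bandwidths are nominal peaks (assumed for illustration).

PAYLOAD_GB = 10  # assumed per-step payload exchanged between machines

links_gbytes_per_s = {
    "PCIe 3.0 x16":        15.75,  # ~985 MB/s per lane * 16
    "PCIe 4.0 x16":        31.5,
    "100 Gbps NIC":        12.5,   # 100 Gb/s / 8
    "InfiniBand NDR 400G": 50.0,   # 400 Gb/s / 8
}

for link, bw in links_gbytes_per_s.items():
    seconds = PAYLOAD_GB / bw
    print(f"{link:<22} {seconds * 1000:7.1f} ms per {PAYLOAD_GB} GB")
# Once the wire is this fast, per-packet kernel overhead (copies, softirqs,
# iptables traversal) becomes the limiting factor -- the gap that eBPF-based
# datapaths like Cilium's aim to close.
```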
protect their intellectual property ⤴️ ^1fef791e
Only really applies in PaaS mode. I would argue training is done on a private VPC with bastion nodes,
not over the public, eavesdroppable internet.
So model theft is a difficult attack vector.
%%f712d176-06d0-46b6-9a66-afaf9742d2be_end%%