Eight dual-port 100G NICs (available for a while) or two dual-port 400-gigabit NICs (not yet available) would out-bandwidth the memory controller on an EPYC 7742; how are NICs such a bottleneck that they need a 100x increase to keep up, when DDR5 only doubles memory bandwidth?
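A rough sketch of the comparison above, assuming the commonly cited figures of eight DDR4-3200 channels on the EPYC 7742 and 8 bytes per channel transfer:

```python
# Back-of-the-envelope check: aggregate NIC bandwidth vs. EPYC 7742
# memory bandwidth (assumed: 8 dual-port 100G NICs, 8x DDR4-3200 channels).
nic_tbps = 8 * 2 * 100 / 1000                 # 16 x 100 Gbit/s ports

channel_gbytes = 3200e6 * 8 / 1e9             # 25.6 GB/s per DDR4-3200 channel
mem_tbps = 8 * channel_gbytes * 8 / 1000      # 8 channels, bytes -> bits

print(f"NIC aggregate:    {nic_tbps:.2f} Tbit/s")
print(f"Memory aggregate: {mem_tbps:.2f} Tbit/s")
```

The two come out roughly equal (~1.6 Tbit/s each), which is the point: the NICs alone already saturate the memory controller.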
I forget what the bandwidth of CPU cache is but I'm guessing it's not 10 terabit/second either.
It's a good point that HBM has high aggregate bandwidth, but I still don't think it makes sense to call the server/network boundary a 100x bottleneck when the already-available NICs are faster than NVLink in the first place.
What counts as the "server/network boundary" here might not be the classical boundary, though, so maybe they mean the same thing I'm saying, just from a different perspective.
Simple: when the packet data does not need to be processed by the CPU. For example, a router forwarding network packets at 10 Tbit/s. The data can stay in the NIC's cache as it is being forwarded. No PCIe/CPU/RAM bottleneck there.
Also, EPYC Rome has 1.64 Tbit/s of RAM bandwidth today (eight DDR4-3200 channels). 10 Tbit/s is less than three doublings away, so it's conceivable server CPUs reach that bandwidth in 4-6 years.
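The "less than three doublings" claim checks out with a quick calculation (starting from the 1.64 Tbit/s figure above):

```python
import math

# How many bandwidth doublings separate today's EPYC Rome
# (~1.64 Tbit/s from eight DDR4-3200 channels) from 10 Tbit/s?
current_tbps = 1.64
target_tbps = 10.0
doublings = math.log2(target_tbps / current_tbps)
print(f"{doublings:.2f} doublings needed")  # ~2.61, i.e. under three
```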
32-port 400G (12.8 terabit/s) 1U routers already exist today; the conversation is about the NIC at the server <-> network boundary, not switching ASICs. The only reason you don't see 400G NICs in servers is that servers can't push that much bandwidth over PCIe (the real bottleneck location).
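To make the PCIe bottleneck concrete, here's a sketch of the usable bandwidth of an x16 slot per PCIe generation, using approximate per-lane rates after 128b/130b encoding overhead:

```python
# Approximate usable per-lane rates (Gbit/s) after 128b/130b encoding:
# PCIe 3.0 = 8 GT/s, 4.0 = 16 GT/s, 5.0 = 32 GT/s.
GBPS_PER_LANE = {"PCIe 3.0": 7.877, "PCIe 4.0": 15.754, "PCIe 5.0": 31.508}

for gen, per_lane in GBPS_PER_LANE.items():
    x16 = per_lane * 16
    print(f"{gen} x16: ~{x16:.0f} Gbit/s -> fits a 400G NIC: {x16 >= 400}")
```

A PCIe 3.0 x16 slot tops out around 126 Gbit/s and even 4.0 x16 around 252 Gbit/s, so a single 400G port needs a PCIe 5.0 x16 slot (or multiple slots).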
You still need to account for NIC-to-NIC packet transfers in scenarios where a packet arrives on physical NIC A and needs to egress via physical NIC B. Obviously there are better options than plain PCIe transport these days.
This is what RDMA solves. You are only limited by the number of PCIe switches you stack up in your topology, not by the processor at all. All of your data is either handled directly by the NIC or offloaded to an accelerator card. Modern systems can support about 1.6 Tbit/s (100G NICs). When PCIe 4 comes out, this should double.
In principle you can use P2P-DMA to shunt the bulk of the data to a specialized device (e.g. GPU, FPGA, storage) without it ever touching main memory or the CPU.