Eight dual-port 100G NICs (available for a while) or two dual-port 400-gigabit NICs (not yet available) would out-bandwidth the memory controller on an EPYC 7742; how are NICs such a bottleneck that they need a 100x increase to keep up, when DDR5 only doubles memory bandwidth?
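A rough sketch of the comparison above, assuming the commonly cited figures of eight DDR4-3200 channels on the EPYC 7742 and 8 bytes per channel transfer:

```python
# Back-of-the-envelope check: aggregate NIC bandwidth vs. EPYC 7742
# memory bandwidth (assumed: 8 dual-port 100G NICs, 8x DDR4-3200 channels).
nic_tbps = 8 * 2 * 100 / 1000                 # 16 x 100 Gbit/s ports

channel_gbytes = 3200e6 * 8 / 1e9             # 25.6 GB/s per DDR4-3200 channel
mem_tbps = 8 * channel_gbytes * 8 / 1000      # 8 channels, bytes -> bits

print(f"NIC aggregate:    {nic_tbps:.2f} Tbit/s")
print(f"Memory aggregate: {mem_tbps:.2f} Tbit/s")
```

The two come out roughly equal (~1.6 Tbit/s each), which is the point: the NICs alone already saturate the memory controller.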
I forget what the bandwidth of CPU cache is but I'm guessing it's not 10 terabit/second either.
It's a good point that HBM has high aggregate bandwidth, but I still don't think it makes sense to call the server/network boundary a 100x bottleneck when the already-available NICs are faster than NVLink in the first place.
What counts as the "server/network boundary" here might not be the classical boundary, though, so maybe they mean the same thing I'm saying, just from a different perspective.
Simple: when the packet data does not need to be processed by the CPU. For example, a router forwarding network packets at 10 Tbit/s. The data can stay in the NIC's cache as it is being forwarded. No PCIe/CPU/RAM bottleneck there.
Also, EPYC Rome has 1.64 Tbit/s of RAM bandwidth today (eight DDR4-3200 channels). 10 Tbit/s is less than three doublings away, so it's conceivable server CPUs reach that bandwidth in 4-6 years.
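The "less than three doublings" claim checks out with a quick calculation (starting from the 1.64 Tbit/s figure above):

```python
import math

# How many bandwidth doublings separate today's EPYC Rome
# (~1.64 Tbit/s from eight DDR4-3200 channels) from 10 Tbit/s?
current_tbps = 1.64
target_tbps = 10.0
doublings = math.log2(target_tbps / current_tbps)
print(f"{doublings:.2f} doublings needed")  # ~2.61, i.e. under three
```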
32-port 400G (12.8 terabit/s) 1U routers already exist today; the conversation is about the NIC at the server <-> network boundary, not switching ASICs. The only reason you don't see 400G NICs in servers is that servers can't push that much bandwidth over PCIe (the real bottleneck location).
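To make the PCIe bottleneck concrete, here's a sketch of the usable bandwidth of an x16 slot per PCIe generation, using approximate per-lane rates after 128b/130b encoding overhead:

```python
# Approximate usable per-lane rates (Gbit/s) after 128b/130b encoding:
# PCIe 3.0 = 8 GT/s, 4.0 = 16 GT/s, 5.0 = 32 GT/s.
GBPS_PER_LANE = {"PCIe 3.0": 7.877, "PCIe 4.0": 15.754, "PCIe 5.0": 31.508}

for gen, per_lane in GBPS_PER_LANE.items():
    x16 = per_lane * 16
    print(f"{gen} x16: ~{x16:.0f} Gbit/s -> fits a 400G NIC: {x16 >= 400}")
```

A PCIe 3.0 x16 slot tops out around 126 Gbit/s and even 4.0 x16 around 252 Gbit/s, so a single 400G port needs a PCIe 5.0 x16 slot (or multiple slots).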
You still need to account for NIC-to-NIC packet transfers in scenarios where a packet arrives on physical NIC A and needs to egress via physical NIC B. Obviously there are better options than plain PCIe transport these days.
This is what RDMA solves. You are only limited by the number of PCIe switches you stack up in your topology, not by the processor at all. All of your data is either handled directly by the NIC or offloaded to an accelerator card. Modern systems can support about 1.6 Tbit/s (100G NICs). When PCIe 4 comes out, this should double.
In principle you can use P2P-DMA to shunt the bulk of the data to a specialized device (e.g. GPU, FPGA, storage) without it ever touching main memory or the CPU.