Description
Chip-Level Compute Efficiency (TFLOPS per Watt)
| Metric | NVIDIA H200 SXM | NVIDIA B300 (Blackwell Ultra SXM) |
|---|---|---|
| TDP | 700 W | 1,400 W |
| FP8 Dense Compute | ~1,979 TFLOPS | ~4,500–7,000 TFLOPS |
| FP8 TFLOPS/W | ~2.83 | ~3.2–5.0 |
| FP4 Dense Compute | Not Supported | ~15,000 TFLOPS (NVFP4) |
| FP4 TFLOPS/W | N/A | ~10.7 |
On raw silicon, the B300 delivers roughly 1.1×–1.8× better FP8 compute-per-watt than the H200 (~3.2–5.0 vs. ~2.83 TFLOPS/W). The gap widens significantly with FP4 precision, which the H200 cannot execute natively: comparing B300 FP4 (~10.7 TFLOPS/W) against H200 FP8 (~2.83 TFLOPS/W) gives the B300 a ~3.8× theoretical energy-efficiency advantage at the arithmetic level.
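The per-watt rows are peak throughput divided by TDP, and the comparison ratios are quotients of those rows. A minimal sketch of that arithmetic, using only the values in the table (the variable and function names are illustrative, not from any vendor tool):

```python
# Peak dense throughput (TFLOPS) and TDP (W), copied from the table.
H200_TDP_W = 700
B300_TDP_W = 1400
H200_FP8_TFLOPS = 1979
B300_FP8_TFLOPS_RANGE = (4500, 7000)  # low and high end of the quoted range
B300_FP4_TFLOPS = 15000


def tflops_per_watt(tflops: float, tdp_w: float) -> float:
    """Peak TFLOPS divided by TDP in watts."""
    return tflops / tdp_w


h200_fp8 = tflops_per_watt(H200_FP8_TFLOPS, H200_TDP_W)
b300_fp8_low, b300_fp8_high = (
    tflops_per_watt(t, B300_TDP_W) for t in B300_FP8_TFLOPS_RANGE
)
b300_fp4 = tflops_per_watt(B300_FP4_TFLOPS, B300_TDP_W)

# Ratios relative to H200 FP8, the best precision the H200 supports here.
fp8_ratio_low = b300_fp8_low / h200_fp8
fp8_ratio_high = b300_fp8_high / h200_fp8
fp4_vs_fp8_ratio = b300_fp4 / h200_fp8

print(f"H200 FP8: {h200_fp8:.2f} TFLOPS/W")
print(f"B300 FP8: {b300_fp8_low:.2f}-{b300_fp8_high:.2f} TFLOPS/W")
print(f"B300 FP4: {b300_fp4:.1f} TFLOPS/W")
print(f"FP8 ratio: {fp8_ratio_low:.2f}x-{fp8_ratio_high:.2f}x")
print(f"FP4 (B300) vs FP8 (H200): {fp4_vs_fp8_ratio:.2f}x")
```

These are peak-spec quotients only; sustained power draw and achieved throughput in a real deployment will differ from TDP and datasheet TFLOPS.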
Workload-Level Efficiency — Tokens per Watt (LLM Inference)
Paper specs don’t reflect the memory-bound behavior of LLM inference. The B300’s 288 GB of HBM3e (vs. 141 GB on the H200) allows a 70B–110B parameter model plus its long-context KV cache to fit entirely on a single GPU at FP8/FP4, eliminating cross-GPU tensor-parallel communication overhead during inference.
- FP8 Inference: A single B300 GPU can hold larger models with the KV cache resident, reducing NVLink traffic and inference latency. Effective tokens-per-watt is typically 3×–5× that of the H200 for 70B-class models.
- FP4 Inference (B300 Exclusive): Using NVFP4 with the 2nd-Gen Transformer Engine, Blackwell Ultra (GB300 NVL72 rack-scale) delivers up to 50× higher throughput per megawatt and 35× lower cost per million tokens versus Hopper-class platforms (H200/H100) per NVIDIA’s published data.
- Training Workloads: With NVLink 5.0 (1.8 TB/s bidirectional vs. the H200’s 900 GB/s), collective communication overhead per training step is reduced. Effective training throughput per megawatt is approximately 2×–3× that of an H200-based cluster for large-batch FP8 pretraining.
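The single-GPU fit argument can be checked with a rough memory estimate: weight bytes plus KV-cache bytes against HBM capacity. The sketch below is a back-of-the-envelope calculation, not a sizing tool. The model shape (80 layers, 8 KV heads, head dimension 128), the 128K-token context, and the batch size of 4 are hypothetical values chosen for illustration, and activations and runtime overhead are ignored.

```python
H200_HBM_GB = 141  # capacity quoted in the text
B300_HBM_GB = 288  # capacity quoted in the text


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 params x bytes per param / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: float) -> float:
    """KV-cache footprint in GB; the factor 2 covers one K and one V tensor per layer."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9


def fits(total_gb: float, capacity_gb: float) -> bool:
    """True if the estimated footprint is within the GPU's HBM capacity."""
    return total_gb <= capacity_gb


# Hypothetical 70B-class model served at FP8 (1 byte per weight and per KV element).
w = weights_gb(params_billion=70, bytes_per_param=1.0)
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                 seq_len=131072, batch=4, bytes_per_elem=1.0)
total = w + kv

print(f"weights: {w:.1f} GB, KV cache: {kv:.1f} GB, total: {total:.1f} GB")
print("fits on one H200:", fits(total, H200_HBM_GB))
print("fits on one B300:", fits(total, B300_HBM_GB))
```

Under these assumed settings the footprint exceeds one H200 but fits on one B300, which is the case where tensor-parallel communication can be avoided. Different context lengths, batch sizes, or attention layouts change the result.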
Facility-Level Consideration — Cooling Overhead
The B300’s 1,400 W TDP requires direct-to-chip cold-plate liquid cooling, with CDU and pump power adding ~5–10% to rack-level energy draw, whereas the H200 can operate in air-cooled or hybrid-cooled racks at 700 W. When calculating TCO, use rack-level power (IT load + cooling overhead) rather than GPU TDP alone. Even accounting for liquid-cooling parasitics, the B300 completes the same inference or training workload with substantially less total energy consumption.
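The rack-level comparison comes down to energy per workload, that is, power (including cooling overhead) multiplied by time to completion. A minimal sketch, assuming per-GPU power can be approximated by TDP and that the H200's cooling overhead is a free parameter (set to 0 here purely as a simplification; real air-cooled racks also carry fan and facility overhead). The `speedup` argument is a hypothetical per-GPU throughput ratio of B300 to H200 for the workload in question, not a measured figure.

```python
H200_TDP_W = 700
B300_TDP_W = 1400


def rack_power_w(gpu_tdp_w: float, cooling_overhead: float) -> float:
    """GPU power plus cooling overhead expressed as a fraction of IT load."""
    return gpu_tdp_w * (1.0 + cooling_overhead)


def energy_ratio(speedup: float,
                 b300_overhead: float = 0.10,
                 h200_overhead: float = 0.0) -> float:
    """B300 energy divided by H200 energy for the same workload.

    Energy = power x time, and time scales as 1 / throughput, so the B300
    term is divided by its assumed per-GPU speedup over the H200.
    """
    b300_energy = rack_power_w(B300_TDP_W, b300_overhead) / speedup
    h200_energy = rack_power_w(H200_TDP_W, h200_overhead)
    return b300_energy / h200_energy


def break_even_speedup(b300_overhead: float = 0.10,
                       h200_overhead: float = 0.0) -> float:
    """Per-GPU speedup at which both GPUs use equal energy for the workload."""
    return (rack_power_w(B300_TDP_W, b300_overhead)
            / rack_power_w(H200_TDP_W, h200_overhead))


print(f"break-even speedup: {break_even_speedup():.2f}x")
print(f"energy ratio at 4x speedup: {energy_ratio(4.0):.2f}")
```

With the upper-end 10% liquid-cooling overhead and no overhead credited to the H200, the B300 uses less energy whenever its per-GPU throughput exceeds about 2.2× that of the H200. Any realistic H200 cooling overhead lowers that threshold.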
Bottom Line
The NVIDIA B300 offers ~1.1×–1.8× better raw compute-per-watt at FP8 and ~4× at FP4 (measured against H200 FP8) versus the NVIDIA H200. In real LLM inference deployments, the effective energy efficiency measured in tokens-per-megawatt ranges from 3× better (FP8) to as much as 50× better (FP4, rack-scale, per NVIDIA’s published data) due to the combined benefits of larger HBM3e capacity, native FP4 acceleration, and doubled attention-layer throughput. The tradeoff is the mandatory liquid-cooling infrastructure for B300 deployment.


