MLPerf Inference 4.1 Benchmarks: A Closer Look at the Latest Hardware
MLCommons recently released a new round of benchmarks in the inference domain, aiming to make data center hardware easier to compare. As an independent organization, MLCommons sets the test conditions and constraints to ensure a fair evaluation, participants must adhere to these rules, and all results are reviewed by the other participants to prevent manipulation. Unlike benchmarks published directly by manufacturers, MLCommons' results are designed to be comparable: the closed category forbids manufacturer-specific optimizations such as retraining or otherwise modifying the reference models, while the open category lets hardware and software vendors tune their systems further. With that context, let's take a closer look at the MLPerf Inference 4.1 results.
Across the board, the results show that computing performance has increased by double-digit percentages even on identical hardware. NVIDIA's submissions stand out here, as the company consistently delivered results across the different Hopper iterations: software improvements alone yielded performance gains of up to 30% within just a few months. Another highlight of MLPerf Inference 4.1 is the addition of a Mixture-of-Experts (MoE) model, Mixtral 8x7B. An MoE combines multiple expert models, with a "gating network" deciding which experts are best suited to a given input. By drawing on the strengths of the individual experts, MoEs can significantly improve quality and efficiency.
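The routing idea behind an MoE is easy to sketch. Below is a minimal, hypothetical illustration in Python/NumPy of top-k gating: a small gating network scores the experts for each token, and only the best-scoring experts actually run, which is why a model like Mixtral 8x7B activates only a fraction of its total weights per token. All names, shapes, and the choice of top_k = 2 are illustrative assumptions, not Mixtral's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: 8 "experts", each a simple linear layer.
# (Mixtral pairs 8 experts of ~7B parameters each; these are tiny stand-ins.)
d_model, n_experts, top_k = 16, 8, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
w_gate = rng.standard_normal((d_model, n_experts)) * 0.1  # the gating network

def moe_layer(x):
    """Route each token to its top_k experts and mix their outputs."""
    scores = softmax(x @ w_gate)                # (tokens, n_experts)
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        best = np.argsort(scores[t])[-top_k:]   # indices of the k best experts
        weights = scores[t][best] / scores[t][best].sum()
        for w, e in zip(weights, best):
            out[t] += w * (token @ experts[e])  # only top_k experts execute
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 16)
```

The practical upshot for inference hardware: compute per token scales with the number of active experts, while memory footprint scales with all of them, which makes MoEs an interesting stress test for accelerators.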
MoE Benchmark: NVIDIA’s Dominance and Future Potential
Submissions for the MoE benchmark are still limited and center on NVIDIA hardware, namely the H100, H200, and GH200 accelerators. The performance gap between eight H100 and eight H200 accelerators is smaller than one might expect, which underlines how early MoE deployment still is. Software support and hardware requirements will continue to evolve, and NVIDIA's forthcoming Blackwell GPUs are expected to handle MoEs more efficiently.
Instinct MI300X vs. NVIDIA H200: A Comparative Analysis
AMD's Instinct MI300X accelerator has now entered the competition, allowing a direct comparison with NVIDIA's current H100 and H200. AMD and NVIDIA have previously traded public blows over proprietary benchmarks; the MLPerf data offers a more objective basis. In the inference tests on Llama2 with 70 billion parameters, the H200 outperforms the Instinct MI300X. Despite the MI300X's larger HBM3 memory capacity, the H200 delivers higher performance, underscoring NVIDIA's lead in this segment.
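A back-of-the-envelope calculation makes the memory angle concrete. The sketch below estimates the weight footprint of a 70-billion-parameter model only (ignoring KV cache, activations, and framework overhead) and compares it against the publicly listed capacities of roughly 192 GB for the MI300X and 141 GB for the H200:

```python
# Rough weight-memory estimate for a 70B-parameter model such as Llama2-70B.
# Ignores KV cache, activations, and framework overhead.
params = 70e9
bytes_per_param = {"fp16/bf16": 2, "fp8/int8": 1}

for fmt, b in bytes_per_param.items():
    print(f"{fmt}: {params * b / 1e9:.0f} GB of weights")

# fp16/bf16: 140 GB of weights
# fp8/int8:   70 GB of weights
```

Against ~192 GB (MI300X, HBM3) and ~141 GB (H200, HBM3e), the model fits on a single accelerator in either case, so raw capacity alone does not decide this benchmark; compute throughput, memory bandwidth, and software maturity weigh more heavily.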
B200 vs. H200, GH200, and Instinct MI300X: NVIDIA’s Next-Generation Accelerator
NVIDIA has previewed the B200 accelerator based on the Blackwell architecture, giving a first independent look at its computational capabilities. The benchmark results show a significant performance leap over both its predecessors and the competition, underscoring NVIDIA's technological lead in the accelerator space. While NVIDIA has not disclosed the B200's exact power requirements, its results in the Llama2-70B inference test outshine all existing solutions.
Intel Granite Rapids: Xeon 6 Takes On CPU Inferencing
Intel's upcoming Granite Rapids processors, part of the Xeon 6 product line, are set to debut next quarter. Preliminary benchmarks pit the new chips against the previous Emerald Rapids generation in CPU inference tasks, where the higher core count of Granite Rapids translates into a notable performance boost, in line with market expectations. A comparison with NVIDIA's Grace Hopper accelerator also illustrates how large the gap remains between CPU inferencing and dedicated AI accelerators.
Google TPU v6e: Advancements in Google’s Inferencing Technology
Google's TPU v6e, codenamed Trillium and introduced at Google I/O, promises significant speed and efficiency improvements over its predecessor, the TPU v5e. In a preview submission for Stable Diffusion XL, the TPU v6e demonstrates this gain in inference performance. The substantial increase in INT8 compute throughput underscores Google's continued investment in inferencing capabilities, though the hardware is not yet generally available.
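Why INT8 matters for inference: 8-bit weights halve memory traffic versus FP16 and typically allow much higher math throughput on hardware with dedicated low-precision units. The sketch below shows the core idea, simple symmetric per-tensor INT8 quantization, purely as an illustration; it is not how the TPU's compiler actually quantizes a model:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes / 1e6:.1f} MB, int8: {q.nbytes / 1e6:.1f} MB")  # 4.2 MB vs 1.0 MB
err = np.abs(w - q.astype(np.float32) * scale).mean()
print(f"mean absolute quantization error: {err:.5f}")
```

The trade-off is a small, usually tolerable loss of precision in exchange for a 4x reduction in weight storage compared to FP32, which is exactly the lever benchmarks like the Stable Diffusion XL submission exploit.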
In conclusion, the MLPerf Inference 4.1 benchmarks provide valuable insight into the evolving landscape of data center hardware. NVIDIA's preview of the B200, AMD's entry with the Instinct MI300X, Intel's Granite Rapids processors, and Google's TPU v6e illustrate the competitive dynamics shaping the industry. As demand for efficient inference solutions keeps growing, continued innovation in both hardware and software remains essential. The complete MLPerf Inference 4.1 results are available directly from MLCommons.