
When it comes to artificial intelligence, measuring performance isn't as straightforward as it is with traditional computing benchmarks. That's where MLCommons steps in. This non-profit consortium, whose more than 125 members range from major tech players to startups, has developed the go-to standard for measuring AI system performance. Their work spans six continents (they're still looking for that elusive Antarctic partner) and has generated over 56,000 benchmark results. The 60th IT Press Tour recently met with MLCommons. Here's what I learned.
Why MLCommons Matters Now
Here's the thing about AI development - it's moving at breakneck speed. As David Kanter, MLCommons' Executive Director, points out, "What gets measured gets improved." This rings especially true in AI, where model sizes are growing exponentially. The consortium's data shows that transformer models, which power many of today's AI applications, are growing by about 750 times every two years, far outpacing traditional hardware improvements.
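To put that 750x figure in perspective, here's a quick back-of-the-envelope calculation of my own (not an MLCommons number): growing 750-fold over 24 months works out to model sizes doubling roughly every two and a half months, versus the roughly two-year doubling cadence traditionally associated with hardware.

```python
# Rough arithmetic behind the growth rate quoted above (illustrative only).
import math

growth_factor = 750   # reported model-size growth over the period
period_months = 24    # two years

doubling_time = period_months / math.log2(growth_factor)  # months per doubling
print(f"Implied doubling time: about {doubling_time:.1f} months")  # ~2.5 months
```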
Breaking Down the Benchmarks
MLCommons' approach goes far beyond simple speed tests. Their comprehensive benchmarking system examines several key areas:
Time-to-train: How quickly systems can learn from data, with strict quality thresholds that must be met (a minimal sketch of this idea follows the list)
Inference performance: How well systems can apply what they've learned, measured across different scenarios like single queries and batch processing
Power efficiency: Essential metrics for data centers trying to balance performance with energy costs
Storage capabilities: Critical for handling the massive datasets that modern AI requires
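To make the time-to-train idea concrete, here's a minimal sketch of the principle, assuming a generic training loop. The function names and the toy accuracy numbers are my own placeholders, not MLCommons code or official quality targets.

```python
import time

def measure_time_to_train(train_one_epoch, evaluate, target_accuracy):
    """Run training epochs until a fixed quality threshold is reached and
    report the wall-clock time taken (illustrative sketch only)."""
    start = time.monotonic()
    accuracy, epochs = 0.0, 0
    while accuracy < target_accuracy:
        train_one_epoch()        # one pass over the training data
        accuracy = evaluate()    # check quality on held-out data
        epochs += 1
    return {"epochs": epochs,
            "seconds": time.monotonic() - start,
            "final_accuracy": accuracy}

# Toy stand-ins for a real training loop and validation pass.
state = {"acc": 0.0}
def fake_train_epoch(): state["acc"] += 0.1
def fake_evaluate(): return state["acc"]

print(measure_time_to_train(fake_train_epoch, fake_evaluate, target_accuracy=0.75))
```

The point is that speed only counts once the quality bar is cleared, which keeps submitters from trading accuracy for raw throughput.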
The MLPerf Suite: A System-Wide View
MLPerf, the flagship benchmark suite, takes what Kanter calls a "full system" approach. It looks at the following dimensions (a toy sketch of how they fit together appears after the list):
Scale: How systems handle increasing data and model sizes
Algorithms: The efficiency of different AI approaches
Silicon: Hardware performance at the chip level
Software: Framework and implementation efficiency
Architecture: Overall system design effectiveness
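One way to picture what "full system" means is to think about everything a single result record has to capture. The sketch below is a toy illustration with invented field names; the real MLPerf submission format is far more detailed.

```python
from dataclasses import dataclass

@dataclass
class FullSystemResult:
    """Toy record showing the layers a 'full system' result spans
    (hypothetical fields, not the actual MLPerf submission schema)."""
    accelerator: str        # silicon: the chip used, e.g. a GPU or custom ASIC
    accelerator_count: int  # scale: how many chips the run used
    framework: str          # software: the ML framework and version
    model: str              # algorithm: the benchmark workload
    interconnect: str       # architecture: how the system is wired together
    score: float            # the measured result, e.g. minutes to train

example = FullSystemResult(
    accelerator="ExampleChip-X", accelerator_count=64,
    framework="SomeFramework 2.1", model="image classification",
    interconnect="example fabric", score=12.3,
)
print(example)
```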
Real-World Applications
The benchmarks come in two primary flavors - closed and open divisions. In the closed division, everyone tests mathematically equivalent models, creating a level playing field for comparison. The open division allows for innovation, letting companies showcase new approaches while still meeting quality standards.
One interesting recent development is the explosion of generative AI benchmarking. The latest MLPerf Training round saw a 46% increase in submissions across GPT-3, Stable Diffusion, and Llama 2 models.
Power and Performance
Energy efficiency has become a significant focus. The latest round of MLPerf Training included the industry's first data center-scale full-system power measurement methodology, applicable to both on-premises and cloud environments. Companies like Dell are now submitting power measurement results alongside performance data, marking a shift toward more sustainable AI development.
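As a rough illustration of how power and performance can be folded into one efficiency figure, the sketch below averages power readings taken during a run and divides the work completed by the energy consumed. The sampling interface and the samples-per-joule metric are my own simplification, not the official MLPerf Power methodology.

```python
def efficiency_from_power_log(power_samples_watts, interval_s, completed_samples):
    """Combine a power log with benchmark output into an efficiency figure
    (simplified illustration, not the MLPerf Power methodology)."""
    energy_joules = sum(power_samples_watts) * interval_s  # W * s = J
    avg_power = sum(power_samples_watts) / len(power_samples_watts)
    return {
        "avg_power_w": avg_power,
        "energy_j": energy_joules,
        "samples_per_joule": completed_samples / energy_joules,
    }

# Example: 1-second power samples over a short run that processed 12,000 inferences.
log = [412.0, 408.5, 415.2, 410.7]
print(efficiency_from_power_log(log, interval_s=1.0, completed_samples=12_000))
```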
Future Directions
MLCommons keeps expanding its scope. They're working with the automotive industry through AVCC to develop specialized benchmarks for vehicle AI systems, covering everything from safety features to user experience. Their latest client benchmark focuses on large language models, reflecting the growing importance of AI in everyday computing.
Getting Involved
Companies can participate at different levels, with membership tiers based on organization size. For companies under 500 employees, a small company tier makes participation more accessible. What's particularly valuable is that all benchmark results become public, creating a rich dataset for the entire industry.
The consortium also offers different submission categories:
Available: For commercially ready products
Preview: For products coming to market within 6 months
RDI: For research, development, and internal systems
Why It Matters for IT Professionals
For IT teams evaluating AI solutions, MLPerf results provide concrete data for decision-making. European supercomputer procurement teams already use these benchmarks in their RFPs, and many enterprise customers rely on them for system selection.
As Kanter emphasizes, "We're not just measuring performance - we're helping create the conditions for better AI development across the board." With AI becoming central to more business operations, having standardized ways to measure and compare solutions becomes increasingly valuable.
The real strength of MLCommons lies in its community approach. Bringing together competitors to agree on measurement standards helps advance the entire field while giving both vendors and customers the tools they need to make more informed decisions.