August 15, 2024, marked the launch of Geekbench AI 1.0, a tool designed to bring clarity and consistency to AI benchmarking. By measuring AI workloads across platforms, it has drawn attention from developers, hardware engineers, and performance enthusiasts alike.
But does it truly live up to its promises?
This article examines Geekbench AI 1.0’s methods, strengths, and limitations. Is it really the best tool for measuring AI and machine learning performance, or just another benchmark to use with caution?

The Promise of Geekbench AI 1.0
John Poole of Primate Labs announced that Geekbench AI 1.0 is the result of years of feedback and collaboration with the AI engineering community. The benchmark aims to reflect real-world AI workloads and to offer the same cross-platform utility and relevance as the original Geekbench.
Geekbench AI is available for Windows, macOS, Linux, Android, and iOS. Its suite of tasks lets developers check app performance across platforms, while hardware engineers can use it to measure architectural improvements.
The tool’s name change from “Geekbench ML” to “Geekbench AI” aligns with an industry-wide trend of foregrounding “AI” in marketing and products. The rebranding aims to make clear what the benchmark measures and how it works, for all stakeholders, from engineers to enthusiasts.
A Closer Look at the Workloads
Having worked with various AI models and benchmarks over the years, I was intrigued by the breadth of workloads in Geekbench AI 1.0. The benchmark covers many tasks, including computer vision (e.g., image classification, object detection) and NLP (e.g., text classification, machine translation). On paper, Geekbench AI has all the bases covered, but does it really reflect the diversity of AI applications?
After going through the Geekbench AI 1.0 Inference Workloads document, I was struck by the choice of models for these tasks. Geekbench AI uses lightweight models for image and text classification: MobileNetV1 for images and BERT-Tiny for text.
Both are optimized for efficiency, which makes them practical for mobile and embedded use, but neither showcases the true power of advanced AI systems. In my work, I’ve found that models such as EfficientNet and ResNet are more powerful and versatile, especially in high-performance scenarios. Focusing on smaller models might skew benchmark results away from the performance of devices optimized for demanding applications.
BERT-Tiny is efficient for text classification, but it lacks the depth and sophistication of the larger models I’ve used in NLP projects. This choice could tilt the benchmark toward simpler tasks and miss insights critical for those running complex, large-scale NLP models.
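To put the scale gap in perspective, here’s a minimal sketch comparing parameter counts, using TensorFlow’s stock Keras application models as stand-ins (Geekbench AI ships its own converted models, so these figures are illustrative only):

```python
# Sketch: how a lightweight benchmark model compares in scale with the
# larger architectures mentioned above. Keras application models are
# used as stand-ins for Geekbench AI's own converted models.
import tensorflow as tf

models = {
    "MobileNetV1": tf.keras.applications.MobileNet(weights=None),
    "EfficientNetB0": tf.keras.applications.EfficientNetB0(weights=None),
    "ResNet50": tf.keras.applications.ResNet50(weights=None),
}

for name, model in models.items():
    # count_params() totals trainable and non-trainable parameters.
    print(f"{name}: {model.count_params() / 1e6:.1f}M parameters")
```

MobileNetV1 comes in several times smaller than ResNet50, which is precisely why strong results on one say little about performance on the other.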
Scoring System: Simplification at a Cost?

Geekbench AI gives three overall scores: Single Precision, Half Precision, and Quantized. Each represents a different level of computational precision (32-bit floating point, 16-bit floating point, and integer-quantized inference, respectively). Each score is calculated as the geometric mean of the corresponding workload scores, with adjustments for accuracy.
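For a rough sense of how a geometric mean rolls per-workload results into one number, here’s a minimal sketch with hypothetical scores; the accuracy adjustment Geekbench applies on top is its own and isn’t reproduced here:

```python
# Sketch: aggregate hypothetical per-workload scores with a geometric
# mean, the general approach Geekbench AI describes. The accuracy
# adjustment it layers on top is omitted.
import math

def geometric_mean(scores):
    # (s1 * s2 * ... * sn) ** (1/n), computed in log space so large
    # score products can't overflow.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

half_precision = {                 # hypothetical workload scores
    "image_classification": 3200,
    "object_detection": 2100,
    "text_classification": 4800,
}

print(f"Half Precision overall: {geometric_mean(half_precision.values()):.0f}")
```

One property worth noting: a geometric mean is pulled down by a single low score more than an arithmetic mean would be, which tempers, though it doesn’t eliminate, the masking effect discussed next.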
At first glance, this three-score summary seems like a simple way to compare AI performance across devices. But I’ve found that simplification often comes at a cost.
Relying on specific metrics, like Top-1 accuracy for image classification and F1 scores for object detection, narrows the evaluation to just a few aspects of performance. In my experience, AI performance is multi-dimensional, and reducing it to a handful of numbers risks obscuring important nuances.
For example, a device that excels in one area might have significant drawbacks in another, and those details get lost in the aggregate scores.
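To see how little each headline metric tells you on its own, here’s a minimal sketch computing Top-1 accuracy and F1 from toy binary predictions; neither number says anything about latency, memory use, or robustness:

```python
# Sketch: the two per-workload quality metrics mentioned above, computed
# on toy binary predictions. Each captures one narrow slice of quality.

def top1_accuracy(predicted, actual):
    # Fraction of samples where the top-scoring class matches the label.
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def f1_score(predicted, actual, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

predicted = [1, 1, 1, 0, 0, 0]
actual = [1, 0, 0, 0, 0, 0]
print(f"Top-1 accuracy: {top1_accuracy(predicted, actual):.2f}")  # 0.67
print(f"F1 score:       {f1_score(predicted, actual):.2f}")       # 0.50
```

The same predictions score 0.67 on accuracy but only 0.50 on F1, a small illustration of how a single number can paint two different pictures of the same model.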
What People Are Saying about Geekbench AI 1.0
The launch of Geekbench AI 1.0 has sparked various discussions across the AI and developer communities, including on platforms like Reddit. Some users appreciate the addition of support for new frameworks like OpenVINO on Linux and Windows and vendor-specific TensorFlow Lite delegates for Android, such as Samsung ENN, ArmNN, and Qualcomm QNN.
This support matters because it reflects the toolchains developers actually use to build apps on modern hardware.
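For context on why this matters, here’s a minimal sketch of how an app routes TensorFlow Lite inference through a vendor delegate rather than the deprecated NNAPI path; the delegate library and model file names below are hypothetical, as each vendor (Qualcomm QNN, ArmNN, Samsung ENN) ships its own binary and options:

```python
# Sketch: hand TensorFlow Lite inference to a vendor-specific delegate.
# The .so name and model path are hypothetical placeholders.
import tensorflow as tf

delegate = tf.lite.experimental.load_delegate(
    "libQnnTFLiteDelegate.so"  # hypothetical vendor delegate library
)
interpreter = tf.lite.Interpreter(
    model_path="mobilenet_v1.tflite",  # hypothetical converted model
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
# Ops the delegate supports run on the vendor's NPU or GPU; unsupported
# ops fall back to the CPU. A benchmark that never exercises this path
# misses how inference actually runs on shipping Android devices.
```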

However, there’s also criticism. For instance, one user pointed out that Geekbench ML had been “totally useless” on Android for the past few years because Android vendors stopped updating their NNAPI drivers in favor of vendor-specific delegates.
That shift rendered previous versions of the benchmark unrepresentative of real device performance. While Geekbench AI addresses the issue with its new delegate support, it raises the question of how responsive benchmarking tools must be to industry changes.
This sentiment reflects a broader concern within the community: benchmarks need to evolve alongside industry practices and hardware developments. While the new features in Geekbench AI are a step in the right direction, some users remain cautious about whether it can keep up with the fast pace of change in AI technology.
The Industry Implications
Primate Labs says that major companies like Samsung and Nvidia use Geekbench AI in their workflows. This widespread use suggests that Geekbench AI could soon become the standard AI benchmark. However, this also raises concerns for me.
Over the years, I’ve seen how a single benchmark’s dominance can lead to a narrow focus on optimizing for that tool rather than on broad performance improvements.
If Geekbench AI becomes the dominant benchmark, there’s a risk that companies will prioritize high scores over real-world effectiveness. This is not new; I’ve seen it in other areas of tech, where the quest for benchmark supremacy led to poor design decisions and a loss of focus on more meaningful metrics.
Final Thoughts: Proceed with Caution
Geekbench AI 1.0 is a significant step toward a standard AI benchmark, but I urge caution in interpreting its results. It offers valid cross-platform comparisons; however, developers shouldn’t rely on these scores alone to make decisions. Given the trade-offs in model selection and evaluation, Geekbench AI should be one of many tools, not the definitive measure of AI performance.
New models and applications appear daily, and no benchmark can capture them all. Based on my work, the best approach is to use Geekbench AI as part of a broader toolkit, ensuring our AI models are optimized for real-world challenges, not just for benchmarks.