AI Fails Math Test: ORCA Benchmark Exposes Weaknesses in Leading Language Models (2025)

In a world where truth feels relative, even the simplest math can become a challenge, and AI, it seems, is no exception. Despite their impressive capabilities, large language models struggle with basic arithmetic, as demonstrated by the ORCA benchmark.

ORCA, developed by a team of scientists from Omni Calculator and several European universities, presents a series of math-oriented questions spanning diverse scientific fields. The results? Five leading LLMs, including ChatGPT-5 and Gemini 2.5 Flash, failed to impress, each scoring 63% or lower.

But here's where it gets controversial: other benchmarks, like GSM8K and MATH-500, paint a different picture. Some AI models have posted scores of 95% or higher on them, suggesting near-perfect mathematical prowess. The researchers behind ORCA, however, argue that these benchmarks lack scientific rigor and that LLMs still make logical and arithmetic errors.

According to Oxford University's Our World in Data, AI models' math reasoning sat at a dismal -7.44 on a scale where human performance is set to zero (as of April 2024). The ORCA team believes this is because many benchmark datasets have leaked into model training data, akin to students being handed the answers before an exam.

ORCA aims to evaluate actual computational reasoning, not just pattern memorization. In their study, distributed via arXiv and Omni Calculator's website, the team found that ChatGPT-5 and its peers struggled with rounding and calculation mistakes, achieving only 45-63% accuracy.
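
The article doesn't spell out how ORCA grades a numeric answer, but a plausible sketch makes clear why rounding slips count as failures: parse the model's final number and accept it only within a tight tolerance of the reference value. The 0.1% relative tolerance below is an assumption for illustration, not ORCA's documented rule.

```python
import math

def grade(model_answer: float, reference: float, rel_tol: float = 1e-3) -> bool:
    """Accept the answer only if it matches the reference within tolerance."""
    return math.isclose(model_answer, reference, rel_tol=rel_tol)

# A correctly computed value passes...
print(grade(350.0, 350.0))  # True
# ...but an answer rounded too aggressively mid-calculation does not.
print(grade(343.0, 350.0))  # False: a 2% rounding error is scored as wrong
```

Under a rule like this, a model that sets up the problem correctly but rounds an intermediate value still loses the point, which would be consistent with the 45-63% accuracy range reported above.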

The evaluation, conducted in October 2025, used 500 math prompts across various categories. Gemini 2.5 Flash topped the charts with 63% accuracy, closely followed by Grok 4. Claude Sonnet 4.5 and DeepSeek V3.2 lagged behind; the former never scored above 65% in any single category.

And this is the part most people miss: these scores are just a snapshot in time. Models are constantly adjusted and revised. Take, for example, the prompt about LEDs and resistors. Claude Sonnet 4.5 offered two answers to it, one correct and one incorrect, highlighting the model's inconsistency.
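
The exact ORCA prompt isn't reproduced in the article, but a typical LED/resistor question boils down to Ohm's law plus a rounding decision, which is precisely where the study says models stumble. The component values below are hypothetical, chosen only to illustrate the calculation.

```python
# Hypothetical values: a 9 V supply, a 2.0 V forward drop, a 20 mA target.
V_SUPPLY = 9.0    # supply voltage in volts (assumed)
V_FORWARD = 2.0   # LED forward voltage drop in volts (assumed)
I_TARGET = 0.020  # desired LED current in amperes (assumed)

# Ohm's law on the voltage left over after the LED's forward drop:
# R = (V_supply - V_forward) / I
r_exact = (V_SUPPLY - V_FORWARD) / I_TARGET
print(f"Exact resistance: {r_exact:.1f} ohms")  # 350.0 ohms

# Designers round UP to the next standard value so the current stays at or
# below the target; rounding down to 330 ohms would overdrive the LED,
# exactly the kind of slip a strict grader marks wrong.
E12_VALUES = [330, 390, 470, 560, 680, 820, 1000]
r_standard = next(r for r in E12_VALUES if r >= r_exact)
print(f"Nearest safe E12 value: {r_standard} ohms")  # 390 ohms
```

A model that answers "330 ohms" has done the division correctly and flubbed only the final rounding direction, which may explain how a single prompt can elicit both a right and a wrong answer.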

So, while AI benchmarks may not always add up, they provide a fascinating glimpse into the limitations and potential of these powerful tools. The question remains: as we continue to push the boundaries of AI, will we ever truly master the art of machine mathematics? Feel free to share your thoughts in the comments!
