In a world where truth is relative, even the simplest math can become a challenge. AI, it seems, is no exception. Despite their impressive capabilities, large language models struggle with basic arithmetic, as demonstrated by the ORCA benchmark.
ORCA, developed by a team of scientists from Omni Calculator and various European universities, presents a series of math-oriented questions across diverse scientific fields. The results? Five leading LLMs, including ChatGPT-5 and Gemini 2.5 Flash, failed to impress, scoring a mere 63% or less.
But here's where it gets controversial: other benchmarks, like GSM8K and MATH-500, paint a different picture. Some AI models have achieved scores of 95% or higher, suggesting near-perfect mathematical prowess. However, the researchers behind ORCA argue that these benchmarks lack scientific rigor and that LLMs still make logical and arithmetic errors.
According to Oxford University's Our World in Data, AI models' math reasoning sits at a dismal -7.44 relative to the human baseline (as of April 2024). The ORCA team believes this is because many benchmark datasets have been incorporated into model training data, akin to students being given the answers before an exam.
ORCA aims to evaluate actual computational reasoning, not just pattern memorization. In their study, distributed via arXiv and Omni Calculator's website, the team found that ChatGPT-5 and its peers struggled with rounding and calculation mistakes, achieving only 45-63% accuracy.
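The study doesn't publish its grading code, but the rounding failure it describes is easy to illustrate: rounding an intermediate result before the final step shifts the answer. A minimal sketch, with made-up numbers:

```python
# Premature rounding -- the kind of calculation mistake ORCA reportedly
# penalizes. Rounding mid-calculation propagates error into the result.
exact = (7 / 3) * 9               # round only at the end -> 21.0
rounded_mid = round(7 / 3, 2) * 9  # 2.33 * 9 -> 20.97, off by 0.03

print(round(exact, 2), round(rounded_mid, 2))  # 21.0 20.97
```

A model that rounds 7/3 to 2.33 before multiplying would be marked wrong by any checker with a tight tolerance, even though its reasoning was sound.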
The evaluation, conducted in October 2025, used 500 math prompts across various categories. Gemini 2.5 Flash topped the charts with 63% accuracy, closely followed by Grok 4. Claude Sonnet 4.5 and DeepSeek V3.2 lagged behind, and the former failed to score above 65% in any category.
And this is the part most people miss: these scores are just a snapshot in time. Models are constantly adjusted and revised. Take, for example, the prompt about LEDs and resistors. Claude Sonnet 4.5 offered two answers, one correct and one incorrect, highlighting the model's uncertainty.
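The article doesn't reproduce the LED prompt's actual numbers, but the underlying calculation is a one-line application of Ohm's law across a current-limiting resistor. A sketch with hypothetical values:

```python
# Hypothetical LED current-limiting resistor problem, the style of
# single-formula physics question ORCA reportedly uses.
# R = (V_supply - V_forward) / I_led  (Ohm's law across the resistor)

def led_resistor_ohms(v_supply: float, v_forward: float, i_led: float) -> float:
    """Return the resistance (ohms) that limits LED current to i_led amps."""
    if i_led <= 0:
        raise ValueError("LED current must be positive")
    return (v_supply - v_forward) / i_led

# Assumed values: 5 V supply, 2 V forward drop, 20 mA target current.
print(round(led_resistor_ohms(5.0, 2.0, 0.020), 1))  # 150.0
```

There is exactly one correct answer for a given set of inputs, which is why a model hedging between two answers, as Claude Sonnet 4.5 did, counts as a miss.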
So, while AI benchmarks may not always add up, they provide a fascinating glimpse into the limitations and potential of these powerful tools. The question remains: as we continue to push the boundaries of AI, will we ever truly master the art of machine mathematics? Feel free to share your thoughts in the comments!