ARC-AGI-2 Measures AI’s Ability to Problem Solve With Fewer Resources

A new benchmark to evaluate artificial general intelligence has leading artificial intelligence models stumped.
The Arc Prize Foundation’s ARC-AGI-2 test aims to assess the problem-solving abilities of AI models in novel scenarios. OpenAI’s o1-pro and DeepSeek’s R1 scored 1% and 1.3% on the test, respectively, while non-reasoning models including GPT-4.5, Claude 3.7 Sonnet and Gemini 2.0 Flash similarly hovered around 1%.
ARC-AGI-2 builds on its predecessor ARC-AGI-1, which was unbeaten for nearly five years until OpenAI’s o3 model attained human-level performance in late 2024. But the model’s accomplishment came with significant computational expense. The new test introduces a focus on efficiency, measuring a model’s problem-solving ability and the resources it consumes.
AI researcher and Arc Prize Foundation Co-Founder François Chollet said in a post on X, formerly Twitter, that the updated test resolves shortcomings of ARC-AGI-1. The revised benchmark discourages reliance on excessive computing power, focusing on efficiency as a critical measure of progress toward AGI.
“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” said Greg Kamradt, president of the Arc Prize Foundation, in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component.”
ARC-AGI-2 presents puzzle-like challenges that require AI models to analyze visual patterns from multicolored square grids and generate solutions without prior exposure. Unlike traditional benchmarks that may favor models with massive computational power, the test is designed to reward adaptive reasoning. A baseline group of more than 400 human participants averaged 60% accuracy, far surpassing the AI results.
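For readers unfamiliar with the task format: ARC-AGI-1 tasks were published as JSON objects pairing a few example input/output grids with a held-out test grid, where each grid is a matrix of integers standing for colors. The minimal Python sketch below illustrates that style of task; the toy puzzle and the hand-written solver are invented for illustration, and the exact ARC-AGI-2 schema is assumed here to resemble its predecessor's.

```python
# A toy ARC-style task: grids are lists of lists of ints 0-9, each
# integer denoting one of ten colors. A solver sees the "train" pairs
# and must produce the output for each "test" input. This mirrors the
# published ARC-AGI-1 JSON format; ARC-AGI-2's schema is assumed similar.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # expected output: [[0, 3], [3, 0]]
    ],
}

def solve(grid):
    """Hand-written rule for this toy task only: swap each cell's
    value with zero, mirroring the nonzero/zero pattern."""
    nonzero = max(v for row in grid for v in row)
    return [[0 if v else nonzero for v in row] for row in grid]

# Verify the rule against the training pairs, then apply it to the test input.
for pair in toy_task["train"]:
    assert solve(pair["input"]) == pair["output"]
print(solve(toy_task["test"][0]["input"]))  # [[0, 3], [3, 0]]
```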
While OpenAI’s o3 model dominated ARC-AGI-1 with a 75.7% score, its performance plummeted to 4% on ARC-AGI-2, even while spending $200 worth of compute per attempt on some tasks. By introducing constraints to prevent brute-force approaches, ARC-AGI-2 aims to provide a more accurate measure of AI’s problem-solving capabilities.
Unlike tests that emphasize memorization or repetitive training, ARC-AGI-2 seeks dynamic reasoning and adaptability, potentially providing clearer insights into whether AI can genuinely learn and apply knowledge in unfamiliar situations.
“We know that brute-force search could eventually solve ARC-AGI given unlimited resources and time to search. This would not represent true intelligence,” the blog post said. “Intelligence is about finding the solution efficiently, not exhaustively.”
In addition to announcing the benchmark, the foundation launched the Arc Prize 2025 competition. Participants are tasked with achieving an 85% accuracy score on ARC-AGI-2 while limiting compute costs to $0.42 per task, with the goal of rewarding innovation in AI reasoning over reliance on expensive computational power.
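To make the dual criterion concrete, here is a hedged sketch of how such a check could be expressed. The 85% and $0.42 thresholds come from the announcement as reported above; the function, its inputs, and the assumption that the cap applies to each task individually are illustrative, not the competition's actual evaluation harness.

```python
# Hypothetical check of the Arc Prize 2025 criteria described above:
# at least 85% accuracy with compute spending capped at $0.42 per task.
# The per-task records in the example are invented for illustration.
TARGET_ACCURACY = 0.85    # 85% on ARC-AGI-2
COST_CAP_PER_TASK = 0.42  # dollars of compute per task

def meets_prize_criteria(results):
    """results: list of (solved: bool, cost_usd: float), one per task.
    Applies the cap to each task individually; reading the cap as an
    average cost per task would also be consistent with the wording."""
    accuracy = sum(solved for solved, _ in results) / len(results)
    within_budget = all(cost <= COST_CAP_PER_TASK for _, cost in results)
    return accuracy >= TARGET_ACCURACY and within_budget

# Example over three invented task records: only 2 of 3 solved (67%),
# so the accuracy bar is missed even though every task stays under budget.
print(meets_prize_criteria([(True, 0.30), (True, 0.41), (False, 0.25)]))  # False
```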