New AGI Test ‘ARC-AGI-2’ Challenges Even Advanced AI Models

Date: March 25, 2025

The ARC Prize Foundation has announced a new AGI test called ‘ARC-AGI-2’ that is meant to measure an AI system’s general fluid intelligence.

The test gives AI systems tasks they have never seen before. These tasks are relatively easy for humans but hard for AI.

The test keeps the format of its predecessor, ARC-AGI-1, but the tasks are more demanding. According to the foundation, this gives a stronger signal of a model’s real fluid intelligence.

The ‘ARC-AGI-2’ benchmark is designed so that the systems being tested must demonstrate both high adaptability and high efficiency.

What separates ARC-AGI from alternative benchmarks is its focus. Most benchmarks test ‘PhD++ skills,’ meaning specialized, expert-level knowledge. This test takes the opposite approach and targets tasks that ordinary people can solve but that AI systems still struggle with.

As the official announcement states,

“Every ARC-AGI-2 task was solved by at least 2 humans in 2 attempts or less in a controlled study with hundreds of human participants. This matches the rules we hold for AI, which gets two attempts per task.”

Here’s How the Results Looked

François Chollet, co-founder of the ARC Prize Foundation, wrote on X,

“ARC-AGI-2 is fully human-calibrated. We tested these tasks with 400 people in live sessions, and we only kept tasks that could reliably be solved by multiple people. Each eval set (public, private, semi-private) has the exact same human difficulty – average people in our test sample achieve 60% with no prior training, and a panel of 10 people achieve 100%.”

Here are the results based on the official ARC-AGI Leaderboard.

System	ARC-AGI-1	ARC-AGI-2	Efficiency (cost/task)
Human panel (at least 2 humans)	98%	100%	$17
Human panel (average)	64.20%	60%	$17
o3-low (CoT + Search/Synthesis)	75.70%	4%*	$200
o1-pro (CoT + Search/Synthesis)	~50%	1%*	$200*
The ARChitects (Kaggle 2024 Winner)	53.50%	3%	$0.25
o3-mini-high (Single CoT)	35%	0.00%	$0.41
r1 and r1-zero (Single CoT)	15.80%	0.30%	$0.08
gpt-4.5 (Pure LLM)	10.30%	0.00%	$0.29

How Does ARC-AGI-2 Work?

ARC-AGI-2 tests AI fluid intelligence with novel visual puzzles that demand adaptability and efficiency rather than brute force. Compared with ARC-AGI-1, it puts more weight on symbol interpretation, multi-rule reasoning, and rules that depend on context. It comes with a 1,000-task training set and evaluation sets of 120 tasks each.

AI gets two attempts per task, yet top models like o3-low (4%) and o1-pro (1%) trail far behind the human average of 60%. The benchmark is tied to ARC Prize 2025, which sets a target of 85% accuracy at no more than $0.42 per task.
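To make the two-attempt rule concrete, here is a minimal Python sketch of how such a score could be computed. It is an illustration, not the official scoring code: it assumes each answer is a grid of integers and that an attempt counts only if it matches the expected grid exactly. The function names `task_solved` and `score` are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # assumption: an answer is a 2-D grid of integers


def task_solved(attempts: List[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """Return True if any of the first `max_attempts` grids matches exactly."""
    return any(attempt == expected for attempt in attempts[:max_attempts])


def score(predictions: List[List[Grid]], targets: List[Grid]) -> float:
    """Fraction of tasks solved within the allowed number of attempts."""
    solved = sum(task_solved(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)


# Toy data (made up for illustration): three tasks and their submitted attempts.
targets = [[[1, 0], [0, 1]], [[2, 2]], [[3]]]
predictions = [
    [[[1, 0], [0, 1]]],       # correct on the first attempt
    [[[0, 0]], [[2, 2]]],     # correct on the second attempt
    [[[4]], [[5]], [[3]]],    # correct only on a third attempt, which is ignored
]

print(f"{score(predictions, targets):.0%}")  # two of three tasks solved
```

In this toy run, the third task counts as unsolved because the correct grid only appears after the two allowed attempts.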

In an Additional Announcement, ARC Prize 2025 Returns

The ARC Prize has returned to Kaggle, starting this week. Developers who achieve 85% accuracy while spending no more than $0.42 per task are eligible. This dual focus on high performance and low cost aims to drive innovation toward efficient, adaptable AI systems, two key traits of artificial general intelligence (AGI).

The contest offers $1 million in prizes, including a $700K Grand Prize for the first team to hit the 85% threshold within Kaggle’s computing limits.
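As a rough illustration of how far current systems are from that bar, the short Python sketch below applies the two thresholds quoted above (85% accuracy, $0.42 per task) to the ARC-AGI-2 figures from the leaderboard table in this article. It is a simplification: the helper `meets_prize_bar` is hypothetical, starred table values are treated as plain numbers, and Kaggle's computing limits are not modeled.

```python
ACCURACY_TARGET = 0.85   # 85% accuracy, as quoted in the article
COST_LIMIT_USD = 0.42    # maximum spend per task, as quoted in the article


def meets_prize_bar(accuracy: float, cost_per_task: float) -> bool:
    """True only if both the accuracy and the cost thresholds are satisfied."""
    return accuracy >= ACCURACY_TARGET and cost_per_task <= COST_LIMIT_USD


# (ARC-AGI-2 accuracy, cost per task in USD), copied from the table above.
leaderboard = {
    "o3-low": (0.04, 200.00),
    "o1-pro": (0.01, 200.00),
    "The ARChitects": (0.03, 0.25),
    "o3-mini-high": (0.00, 0.41),
    "r1 and r1-zero": (0.003, 0.08),
    "gpt-4.5": (0.00, 0.29),
}

eligible = [name for name, (acc, cost) in leaderboard.items()
            if meets_prize_bar(acc, cost)]
print(eligible)  # empty list: no listed system clears both thresholds
```

Several listed systems already fit within the cost limit, so on these numbers the gap is almost entirely in accuracy.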

By Arpit Dubey

