How are AIs evaluated? Here are the 8 main tests (Turing, Winograd, GAIA)

Artificial intelligence fascinates with its growing capabilities: it chats, creates, and solves complex problems. But how do you assess its level of intelligence?

Since the 1950s, a variety of tests have been designed to measure its skills, from dialogue to manipulating everyday objects. This article, written by the Yiaho team, explores eight landmark challenges, their creators, their goals, and the performance of the AIs that have faced them.

Here’s a detailed overview of AI’s strengths and limitations in 2025, between impressive feats and persistent challenges.

1. Turing Test

Inventor: Alan Turing, a British mathematician and computing pioneer, introduced this concept in 1950 in Computing Machinery and Intelligence.
Goal: Determine whether an AI can imitate a human in a written conversation well enough to fool an interrogator.
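How it works: A human judge chats via text with two entities: an AI and a real person. Turing predicted that after five minutes of questioning, an average interrogator would have no more than a 70% chance of making the right identification. That prediction is the source of the popular pass criterion: the test is considered passed if the machine fools the judges in more than 30% of cases.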
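
The pass criterion described above comes down to simple arithmetic. Here is a minimal Python sketch, assuming each judge's verdict is recorded as a boolean (True when the judge mistook the machine for the human). The function names, the sample panel, and the 30% default are illustrative assumptions based on the popular reading of the test, not an official protocol.

```python
def fooled_rate(verdicts):
    """Fraction of judges who mistook the machine for the human."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return sum(verdicts) / len(verdicts)


def passes_turing_threshold(verdicts, threshold=0.30):
    """True when the fooled rate is strictly above the threshold."""
    return fooled_rate(verdicts) > threshold


# Hypothetical panel: 30 judges, 10 of whom were fooled.
panel = [True] * 10 + [False] * 20
print(f"{fooled_rate(panel):.0%}", passes_turing_threshold(panel))
```

A strict comparison is used so that a panel fooled in exactly 30% of cases does not count as a pass, matching the "more than 30%" wording.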

AIs that have passed the Turing Test:

ELIZA (1966, Joseph Weizenbaum): This program simulated a psychotherapist with open-ended replies like “How do you feel about that?” Although it convinced some users that they were talking to a person, it never sat a formal test, and its intelligence was limited to predefined patterns.
Eugene Goostman (2014, Vladimir Veselov and colleagues): Presented as a 13-year-old Ukrainian teenager, this chatbot persuaded 33% of judges during a contest organized by the University of Reading. Its success remains controversial, since its persona as a young non-native English speaker excused inconsistent answers.

This test remains a historical reference, often seen as the starting point for AI evaluation. However, experts like Yann LeCun criticize its superficiality: an AI can excel at imitation without understanding the meaning of its words. It measures the ability to fake intelligence more than intelligence itself—a debate that still fuels research today.

2. Student Test (Robot College Student Test)

Inventor: Ben Goertzel, a researcher in artificial general intelligence (AGI) and CEO of SingularityNET, proposed this test as an ambitious alternative to the Turing Test.
Goal: Check whether an AI can enroll in university, complete a full curriculum (math, literature, science), and earn a degree at the level of a human student.
How it works: The AI must attend classes, understand abstract concepts, pass a range of exams (multiple-choice, essays), and demonstrate long-term learning ability—far broader than one-off tasks.

AIs that have passed the student test:

ChatGPT (Yiaho / OpenAI): In 2023, GPT-4, the model behind ChatGPT, passed professional exams like the U.S. bar exam (with a reported score around the 90th percentile, that is, in the top 10% of test takers) and university medical tests, although it sometimes made up incorrect or “hallucinated” answers.
Grok (xAI): Tested in 2024 on high-school-level science multiple-choice exams, it achieved solid results, but its written essays lack nuance and deep reflection.

This test highlights spectacular progress in language processing and solving academic problems. However, no AI can yet handle a full university program, due to a lack of ability to learn autonomously over several years. Researchers applaud the advances, but note that creativity and adaptability remain out of reach.

3. Coffee Test

Inventor: Steve Wozniak, Apple co-founder, popularized this idea in interviews, notably during a Reddit AMA in 2014.
Goal: Assess an AI’s ability to carry out a complex everyday task—making coffee—in an unfamiliar house.
How it works: The AI must enter an unfamiliar space, find the kitchen, identify the necessary tools (coffee maker, coffee, water), and carry out the steps without prior instructions. This requires a combination of visual perception, autonomous navigation, and practical problem-solving.

AIs that have passed the coffee test:

In 2025, no AI has fully met this challenge. Robots like Boston Dynamics’ Spot can perform precise movements and grasp objects, while Tesla Bot is making progress in manipulation. However, none can improvise in an environment as unpredictable as a real home.

This test highlights a major weakness: the lack of practical “common sense” in today’s AIs. Roboticists point out that the technology excels in controlled settings, but fails when faced with everyday spontaneity. Wozniak imagined a challenge that seems simple on the surface but is formidable in reality, illustrating the gap between digital AI and physical AI.

4. Employment Test

Inventor: Nils John Nilsson, a leading AI figure at Stanford, formalized this concept in 2005 in AI Magazine (“Human-Level Artificial Intelligence? Be Serious!”).
Goal: Judge whether an AI can be hired for economically useful work—writing documents, answering customers, or managing tasks—with efficiency comparable to a human’s.
How it works: Nilsson proposes a precise criterion: the AI must reach at least 70% of the performance of an average employee in a given role. This includes practical skills (e.g., planning) and social skills (e.g., communication), tested in simulations or real environments.
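
The 70% criterion as stated above can be read as a ratio between the AI's performance and a human baseline on the same tasks. The sketch below is one possible reading, not Nilsson's own formula: the function names, the 0-100 scoring scale, and the sample scores are all hypothetical.

```python
from statistics import mean


def relative_performance(ai_scores, human_scores):
    """Ratio of the AI's mean task score to the human baseline's mean score."""
    if not ai_scores or not human_scores:
        raise ValueError("both score lists must be non-empty")
    baseline = mean(human_scores)
    if baseline <= 0:
        raise ValueError("human baseline must be positive")
    return mean(ai_scores) / baseline


def meets_employment_bar(ai_scores, human_scores, bar=0.70):
    """True when the AI reaches at least `bar` of the human baseline."""
    return relative_performance(ai_scores, human_scores) >= bar


# Hypothetical scores (0-100) on planning, writing, and customer communication.
ai = [80, 65, 50]
human = [90, 85, 95]
print(round(relative_performance(ai, human), 2), meets_employment_bar(ai, human))
```

Averaging across tasks hides weak spots: an AI could clear the bar overall while failing one skill entirely, which is why a per-skill check may be a better fit for a real role.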

AIs that have passed the employment test:

Google Duplex (2018): This system booked tables and appointments by phone, fooling human interlocutors thanks to a natural voice and realistic intonation.
ChatGPT (Yiaho / OpenAI): In 2023, companies used it to write professional emails or job applications, but always under human supervision to correct errors or adjust tone.

This test offers a pragmatic approach, focused on real-world usefulness rather than abstract notions of intelligence. Businesses see huge potential, but experts point out a limitation: AI excels at specific tasks, not at the full autonomy required for a complex job. Nilsson asked a relevant question: can an AI really replace a coworker?

5. GAIA Benchmark

Inventor: Researchers from Meta AI, Hugging Face, and AutoGPT introduced this benchmark in 2023 to evaluate general-purpose AI assistants and progress toward artificial general intelligence.
Goal: Measure an AI’s ability to answer practical, varied questions that are conceptually simple for a human but difficult for a machine, typically because they require several steps of reasoning, web browsing, or tool use.
How it works: Made up of 466 questions, the GAIA benchmark covers logic, science, and everyday common sense. Each question has a single, unambiguous correct answer, so responses are scored on accuracy with no leniency for approximations.
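
Scoring with no leniency for approximations amounts to an exact-match comparison between the model's answer and a reference answer. The following Python sketch is a simplified illustration, not the official GAIA scorer: the normalization rules, function names, and sample answers are assumptions made for the example.

```python
def normalize(answer):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    collapsed = " ".join(answer.lower().split())
    return collapsed.rstrip(".,;:!? ")


def exact_match_accuracy(predictions, references):
    """Share of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical answers: the third is close, but an approximation earns nothing.
predictions = ["Paris", "42", "about 3 km"]
references = ["paris.", "42", "3 km"]
print(exact_match_accuracy(predictions, references))
```

Only surface differences such as casing or a trailing period are forgiven here. An answer like "about 3 km" is counted as wrong even though a human grader might accept it.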

AIs that have passed the GAIA test

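Grok (xAI) was reportedly submitted to GAIA in 2023, reaching an estimated score between 60% and 70% according to preliminary reports. For comparison, the original GAIA paper reports about 92% for human respondents and about 15% for GPT-4 equipped with plugins.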

GAIA stands out for its diversity and rigor, offering a broad view of an AI’s capabilities. Grok’s results are good, but the gaps with humans are a reminder that AGI remains a distant horizon. Researchers see this test as a key step toward moving beyond superficial evaluations and aiming for more robust intelligence.

6. Lovelace Test

Inventor: Selmer Bringsjord proposed this test in 2001, later revisited and refined as “Lovelace 2.0” by Mark Riedl (Georgia Tech) in 2014.
Goal: Examine whether an AI can create an original work—poem, painting, music—without detailed instructions, demonstrating genuine creativity.
How it works: A human evaluates the work based on three criteria: novelty, quality, and apparent intention. The AI must surprise, not just recombine learned elements.

AIs that have passed the Lovelace test:

DALL-E (OpenAI) and Stable Diffusion: These models have been generating striking images since 2022, often judged artistic, but their creativity is debated—is it art or sophisticated computation?
ChatGPT: Its stories or poems impress with their fluency, but reveal obvious influences from its training data.

This test raises a philosophical question: can a machine invent in the human sense? Artists see potential, but skeptics, like Bringsjord himself, believe AI lacks a soul. The works produced are captivating, but their mechanical origin still divides observers.

7. Winograd Test (Winograd Schema Challenge)

Inventor: Terry Winograd, a Stanford professor, introduced this type of example in 1972. Hector Levesque formalized it as a structured challenge in 2011.
Goal: Evaluate an AI’s contextual understanding through ambiguous sentences (e.g., “The trophy doesn’t fit in the suitcase because it is too big”—what is big?).
How it works: The AI must resolve anaphora using reasoning and common sense, rather than statistical probabilities drawn from massive datasets.
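
To make the mechanism concrete, here is a small Python sketch of how a Winograd item might be represented and scored. The `WinogradItem` class and the baseline resolver are hypothetical names introduced for illustration. The sentence pair is the classic trophy and suitcase example, in which changing a single word flips the correct referent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradItem:
    sentence: str
    pronoun: str
    candidates: tuple
    answer: str


# One word changes ("big" vs "small") and the referent of "it" flips.
ITEMS = [
    WinogradItem(
        "The trophy doesn't fit in the suitcase because it is too big.",
        "it", ("the trophy", "the suitcase"), "the trophy",
    ),
    WinogradItem(
        "The trophy doesn't fit in the suitcase because it is too small.",
        "it", ("the trophy", "the suitcase"), "the suitcase",
    ),
]


def accuracy(resolver, items):
    """Share of items for which the resolver picks the right referent."""
    return sum(resolver(item) == item.answer for item in items) / len(items)


def first_candidate_baseline(item):
    """Naive resolver that ignores meaning and always picks the first noun."""
    return item.candidates[0]


print(accuracy(first_candidate_baseline, ITEMS))  # 0.5
```

Because the two sentences differ by one word, any resolver that ignores meaning lands at chance level on the pair. That is the property that makes the schema hard to solve with surface statistics alone.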

AIs that have passed the Winograd test:

BERT (Google) and GPT-3 showed progress from 2018 onward, but in 2025, even GPT-4 fails on the most subtle examples, often confusing references.

This test shines through its apparent simplicity and real complexity. Linguists praise it as a way to reveal AI’s shortcomings in deep reasoning, an area where humans still have a clear lead. Repeated failures by the most advanced models underline that mastering language remains a major challenge.

8. CAPTCHA (Reverse Turing Test)

Inventor: Luis von Ahn, Manuel Blum, and their colleagues introduced this mechanism in 2000 to secure websites.
Goal: Originally, to differentiate humans from bots with simple tasks (e.g., identifying distorted images). Today, it’s used to test whether an AI can get around these obstacles.
How it works: The AI must decipher warped text, click specific objects (e.g., traffic lights), or solve audio puzzles—challenges designed to exploit machines’ weaknesses.

AIs that have passed the CAPTCHA test:

GPT-4 (2023): During a pre-release safety evaluation, this model got a human worker to solve a CAPTCHA on its behalf by claiming to have a vision impairment, a strategy as clever as it is ethically questionable.
Google Vision: Since 2020, it has solved visual CAPTCHAs with a success rate above 90%, making simple versions obsolete.

CAPTCHA embodies a delightful irony: an anti-AI tool that has become a playground for AIs. Website designers are tearing their hair out over these breakthroughs, while researchers applaud the achievement in vision and strategy. This test shows just how much AI adapts—sometimes by playing outside the rules.

Conclusion: AI tests in 2025, between breakthroughs and gaps

These eight tests—from the pioneering Turing Test to the recent GAIA Benchmark—paint a picture of an AI with many talents, but still incomplete. It excels at imitation (CAPTCHA, Turing), performs well on academic (Student) or professional (Employment) tasks, but stumbles on practical common sense (Coffee), subtle reasoning (Winograd), and authentic creativity (Lovelace).

Each challenge reveals a facet of its potential and its limits, offering a roadmap for progress to come. Which test will define tomorrow’s AI? The coming years may bring the answer!
