In artificial intelligence, evaluating what systems can actually do has become a challenge, and also something of a contest to see which company offers the best AI on the market! Benchmarks, or reference tests, play a key role in measuring the progress of AI models.
Among the latest arrivals, the GAIA benchmark (General AI Assistants), developed in 2023 by researchers from Meta (its FAIR and GenAI teams), Hugging Face, and AutoGPT, stands out as an important step toward understanding artificial general intelligence (AGI).
But what is GAIA, why is it different, and what does it tell us about the current state of AI? This article, written by the Yiaho team, explains everything, simply and in detail!
What is the GAIA Benchmark?
GAIA is a set of 466 questions designed to test the capabilities of AI assistants in realistic, real-world scenarios.
Published on arXiv on November 21, 2023, this benchmark aims to assess whether an AI system can be as robust as an average human across a variety of tasks. Results are also tracked on a public leaderboard hosted on Hugging Face.
Unlike other tests that focus on specialized skills (such as solving complex math problems or answering legal questions), GAIA emphasizes questions that are conceptually simple for humans but often difficult for advanced AIs.
These questions require core skills such as:
- Reasoning: making logical deductions from given information.
- Multimodal handling: interpreting text, images, tables, or other formats.
- Web browsing: searching for information online.
- Tool use: knowing when and how to use external resources.
The goal? Identify whether an AI can truly act as a general-purpose assistant, able to meet practical everyday needs—not just excel in narrow domains.
Why is GAIA different?
GAIA’s philosophy marks a shift from traditional benchmarks. Here’s why:
Simple for humans, challenging for AIs
While some recent tests aim to challenge humans with ultra-complex tasks (for example, professional exams), GAIA takes the opposite approach. Its questions are intuitive for a non-expert human (92% success rate), but they expose the gaps in today’s AI models: in the original evaluation, GPT-4, even with plugins, did not exceed a 15% success rate.
Focus on the real world
GAIA isn’t limited to artificial environments or closed databases. It asks AIs to adapt to open-ended situations, such as searching for information on the web or interpreting various files (images, spreadsheets, etc.).
Robustness as an AGI criterion
GAIA’s creators believe that artificial general intelligence will not be measured only by the ability to outperform humans at niche tasks, but by robustness—that is, the ability to handle a wide range of problems with the same reliability as an average human.
How does GAIA work?
The benchmark is organized into three difficulty levels:
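- Level 1: Questions that generally need no tools, or at most one, and only a few steps. The best language models (LLMs) with strong reasoning should be able to handle them.
- Level 2: Tasks requiring more steps and a combination of several tools.
- Level 3: Complex problems calling for long sequences of actions and free use of tools; solving them would indicate a significant leap in AI capabilities.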
Each question has a single, factual answer (a word, a number, or a short list), which makes automated, objective evaluation easier. Some include additional files (images, tables) to test multimodality.
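To see why single factual answers make automated scoring straightforward, here is a minimal Python sketch of a normalized exact-match check. It is not GAIA's official scorer: the function names (`normalize_text`, `parse_number`, `is_correct`) and the normalization rules are assumptions chosen for illustration.

```python
import re
from typing import Optional


def normalize_text(value: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", value.strip().lower())
    return collapsed.rstrip(".!?")


def parse_number(value: str) -> Optional[float]:
    """Return the value as a float if it looks like a number, else None."""
    cleaned = value.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_correct(prediction: str, reference: str) -> bool:
    """Compare a model answer to the reference answer after normalization."""
    ref_number = parse_number(reference)
    if ref_number is not None:
        # Numeric reference: compare as numbers so "1,000" matches "1000".
        pred_number = parse_number(prediction)
        return pred_number is not None and pred_number == ref_number
    if "," in reference:
        # Short list: compare element by element, keeping the order.
        ref_items = [normalize_text(item) for item in reference.split(",")]
        pred_items = [normalize_text(item) for item in prediction.split(",")]
        return ref_items == pred_items
    # Plain word or phrase: compare the normalized strings.
    return normalize_text(prediction) == normalize_text(reference)
```

With this sketch, `is_correct(" Paris. ", "paris")` is accepted, while `is_correct("about 42", "42")` is rejected because the answer must be the bare value. A stricter or looser rule set is a design choice; the point is only that no human judge is needed.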
Concrete examples of GAIA questions
Level 1: “What was the number of participants in a study mentioned on a specific website in 2022?”
- For a human: Go to the site, read the article, spot the number. Simple, with a bit of attention.
- For an AI: Requires browsing the web, finding the right page, and extracting the exact information. GPT-4, even with plugins, can fail due to misinterpretation or lack of precision.
Level 2: “How many images are in the 2022 version of the Wikipedia article on LEGO?”
- For a human: Open Wikipedia, count the images. Tedious but doable.
- For an AI: Requires understanding the question, accessing a specific version of the page, and correctly counting visual elements—a complex multimodal task.
Level 3: “Which city hosted Eurovision 2022 according to the official website?”
- For a human: Look up the official site, verify the info. A quick search is enough.
- For an AI: Requires precise browsing, handling reliable sources, and producing a correct synthesis—often out of reach for current models without adjustments.
What does GAIA tell us about AI today?
GAIA’s initial results are revealing:
- Humans: 92% success rate—proof that the questions are manageable for most people.
- GPT-4 with plugins: 15% success rate—a huge gap, despite this model’s advanced capabilities.
This gap shows that large language models (LLMs), while impressive in areas like text generation or academic tasks, still struggle to handle practical scenarios that require a combination of reasoning, adaptability, and interaction with the real world. Even with external tools (plugins), their limits in contextual understanding and smart resource use are clear.
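A success rate like the ones above is simply the share of questions answered correctly, and it can be broken down by difficulty level. The sketch below shows one way to compute that breakdown in Python; the function name `success_rate_by_level` and the toy outcomes are hypothetical and are not GAIA results.

```python
from collections import defaultdict


def success_rate_by_level(results):
    """Return {level: percent correct} from (level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, ok in results:
        totals[level] += 1
        if ok:
            correct[level] += 1
    return {level: 100.0 * correct[level] / totals[level] for level in sorted(totals)}


# Made-up outcomes for illustration only; these are not GAIA results.
toy_results = [(1, True), (1, False), (2, False), (2, False), (3, False), (1, True)]
rates = success_rate_by_level(toy_results)

for level, rate in rates.items():
    print(f"Level {level}: {rate:.1f}% correct")
```

Reporting per-level rates alongside the overall figure makes it easier to tell whether a model fails everywhere or only on the longer, tool-heavy questions.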
GAIA’s strengths and limitations
Strengths:
- Practicality: The questions reflect real use cases for an AI assistant.
- Hard to game: Because each answer is a specific fact reached through several steps, it is difficult to guess or brute-force a correct answer without actually doing the work.
- Interpretability: Its simplicity makes it easy to understand why an AI fails or succeeds.
Limitations:
- Web dependency: Some questions rely on online sources that may change or disappear over time.
- Creation cost: Each question requires about two hours of human work to design and validate.
- Lack of diversity: While varied, the questions may not cover all cultures or languages.
Why isn’t there more data on Grok or Gemini?
Although the GAIA benchmark has been tested with models like GPT-4, official data on other advanced AIs, such as xAI’s Grok 3 or Google’s Gemini, is not yet available. Either companies don’t systematically publish their results on public benchmarks like GAIA, or the tests haven’t yet been carried out at scale.
The official Hugging Face leaderboard is evolving slowly, and while unverified speculation on X mentions performance around 50–60% for these models, nothing is confirmed. In the meantime, the human score (92%) and the scores of the first tested models remain the main reference points for interpreting results on this benchmark.
GAIA and the future of AI
Since its release, GAIA has attracted interest from the scientific community and developers. A public leaderboard on Hugging Face makes it possible to track the performance of tested models. In December 2024, for example, H2O.ai’s h2oGPTe agent reached a score of 65%, ahead of entries from Google (49%) and Microsoft (38%), but still far from the human 92%. This shows real progress, but also how far there is still to go.
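To make the remaining distance concrete, the short Python sketch below ranks the scores quoted in this article and computes each one's gap to the 92% human success rate. The names `quoted_scores` and `gaps_to_human` are hypothetical, and the figures are the ones cited above rather than a live reading of the leaderboard.

```python
HUMAN_BASELINE = 92.0  # human success rate cited in this article

# Scores as quoted in this article (percent of questions answered correctly).
quoted_scores = {
    "H2O.ai h2oGPTe": 65.0,
    "Google": 49.0,
    "Microsoft": 38.0,
}


def gaps_to_human(scores, baseline=HUMAN_BASELINE):
    """Return (name, score, gap in percentage points), best score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score, baseline - score) for name, score in ranked]


for name, score, gap in gaps_to_human(quoted_scores):
    print(f"{name}: {score:.0f}% ({gap:.0f} points below the human baseline)")
```

Even the best entry quoted here sits 27 percentage points below the human figure, which is the distance the article refers to.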
GAIA could become a standard for measuring progress toward AGI. By emphasizing robustness and versatility, it pushes researchers to rethink AI design beyond sheer raw power or hyper-specialized tasks.
For AI enthusiasts, it’s a fascinating tool to follow—a mirror of our expectations and a challenge for tomorrow’s machines. What do you think of this approach? Will AI ever reach GAIA’s 92%? The debate is open in our comments section!


