Large language models (LLMs) have come a long way—from just generating text to now reasoning, planning, and taking actions. As these capabilities grow, so does the interest in using LLMs as agents: autonomous systems that can interact with the world, make decisions, and complete complex tasks. From navigating websites and manipulating files to querying databases and playing games, LLM agents are becoming central to a wide range of real-world AI applications.
As these capabilities grow, so does the need to rigorously evaluate them. Benchmarks provide structured, measurable ways to assess how well LLM agents perform across different domains like web environments, tool use, planning, and reasoning.
In this blog, we’ll walk through some of the most widely used and influential benchmarks for evaluating LLM agents, breaking down what they measure, how they work, and what makes each of them unique.
LLM agents are systems powered by large language models designed to interact with external tools or systems, such as databases, websites, and games, to accomplish specific goals. They combine reasoning, planning, and execution by analyzing problems, formulating solutions, and carrying them out through actions like generating function calls, crafting API requests, or providing text instructions for simulated tasks (e.g., clicks on a website).
Potential tasks for LLM agents could include scenarios like “find the cheapest flight from New York to London,” where the agent queries a flight database and executes the solution by generating API requests to retrieve flight options. Other examples involve tasks such as “navigate a website to complete an online purchase” or “analyze data from a database to generate a report.”
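To make the function-calling pattern concrete, here is a minimal, self-contained sketch of how an agent runtime might expose a flight-search tool to a model and execute the structured call the model emits. The `search_flights` tool, its schema, and the hard-coded model response are illustrative assumptions, not tied to any particular provider's API.

```python
import json

# A hypothetical tool definition in the JSON-schema style used by most
# function-calling APIs (names and fields here are illustrative only).
SEARCH_FLIGHTS_TOOL = {
    "name": "search_flights",
    "description": "Search for flights between two cities on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}

def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Stand-in for a real flight API; returns canned results."""
    return [{"carrier": "ExampleAir", "price_usd": 420, "date": date,
             "route": f"{origin} -> {destination}"}]

# The model's side of the exchange: a structured call it might emit for the
# prompt "find the cheapest flight from New York to London".
model_call = {"name": "search_flights",
              "arguments": json.dumps({"origin": "New York",
                                       "destination": "London",
                                       "date": "2025-06-01"})}

# The agent runtime parses the call, executes the matching tool, and would
# normally feed the result back to the model for the next turn.
TOOLS = {"search_flights": search_flights}
result = TOOLS[model_call["name"]](**json.loads(model_call["arguments"]))
print(min(result, key=lambda f: f["price_usd"]))
```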
For an LLM to work effectively as an agent, it needs to have strong reasoning capabilities and the ability to generate accurate code or function calls with the correct parameters. This makes benchmarks essential for evaluating how well LLMs perform in these areas, helping us understand their strengths and limitations when acting as autonomous agents.
Benchmark | Evaluation Aspect | Description | Evaluation Method |
---|---|---|---|
AgentBench | Language understanding, Planning, Reasoning, Decision making, Tool-calling, Multi-turn | Evaluates LLM agents across various real-world agent tasks (e.g., browsing, coding, KG querying). | Success rate, reward score, F1 score, etc. |
AgentBoard | Planning, Reasoning, Tool-calling, Language understanding, Decision making, Multi-turn | Evaluates LLM agents in complex, multi-turn tasks across physical, game, web, and tool-based environments. | Step-by-step evaluation (fine-grained progress rate, task completion rate) |
Berkeley Function Calling Leaderboard Benchmark | Tool-calling | Focuses on structured function-call generation | AST matching, output correctness, API response structure |
GAIA | Reasoning, Multi-modality handling, Web browsing, Tool-calling, Multi-turn | Evaluates AI assistants’ ability to reason, browse the web, and use multiple tools across various tasks. | Quasi Exact Match |
StableToolBench | Tool-calling (multi-tool scenarios) | Evaluates tool-augmented agents in stable, reproducible virtual API settings, including multi-tool use. | Pass rate, win rate |
AgentBench, introduced in the 2023 paper “AgentBench: Evaluating LLMs as Agents”, evaluates the reasoning, decision-making, and task execution capabilities of LLMs across eight open-ended, multi-turn task environments. These environments are designed to test various aspects of agentic performance, such as language understanding, planning, and tool interaction.
The benchmark focuses on three main domains: code, games, and the web. In more detail, the eight environments are:

- Code-grounded: Operating System (OS), Database (DB), and Knowledge Graph (KG)
- Game-grounded: Digital Card Game (DCG), Lateral Thinking Puzzles (LTP), and House-Holding (based on ALFWorld)
- Web-grounded: Web Shopping (based on WebShop) and Web Browsing (based on Mind2Web)
AgentBoard was introduced in the 2024 paper, “AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents”.
Like AgentBench, it's a multi-faceted benchmark. AgentBoard covers four main domains: embodied (testing agents in simulated physical environments), game, web, and tool (evaluating the agent's use of external tools).
Unlike many other benchmarks, AgentBoard uses a fine-grained progress rate to capture step-by-step performance rather than just final outcomes. This metric standardizes evaluation across diverse task types, enabling more meaningful comparisons and averaging.
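As a rough illustration of the idea behind the fine-grained progress rate, the sketch below scores a trajectory by the best fraction of annotated subgoals satisfied at any step. The subgoal representation and the toy trajectory are assumptions made for illustration; AgentBoard's actual implementation uses per-environment subgoal matching and richer scoring.

```python
def progress_rate(subgoal_flags_per_step: list[list[bool]]) -> float:
    """Simplified progress rate: the best fraction of annotated subgoals
    satisfied at any step of the trajectory (a simplification of the real
    AgentBoard metric, which matches subgoals per environment)."""
    best = 0.0
    for flags in subgoal_flags_per_step:
        if flags:
            best = max(best, sum(flags) / len(flags))
    return best

# A 4-step trajectory over 3 annotated subgoals: the agent completes two of
# three, so the progress rate is ~0.67 even though the task is not finished.
trajectory = [
    [False, False, False],
    [True, False, False],
    [True, True, False],
    [True, True, False],
]
print(progress_rate(trajectory))  # 0.666...
print(all(trajectory[-1]))        # task completion (success) -> False
```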
Check out the AgentBoard leaderboard here.
The latest version of the Berkeley Function Calling Leaderboard (BFCL) was introduced in the 2024 blog post, “Gorilla: Large Language Model Connected with Massive APIs” by researchers from UC Berkeley.
Sourced from real-world user data, BFCL includes 4,751 tasks across function calling, REST APIs, SQL, and function relevance detection. Most tasks use Python, with some in Java and JavaScript to assess generalization. Tasks are single-turn or multi-turn, and multi-turn problems are split into base and augmented variants—the latter introducing challenges like missing parameters, missing functions, or long context.
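To give a flavor of the AST-based checking that BFCL relies on, here is a toy sketch that parses a predicted and an expected Python call and compares the function name and keyword arguments. It is a deliberately simplified stand-in for BFCL's actual checker, which is considerably more thorough (e.g., handling positional arguments and acceptable value ranges).

```python
import ast

def calls_match(predicted: str, expected: str) -> bool:
    """Toy AST check: the predicted call must use the expected function name
    and the expected keyword arguments."""
    try:
        pred = ast.parse(predicted, mode="eval").body
        gold = ast.parse(expected, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(pred, ast.Call) and isinstance(gold, ast.Call)):
        return False
    # Compare function names (assumes simple `name(...)` calls).
    if ast.dump(pred.func) != ast.dump(gold.func):
        return False
    # Compare keyword arguments by name and literal value, order-insensitive.
    pred_kwargs = {kw.arg: ast.dump(kw.value) for kw in pred.keywords}
    gold_kwargs = {kw.arg: ast.dump(kw.value) for kw in gold.keywords}
    return pred_kwargs == gold_kwargs

print(calls_match('get_weather(city="London", unit="celsius")',
                  'get_weather(unit="celsius", city="London")'))  # True
print(calls_match('get_weather(city="Paris")',
                  'get_weather(city="London")'))                  # False
```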
Check out the BFCL leaderboard here.
GAIA (A Benchmark for General AI Assistants) was introduced in the 2023 paper, “GAIA: A Benchmark for General AI Assistants” by researchers from Meta, Hugging Face, and AutoGPT. It presents real-world questions designed to test an AI assistant’s ability to reason, use tools, handle multiple modalities, and browse the web—tasks that are easy for humans but remain challenging for AI.
GAIA covers five core capabilities: web browsing, multimodal understanding (e.g., speech, video, image), code execution, diverse file reading (e.g., PDFs, Excel), and tasks solvable without tools (e.g., translations or spell-checking).
Each GAIA question has a single correct answer in a simple format (e.g., a string, number, or comma-separated list), and tasks are grouped into three difficulty levels:

- Level 1: solvable with at most one tool and no more than about 5 steps
- Level 2: requires combining multiple tools over roughly 5 to 10 steps
- Level 3: requires arbitrarily long action sequences and access to any number of tools
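Because every answer is a short string, number, or list, GAIA can score submissions with quasi-exact match. The sketch below shows one way such a scorer could work: normalize case, whitespace, and punctuation, compare numbers by value, and compare comma-separated lists element-wise. The specific normalization rules here are assumptions for illustration, not GAIA's official scorer.

```python
import re

def quasi_exact_match(prediction: str, gold: str) -> bool:
    """Rough sketch of a quasi-exact-match scorer."""
    def normalize(text: str) -> str:
        text = text.strip().lower()
        text = re.sub(r"[^\w\s.,-]", "", text)  # drop stray punctuation
        return re.sub(r"\s+", " ", text)

    p, g = normalize(prediction), normalize(gold)
    # If both look like numbers, compare their values rather than strings.
    try:
        return float(p.replace(",", "")) == float(g.replace(",", ""))
    except ValueError:
        pass
    # Comma-separated lists are compared element-wise.
    if "," in g:
        return [x.strip() for x in p.split(",")] == [x.strip() for x in g.split(",")]
    return p == g

print(quasi_exact_match(" 1,234 ", "1234"))           # True
print(quasi_exact_match("Paris, Lyon", "paris,lyon")) # True
```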
Check out the GAIA leaderboard here.
StableToolBench is an updated benchmark built upon ToolBench, introduced in the 2024 paper, “StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models” by researchers from Tsinghua University, 01.AI, Google, the University of Hong Kong, and the Jiangsu Innovation Center for Language Competence.
The benchmark employs virtual APIs within a controlled system to ensure stability and reproducibility. When an agent calls an API, the system first queries a cache of real API call responses. On a cache miss, an LLM simulates the API’s behavior, using the API’s documentation and a few real example calls to produce a plausible response.
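A minimal sketch of this cache-or-simulate flow is shown below. The cache-key scheme and the `simulate_with_llm` placeholder are assumptions made for illustration, not StableToolBench's actual implementation.

```python
import json
import hashlib

CACHE: dict[str, dict] = {}

def cache_key(api_name: str, arguments: dict) -> str:
    # Deterministic key so identical calls always hit the same cache entry.
    payload = json.dumps({"api": api_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def simulate_with_llm(api_name: str, arguments: dict) -> dict:
    """Placeholder for the LLM-backed simulator, which would condition on the
    API's documentation plus a few cached real calls."""
    return {"error": "", "response": f"simulated output of {api_name}"}

def call_virtual_api(api_name: str, arguments: dict) -> dict:
    key = cache_key(api_name, arguments)
    if key in CACHE:                                   # cache hit: replay
        return CACHE[key]
    response = simulate_with_llm(api_name, arguments)  # cache miss: simulate
    CACHE[key] = response                              # keep it reproducible
    return response

print(call_virtual_api("flight_search", {"origin": "JFK", "dest": "LHR"}))
```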
The dataset consists of instructions generated from collected APIs for both single-tool and multi-tool scenarios, assessing an LLM’s ability to interact with individual tools and combine them for complex task completion.
Evaluations on this benchmark focus on two key metrics: the pass rate, which gauges an LLM’s ability to execute an instruction within set budgets, and the win rate, which compares the quality of the LLM’s solution path to that generated by gpt-3.5-turbo.
Here are several other notable benchmarks designed to evaluate various aspects of LLM performance across different domains and task types:

- WorkBench: a benchmark dataset for agents in a realistic workplace setting
- Tau-Bench: a benchmark for evaluating LLM-based agents on multi-step reasoning
- ToolACE: focused on LLM function calling
In conclusion, the expanding landscape of LLM agent benchmarks has introduced a variety of tasks designed to test critical capabilities such as function calling, multi-step reasoning, and tool integration. These benchmarks are crucial for assessing LLM performance in different contexts and for pushing the boundaries of what these models can achieve in real-world scenarios. From simple tasks to complex, multi-domain challenges, they play a key role in guiding future advancements.
As LLMs continue to evolve, these benchmarks remain essential for providing insights into model strengths and weaknesses. They offer a clearer understanding of how well LLMs can interact with tools, databases, and diverse problem domains, ensuring that their development leads to more reliable and effective models for practical applications.
GAIA: A Benchmark for General AI Assistants
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting
Tau-Bench: A Benchmark for Evaluating LLM-based Agents for Multi-step Reasoning
AgentBench: Evaluating LLMs as Agents
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
ToolACE: Winning the Points of LLM Function Calling