
Benchmarking Performance of LLMs for Coding

R&D Team

Introduction

As AI tools continue to gain traction, more developers are turning to them for help with writing and understanding code. From generating snippets to spotting bugs and suggesting fixes, AI is becoming a valuable part of the developer’s toolkit. But with so many models available—both open-source and proprietary—it’s important to evaluate how well they actually perform on real coding tasks. Not all models are created equal, and choosing the right one can make a big difference in productivity and reliability.

In this blog, we’ll explore key evaluation datasets used to assess LLMs across a range of coding tasks—from code completion to debugging. We’ll also take a look at how top open-source and proprietary models perform, helping you identify which ones excel in different areas. Whether you’re looking to catch bugs more efficiently, improve code quality, or speed up development, understanding model performance is a critical step.

TL;DR

Among the benchmarks presented, SWE-Bench (Verified) stands out as the most practical for assessing models on everyday software engineering tasks. Unlike synthetic or narrowly scoped benchmarks, SWE-Bench is grounded in real-world GitHub issues and requires models to reason about, edit, and integrate code within large, complex codebases. Its focus on realistic patch generation, cross-file dependencies, and actual bug fixes makes it the most representative benchmark for real-world developer workflows.

When evaluated specifically on SWE-Bench (Verified), Claude 3.7 Sonnet and Deepseek R1 stood out as the strongest performers among popular proprietary and open-source models, respectively—demonstrating notable effectiveness in handling realistic software engineering tasks.

Benchmark Datasets

Numerous benchmark datasets are designed to evaluate various aspects of coding performance, including code generation, bug fixing, codebase navigation, code comprehension and explanation, and real-time interactive development.

Table 1: Overview of Some of the Most Widely Used Benchmarks.

| Benchmark | Primary Task Type | What it Evaluates | Evaluated Language(s) |
|---|---|---|---|
| HumanEval | Code Generation | Write correct standalone functions from a prompt with signature and docstring | Python |
| MBPP | Code Generation | Solve small programming problems using input/output examples | Python |
| SWE-Bench | Bug Fixing / Patching in Large Systems | Generate code edits that resolve real GitHub issues across large codebases | Python |
| LiveCodeBench | Interactive Coding & Debugging | Perform iterative code editing using compiler/test feedback loops | Python |
| MultiPL-E | Multi-language Code Generation | Write functions from descriptions in various programming languages | 18+ languages (e.g., Bash, C++, Go, Java, JavaScript, R, Racket, Ruby, Rust, TypeScript) |
| APPS | General Programming Problem Solving | Solve diverse problems with varying difficulty | Python |
| SWE-Lancer | Real-world Dev Tasks & Management | Fix bugs, implement features, and make engineering decisions | Python |
| SAFIM | Fill-in-the-Middle | Complete semantically meaningful code segments (blocks, conditions, API calls) | Python |
| HumanEvalExplain | Code Understanding & Regeneration | Explain code and regenerate it from the explanation | Python, JavaScript, Java, Go, C++, and Rust |

These benchmarks frequently appear in general LLM performance overviews that often include aspects of coding evaluation.

1. HumanEval

HumanEval is a benchmark introduced in OpenAI’s 2021 paper, “Evaluating Large Language Models Trained on Code,” developed in collaboration with engineers from OpenAI, Anthropic, and Zipline.

What it Assesses

It evaluates an LLM’s ability to write Python functions that solve specific tasks, testing language comprehension, reasoning, algorithms, and simple math.

Dataset Structure

HumanEval includes 164 handwritten programming problems. Each record contains a unique task ID, a prompt consisting of a function signature and docstring, a canonical reference solution, an entry point, and unit tests used to check correctness.

Check out the full dataset on HuggingFace.
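To make the format concrete, here is a minimal HumanEval-style problem. It is modeled on the style of the dataset's first task rather than copied from it, and the tests shown are illustrative: the model sees only the signature and docstring and must produce the body, which is then executed against the unit tests.

```python
# Illustrative HumanEval-style problem (modeled on the dataset's style, not an exact record).

# Prompt given to the model: a signature plus a docstring.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # --- model-generated completion starts here ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


# Hidden unit tests used to score the completion (pass@1 requires all to pass).
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False


check(has_close_elements)
```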

2. Mostly Basic Python Problems (MBPP)

The Mostly Basic Python Problems Dataset (MBPP) was developed by the Google Research Team and introduced in the 2021 paper, “Program Synthesis with Large Language Models”.

What it Assesses

MBPP is similar to HumanEval but differs in the formatting of the prompts. Like HumanEval, it assesses an LLM’s ability to synthesize short functional Python programs based on a description.

Dataset Structure

MBPP consists of 974 crowd-sourced Python programming problems (426 of which were hand-verified and edited by the paper’s authors), designed to be solvable by entry-level programmers.

Each dataset record includes a natural-language task description, a reference solution, and a list of assert-based test cases used to check functional correctness.

Check out the full dataset on HuggingFace.
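For comparison with HumanEval, an MBPP-style problem provides a plain-English description plus assert statements instead of a signature and docstring. The example below is paraphrased and illustrative, not an exact dataset record.

```python
# Illustrative MBPP-style problem (paraphrased, not copied from the dataset).
#
# text: "Write a function to find the shared elements from two given lists."
# test_list: the assert statements below are shown to the model in the prompt
#            and also used to grade the generated solution.

def similar_elements(a, b):
    # Model-generated solution: return the common elements as a tuple.
    return tuple(set(a) & set(b))


assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}
assert set(similar_elements((1, 2, 3, 4), (5, 4, 3, 7))) == {3, 4}
```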

3. SWE-Bench

The SWE-Bench benchmark was introduced in the 2024 paper, “SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?” and was developed by researchers at Princeton University and the University of Chicago.

What it Assesses

SWE-Bench assesses an LLM’s ability to resolve real issues and feature requests in full GitHub repositories by generating code patches. The dataset evaluates models in large, multi-file environments, requiring them to navigate complex codebases, understand interactions between files, identify errors, and produce working fixes. It simulates a realistic software engineering workflow, making it highly applicable to real-world use.

Dataset Structure

The full SWE-Bench dataset contains 2,294 unique software engineering (SE) issues sourced from GitHub. The SWE-Bench Verified dataset is a subset of the original, containing 500 samples that have been human-verified for quality. This version is typically used in technical reports evaluating LLMs.

Each dataset record includes the repository and pinned base commit, the issue text (problem statement), the reference (gold) patch and its accompanying test patch, and the sets of tests that must go from failing to passing (FAIL_TO_PASS) and remain passing (PASS_TO_PASS) for a fix to count as resolved.

Check out the full SWE-Bench Verified dataset on HuggingFace.
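The official SWE-Bench harness executes each instance inside a prepared Docker environment; the snippet below is only a minimal sketch of the core grading idea (apply the model's patch at the pinned base commit, then re-run the issue's failing tests), with illustrative function and parameter names rather than the harness's real API.

```python
# Simplified sketch of SWE-Bench-style grading. The real harness uses
# per-instance Docker environments; this is illustrative only.
import subprocess

def evaluate_instance(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """Apply the model-generated patch and re-run the tests the
    golden patch is known to fix (FAIL_TO_PASS)."""
    # Apply the predicted patch on top of the pinned base commit.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=model_patch,
        text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run only the previously failing tests; the instance counts as "resolved"
    # if they now pass (the real harness also re-checks PASS_TO_PASS tests).
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass],
        cwd=repo_dir, capture_output=True,
    )
    return tests.returncode == 0
```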

4. LiveCodeBench

LiveCodeBench was introduced in the 2024 paper “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code” by researchers from UC Berkeley, MIT and Cornell.

LiveCodeBench was developed to address the limitations of older benchmarks like MBPP and HumanEval, which are often too simple, susceptible to training-data contamination, and not representative of real-world coding scenarios, making them less useful for evaluating newer, stronger models.

What it Assesses

The benchmark evaluates four aspects of model performance: code generation, self-repair (fixing code based on error feedback), code execution (predicting the output of a given program), and test output prediction. A rough sketch of the self-repair loop appears below.
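As an illustration of the self-repair setting, the loop below regenerates a solution using test feedback. The `generate_solution` function is a hypothetical placeholder for whichever model API you use, and the harness details here are assumptions, not LiveCodeBench's actual implementation.

```python
# Rough sketch of a test-feedback repair loop in the spirit of
# LiveCodeBench's self-repair scenario. `generate_solution` is a
# hypothetical placeholder for an actual model call.
import subprocess
import tempfile

def generate_solution(problem: str, feedback: str | None = None) -> str:
    raise NotImplementedError  # call your LLM of choice here

def run_tests(code: str, test_code: str) -> subprocess.CompletedProcess:
    # Write the candidate solution plus its tests to a temp file and execute it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    return subprocess.run(["python", path], capture_output=True, text=True, timeout=30)

def solve_with_repair(problem: str, test_code: str, max_rounds: int = 3) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        code = generate_solution(problem, feedback)
        result = run_tests(code, test_code)
        if result.returncode == 0:
            return code                      # all tests passed
        feedback = result.stderr[-2000:]     # feed the error trace back to the model
    return None
```

Feeding back only the tail of the error trace keeps prompts short while preserving the most relevant failure information.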

Dataset Structure

The LiveCodeBench dataset consists of 500+ problems sourced from LeetCode, AtCoder, and Codeforces, each evaluated against a set of test cases. These problems are intended to test models across a range of coding tasks and provide a holistic evaluation of their capabilities.

Check out the full dataset on HuggingFace.

Additional Noteworthy Benchmarks

These benchmarks provide a deeper dive into evaluating various aspects of coding and tackle more complex tasks compared to the more widely known, older benchmarks.

5. MultiPL-E

MultiPL-E is a benchmark introduced in the 2023 paper, “MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation” that originally extended the MBPP and HumanEval datasets to evaluate code generation across an additional 18+ programming languages.

What it Assesses

MultiPL-E evaluates an LLM’s ability to generate code in multiple programming languages, expanding upon the tasks from MBPP and HumanEval to cover a broad range of languages. It currently supports 22 languages.

Dataset Structure

MultiPL-E consists of MBPP and HumanEval tasks across multiple programming languages, including Python, Java, JavaScript, C++, and more.

Each dataset record pairs a function prompt translated into the target language with corresponding unit tests for that language.

Check out the full dataset on HuggingFace.

6. APPS (Automated Programming Process Standard)

APPS is a benchmark introduced in the 2021 paper, “Measuring Coding Challenge Competence With APPS” by researchers at UC Berkeley, UChicago, UIUC, and Cornell.

What it Assesses

It evaluates an LLM’s ability to understand problem statements and generate correct code implementations, with problems categorized into different difficulty levels ranging from basic scripting tasks to advanced algorithmic challenges.

Dataset Structure

APPS consists of 10,000 coding problems sourced from Codewars, AtCoder, Kattis, and Codeforces, organized into three difficulty levels: introductory, interview, and competition.

Each dataset record includes a natural-language problem statement, test cases, and, for many problems, reference solutions.

Check out the full dataset on HuggingFace.

7. SWE-Lancer

SWE-Lancer is a benchmark introduced in the 2025 paper, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?” by researchers at OpenAI.

What it Assesses

It assesses an LLM’s ability to provide bug fixes and/or feature implementations and even perform managerial tasks, where models must evaluate technical implementation proposals.

The dataset consists of both individual contributor (IC) tasks and managerial tasks. IC tasks focus on creating patches or implementing features, while managerial tasks involve evaluating freelancer proposals and selecting the best one. Grading for IC tasks is based on end-to-end tests and decisions verified by experienced software engineers, while managerial tasks are assessed against the choices of the original engineering managers. The overall evaluation is determined by the percentage of tasks solved and the corresponding payout earned by the model, using real freelance rates, with a total payout of up to $1 million.

Dataset Structure

SWE-Lancer contains over 1,400 freelance software engineering tasks sourced from Upwork. Each record includes the task description, its real-world payout, and the end-to-end tests (for IC tasks) or reference proposal choices (for managerial tasks) used for grading.

Check out this GitHub repository to view the dataset.

8. SAFIM (Syntax-Aware Fill-in-the-Middle)

SAFIM is a benchmark introduced in the 2024 paper “Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks” by researchers from UC Berkeley and Meta AI.

What it Assesses

SAFIM evaluates an LLM’s ability to perform code Fill-in-the-Middle (FIM) tasks.

In these tasks, the model is given the beginning and end of a code snippet and must correctly “fill in the middle” — a challenge that requires more than simple autocomplete. Unlike traditional code completion tasks that predict the next token or line, FIM tasks require the model to understand context from both ends and generate syntactically correct, logically coherent code.

This benchmark emphasizes syntax-aware completions for critical program structures such as code blocks and conditional expressions. FIM tasks are designed to work with syntactic units rather than filling in randomly masked lines. SAFIM includes three primary subtasks: algorithmic block completion, control-flow expression completion, and API function call completion.

Evaluations are conducted by applying unit tests (execution-based testing) or checking for syntax matching against the ground truth.
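To illustrate the task format, a FIM prompt supplies a prefix and a suffix and asks the model for the missing middle. The snippet below is a hand-made illustration, not an actual SAFIM record.

```python
# Illustrative fill-in-the-middle (FIM) task, not an actual SAFIM record.

prefix = """
def count_positive(numbers):
    total = 0
    for n in numbers:
"""

suffix = """
    return total
"""

# The model must produce the missing middle so that prefix + middle + suffix
# forms a correct function, e.g. for a block-completion style subtask:
middle = """
        if n > 0:
            total += 1
"""

program = prefix + middle + suffix
exec(program)                      # assemble and define count_positive
assert count_positive([1, -2, 3, 0]) == 2
```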

Dataset Structure

SAFIM includes 17,720 examples from multiple programming languages, sourced from platforms like Codeforces and GitHub.

Each dataset record pairs the surrounding code context (prefix and suffix) with the ground-truth middle segment and the unit tests or syntax reference used for evaluation.

Check out the full dataset on HuggingFace.

9. HumanEvalExplain

HumanEvalExplain is part of HumanEvalPack, which extends the HumanEval dataset to three scenarios (code synthesis, code repair, and code explanation) across six languages (Python, JavaScript, Java, Go, C++, and Rust).

What it Assesses

HumanEvalExplain assesses an LLM’s ability to not only understand code but also explain it and then regenerate the code from its own explanation. This task involves two runs: one to generate the explanation and another to regenerate the solution based on that explanation.

This benchmark can provide insights into how a model handles tasks that convert code into text, such as explaining code, generating docstrings, or adding comments, and can help improve code clarity. A minimal sketch of the two-pass protocol follows.
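In the sketch below, `generate` is a hypothetical stand-in for a model call, and the prompts are illustrative rather than the benchmark's official templates; the real benchmark scores the regenerated code with the original HumanEval-style unit tests.

```python
# Sketch of the two-pass HumanEvalExplain protocol. `generate` is a
# hypothetical placeholder for a model call.

def generate(prompt: str) -> str:
    raise NotImplementedError  # call your LLM here

def explain_then_regenerate(reference_solution: str, signature: str) -> str:
    # Pass 1: the model explains the reference solution in natural language
    # (it never sees the code again after this step).
    explanation = generate(
        "Explain what the following function does:\n" + reference_solution
    )
    # Pass 2: a fresh call must reimplement the function from the
    # explanation and the bare signature only.
    return generate(
        "Write the function below based on this description.\n"
        f"Description: {explanation}\nSignature: {signature}"
    )
```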

Check out the full dataset on HuggingFace.

Model Evaluations

Now that we’ve discussed some of the key benchmark datasets, let’s dive into some state-of-the-art (SOTA) proprietary and open-source LLMs, covering both models designed specifically for coding and general-purpose models that can be applied to coding tasks.

For further details on these models, refer to the Appendix. You can also explore the lm-evaluation-harness to evaluate popular metrics on both pre-trained and custom models yourself.
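The harness also exposes a Python API. The sketch below assumes a recent (v0.4.x-style) version of lm-evaluation-harness; the exact task names, safety opt-ins, and the example checkpoint are assumptions you should verify against the harness documentation for your installed version.

```python
# Hedged sketch using the lm-evaluation-harness Python API (v0.4.x-style);
# task names and the required code-execution opt-in can differ between versions.
import os
import lm_eval

# Executing model-generated code locally is unsafe by design; depending on the
# harness version, an explicit opt-in (e.g. this env var or a confirm flag) is required.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

results = lm_eval.simple_evaluate(
    model="hf",                                              # HuggingFace backend
    model_args="pretrained=Qwen/Qwen2.5-Coder-7B-Instruct",  # example checkpoint (assumption)
    tasks=["humaneval"],                                     # task name may vary by version
    batch_size=8,
)
print(results["results"])
```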

Evaluation Results

This section presents reported performance metrics across the previously introduced benchmarks.

For HumanEval, MBPP, SWE-Bench (Verified), and LiveCodeBench, we report the pass@1 metric, which represents the percentage of tasks where a correct solution is generated on the first attempt. A solution is considered correct if it passes all the provided test cases for the corresponding problem.
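For reference, pass@1 is typically computed with the unbiased pass@k estimator introduced in the HumanEval paper, where n solutions are sampled per problem and c of them pass all tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    n = samples generated per problem, c = number that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With one sample per problem (n = k = 1), pass@1 reduces to the plain fraction
# of problems whose single generated solution passes every test case.
print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.30
```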

Proprietary Models

Table 2: Performance of proprietary LLMs over various datasets, measured by pass@1.

| Model | HumanEval | SWE-Bench (Verified) | LiveCodeBench |
|---|---|---|---|
| OpenAI o1 | – | 48.9% | 63.4% |
| OpenAI o1-mini | 92.4% | 41.6% | 53.8% |
| OpenAI o3-mini (high) | 97.6% | 49.3% | 74.1% |
| OpenAI 4o | 90.2% | 38.8% | 34.2% |
| OpenAI 4o-mini | 87.2% | – | 23.0% |
| Claude 3 Opus | 84.9% | 11.7% | 34.6% |
| Claude 3.7 Sonnet | 97.8% | 70.3% | – |
| Claude 3 Haiku | 75.9% | – | – |
| Google Gemini 2.5 Pro | 98.5% | 63.8% | 70.4% |
| Codestral 22B | 81.1% | – | 31.0% |
| Mistral Large 2 | 89.8% | – | 29.3% |


Figure 1: Performance of proprietary LLMs over various datasets, measured by pass@1.

Open-source Models

Table 3: Performance of open-source LLMs over various datasets, measured by pass@1.

| Model | HumanEval | MBPP | SWE-Bench (Verified) | LiveCodeBench |
|---|---|---|---|---|
| Google Gemma 3 27B | 48.8% | 65.6% | – | – |
| Google CodeGemma 7B | 44.5% | 56.2% | – | – |
| Deepseek R1 | – | – | 49.2% | 65.9% |
| Deepseek V3 | 82.6% | – | 42.0% | – |
| Deepseek Coder-V2 Instruct | 90.2% | – | 12.7% | 43.4% |
| Qwen2.5 72B Instruct | 80.4% | – | 23.8% | – |
| Qwen2.5-Coder 32B Instruct | 92.7% | 90.2% | – | 31.4% |
| CodeGeeX4-All-9B | 82.3% | 75.7% | – | – |


Figure 2: Performance of open-source LLMs over various datasets, measured by pass@1.

Evaluation Analysis

Here are the top performing models for each benchmark:

  1. HumanEval: Google Gemini 2.5 Pro (98.5%) leads the proprietary models, while Qwen2.5-Coder 32B Instruct (92.7%) is the strongest open-source model.
  2. MBPP: Among the open-source models with reported scores, Qwen2.5-Coder 32B Instruct (90.2%) comes out on top.
  3. SWE-Bench (Verified): Claude 3.7 Sonnet (70.3%) is the strongest proprietary model and Deepseek R1 (49.2%) the strongest open-source model.
  4. LiveCodeBench: OpenAI o3-mini (high) (74.1%) leads the proprietary models, while Deepseek R1 (65.9%) leads the open-source models.

Summary: Reasoning-focused models such as o3-mini (high) and Gemini 2.5 Pro dominate the function-level generation benchmarks, but on the most realistic benchmark, SWE-Bench (Verified), Claude 3.7 Sonnet and Deepseek R1 stand out among proprietary and open-source models, respectively.

Conclusion

While benchmarks like HumanEval, SWE-Bench, and LiveCodeBench offer valuable insights into model strengths across a range of coding tasks, they capture only a slice of overall performance. Real-world software development is far more complex and can depend on other factors.

It’s also important to recognize that both models and benchmarks are evolving rapidly. New datasets are regularly introduced, and existing ones are continuously refined to better mirror real-world challenges.

To stay up to date on the newest models for coding tasks, check out popular leaderboards like the BigCodeBench Leaderboard, LiveCodeBench Leaderboard, and EvalPlus Leaderboard, which track model performance across a wide range of coding tasks.

Appendix

Here’s some additional information about the models selected for evaluation.

The appendix covers the OpenAI models, Claude, Google, Deepseek, CodeGeeX4-All, Mistral, and Qwen model families.

