
Benchmarking Performance of LLMs for Coding

R&D Team

Introduction

As AI tools continue to gain traction, more developers are turning to them for help with writing and understanding code. From generating snippets to spotting bugs and suggesting fixes, AI is becoming a valuable part of the developer’s toolkit. But with so many models available—both open-source and proprietary—it’s important to evaluate how well they actually perform on real coding tasks. Not all models are created equal, and choosing the right one can make a big difference in productivity and reliability.

In this blog, we’ll explore key evaluation datasets used to assess LLMs across a range of coding tasks—from code completion to debugging. We’ll also take a look at how top open-source and proprietary models perform, helping you identify which ones excel in different areas. Whether you’re looking to catch bugs more efficiently, improve code quality, or speed up development, understanding model performance is a critical step.

TL;DR

Among the benchmarks presented, SWE-Bench (Verified) stands out as the most practical for assessing models on everyday software engineering tasks. Unlike synthetic or narrowly scoped benchmarks, SWE-Bench is grounded in real-world GitHub issues and requires models to reason about, edit, and integrate code within large, complex codebases. Its focus on realistic patch generation, cross-file dependencies, and actual bug fixes makes it the most representative benchmark for real-world developer workflows.

When evaluated specifically on SWE-Bench (Verified), Claude 3.7 Sonnet and Deepseek R1 stood out as the strongest performers among popular proprietary and open-source models, respectively—demonstrating notable effectiveness in handling realistic software engineering tasks.

Benchmark Datasets

Numerous benchmark datasets are designed to evaluate various aspects of coding performance, including code generation, bug fixing, codebase navigation, code comprehension and explanation, and real-time interactive development.

Table 1: Overview of Some of the Most Widely Used Benchmarks.

| Benchmark | Primary Task Type | What it Evaluates | Evaluated Language(s) |
|---|---|---|---|
| HumanEval | Code Generation | Write correct standalone functions from a prompt with signature and docstring | Python |
| MBPP | Code Generation | Solve small programming problems using input/output examples | Python |
| SWE-Bench | Bug Fixing / Patching in Large Systems | Generate code edits that resolve real GitHub issues across large codebases | Python |
| LiveCodeBench | Interactive Coding & Debugging | Perform iterative code editing using compiler/test feedback loops | Python |
| MultiPL-E | Multi-language Code Generation | Write functions from descriptions in various programming languages | 18+ languages (e.g., Bash, C++, Go, Java, JavaScript, R, Racket, Ruby, Rust, TypeScript) |
| APPS | General Programming Problem Solving | Solve diverse problems with varying difficulty | Python |
| SWE-Lancer | Real-world Dev Tasks & Management | Fix bugs, implement features, and make engineering decisions | Python |
| SAFIM | Fill-in-the-Middle | Complete semantically meaningful code segments (blocks, conditions, API calls) | Python |
| HumanEvalExplain | Code Understanding & Regeneration | Explain code and regenerate it from the explanation | Python, JavaScript, Java, Go, C++, and Rust |

These benchmarks frequently appear in general LLM performance overviews that often include aspects of coding evaluation.

1. HumanEval

HumanEval is a benchmark introduced in OpenAI’s 2021 paper, “Evaluating Large Language Models Trained on Code,” developed in collaboration with engineers from OpenAI, Anthropic, and Zipline.

What it Assesses

It evaluates an LLM’s ability to write Python functions that solve specific tasks, testing language comprehension, reasoning, algorithms, and simple math.

Dataset Structure

HumanEval includes 164 handwritten programming problems. Each record contains a unique task ID, a prompt consisting of a function signature and docstring, a canonical reference solution, an entry point, and unit tests used to check correctness.

Check out the full dataset on HuggingFace.
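To make the format concrete, here is a minimal HumanEval-style problem. It is modeled on the style of the dataset's first task rather than copied from it, and the tests shown are illustrative: the model sees only the signature and docstring and must produce the body, which is then executed against the unit tests.

```python
# Illustrative HumanEval-style problem (modeled on the dataset's style, not an exact record).

# Prompt given to the model: a signature plus a docstring.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # --- model-generated completion starts here ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


# Hidden unit tests used to score the completion (pass@1 requires all to pass).
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False


check(has_close_elements)
```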

2. Mostly Basic Python Problems (MBPP)

The Mostly Basic Python Problems Dataset (MBPP) was developed by the Google Research Team and introduced in the 2021 paper, “Program Synthesis with Large Language Models”.

What it Assesses

MBPP is similar to HumanEval but differs in the formatting of the prompts. Like HumanEval, it assesses an LLM’s ability to synthesize short functional Python programs based on a description.

Dataset Structure

MBPP consists of 974 crowd-sourced Python programming problems (426 of which were hand-verified and edited by the paper’s authors), designed to be solvable by entry-level programmers.

Each dataset record includes a natural-language task description, a reference solution, and a list of assert-based test cases used to check functional correctness.

Check out the full dataset on HuggingFace.
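For comparison with HumanEval, an MBPP-style problem provides a plain-English description plus assert statements instead of a signature and docstring. The example below is paraphrased and illustrative, not an exact dataset record.

```python
# Illustrative MBPP-style problem (paraphrased, not copied from the dataset).
#
# text: "Write a function to find the shared elements from two given lists."
# test_list: the assert statements below are shown to the model in the prompt
#            and also used to grade the generated solution.

def similar_elements(a, b):
    # Model-generated solution: return the common elements as a tuple.
    return tuple(set(a) & set(b))


assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}
assert set(similar_elements((1, 2, 3, 4), (5, 4, 3, 7))) == {3, 4}
```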

3. SWE-Bench

The SWE-Bench benchmark was introduced in the 2024 paper, “SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?” and was developed by researchers at Princeton University and the University of Chicago.

What it Assesses

SWE-Bench assesses an LLM’s ability to resolve real issues and feature requests in full GitHub repositories by generating code patches. The dataset evaluates models in large, multi-file environments, requiring them to navigate complex codebases, understand interactions between files, identify errors, and produce working fixes. It simulates a realistic software engineering workflow, making it highly applicable to real-world use.

Dataset Structure

The full SWE-Bench dataset contains 2,294 unique software engineering (SE) issues sourced from GitHub. The SWE-Bench Verified dataset is a subset of the original, containing 500 samples that have been human-verified for quality. This version is typically used in technical reports evaluating LLMs.

Each dataset record includes the repository and pinned base commit, the issue text (problem statement), the reference (gold) patch and its accompanying test patch, and the sets of tests that must go from failing to passing (FAIL_TO_PASS) and remain passing (PASS_TO_PASS) for a fix to count as resolved.

Check out the full SWE-Bench Verified dataset on HuggingFace.
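The official SWE-Bench harness executes each instance inside a prepared Docker environment; the snippet below is only a minimal sketch of the core grading idea (apply the model's patch at the pinned base commit, then re-run the issue's failing tests), with illustrative function and parameter names rather than the harness's real API.

```python
# Simplified sketch of SWE-Bench-style grading. The real harness uses
# per-instance Docker environments; this is illustrative only.
import subprocess

def evaluate_instance(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """Apply the model-generated patch and re-run the tests the
    golden patch is known to fix (FAIL_TO_PASS)."""
    # Apply the predicted patch on top of the pinned base commit.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=model_patch,
        text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run only the previously failing tests; the instance counts as "resolved"
    # if they now pass (the real harness also re-checks PASS_TO_PASS tests).
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass],
        cwd=repo_dir, capture_output=True,
    )
    return tests.returncode == 0
```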

4. LiveCodeBench

LiveCodeBench was introduced in the 2024 paper “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code” by researchers from UC Berkeley, MIT and Cornell.

LiveCodeBench was developed to address the limitations of older benchmarks like MBPP and HumanEval, which are often too simple, susceptible to training-data contamination, and not representative of real-world coding scenarios, making them less useful for evaluating newer, stronger models.

What it Assesses

The benchmark evaluates four aspects of model performance: code generation, self-repair (fixing code based on error feedback), code execution (predicting the output of a given program), and test output prediction. A rough sketch of the self-repair loop appears below.
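As an illustration of the self-repair setting, the loop below regenerates a solution using test feedback. The `generate_solution` function is a hypothetical placeholder for whichever model API you use, and the harness details here are assumptions, not LiveCodeBench's actual implementation.

```python
# Rough sketch of a test-feedback repair loop in the spirit of
# LiveCodeBench's self-repair scenario. `generate_solution` is a
# hypothetical placeholder for an actual model call.
import subprocess
import tempfile

def generate_solution(problem: str, feedback: str | None = None) -> str:
    raise NotImplementedError  # call your LLM of choice here

def run_tests(code: str, test_code: str) -> subprocess.CompletedProcess:
    # Write the candidate solution plus its tests to a temp file and execute it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    return subprocess.run(["python", path], capture_output=True, text=True, timeout=30)

def solve_with_repair(problem: str, test_code: str, max_rounds: int = 3) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        code = generate_solution(problem, feedback)
        result = run_tests(code, test_code)
        if result.returncode == 0:
            return code                      # all tests passed
        feedback = result.stderr[-2000:]     # feed the error trace back to the model
    return None
```

Feeding back only the tail of the error trace keeps prompts short while preserving the most relevant failure information.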

Dataset Structure

The LiveCodeBench dataset consists of 500+ problems sourced from LeetCode, AtCoder, and Codeforces, each evaluated against a set of test cases. These problems are intended to test models across a range of coding tasks and provide a holistic evaluation of their capabilities.

Check out the full dataset on HuggingFace.

Additional Noteworthy Benchmarks

These benchmarks provide a deeper dive into evaluating various aspects of coding and tackle more complex tasks compared to the more widely known, older benchmarks.

5. MultiPL-E

MultiPL-E is a benchmark introduced in the 2023 paper, “MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation” that originally extended the MBPP and HumanEval datasets to evaluate code generation across an additional 18+ programming languages.

What it Assesses

MultiPL-E evaluates an LLM’s ability to generate code in multiple programming languages, expanding upon the tasks from MBPP and HumanEval to cover a broad range of languages. It currently supports 22 languages.

Dataset Structure

MultiPL-E consists of MBPP and HumanEval tasks across multiple programming languages, including Python, Java, JavaScript, C++, and more.

Each dataset record pairs a function prompt translated into the target language with corresponding unit tests for that language.

Check out the full dataset on HuggingFace.

6. APPS (Automated Programming Process Standard)

APPS is a benchmark introduced in the 2021 paper, “Measuring Coding Challenge Competence With APPS” by researchers at UC Berkeley, UChicago, UIUC, and Cornell.

What it Assesses

It evaluates an LLM’s ability to understand problem statements and generate correct code implementations, with problems categorized into different difficulty levels ranging from basic scripting tasks to advanced algorithmic challenges.

Dataset Structure

APPS consists of 10,000 coding problems sourced from Codewars, AtCoder, Kattis, and Codeforces, organized into three difficulty levels: introductory, interview, and competition.

Each dataset record includes a natural-language problem statement, test cases, and, for many problems, reference solutions.

Check out the full dataset on HuggingFace.

7. SWE-Lancer

SWE-Lancer is a benchmark introduced in the 2025 paper, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?” by researchers at OpenAI.

What it Assesses

It assesses an LLM’s ability to provide bug fixes and/or feature implementations and even perform managerial tasks, where models must evaluate technical implementation proposals.

The dataset consists of both individual contributor (IC) tasks and managerial tasks. IC tasks focus on creating patches or implementing features, while managerial tasks involve evaluating freelancer proposals and selecting the best one. Grading for IC tasks is based on end-to-end tests and decisions verified by experienced software engineers, while managerial tasks are assessed against the choices of the original engineering managers. The overall evaluation is determined by the percentage of tasks solved and the corresponding payout earned by the model, using real freelance rates, with a total payout of up to $1 million.

Dataset Structure

SWE-Lancer contains over 1,400 freelance software engineering tasks sourced from Upwork. Each record includes the task description, its real-world payout, and the end-to-end tests (for IC tasks) or reference proposal choices (for managerial tasks) used for grading.

Check out this GitHub repository to view the dataset.

8. SAFIM (Syntax-Aware Fill-in-the-Middle)

SAFIM is a benchmark introduced in the 2024 paper “Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks” by researchers from UC Berkeley and Meta AI.

What it Assesses

SAFIM evaluates an LLM’s ability to perform code Fill-in-the-Middle (FIM) tasks.

In these tasks, the model is given the beginning and end of a code snippet and must correctly “fill in the middle” — a challenge that requires more than simple autocomplete. Unlike traditional code completion tasks that predict the next token or line, FIM tasks require the model to understand context from both ends and generate syntactically correct, logically coherent code.

This benchmark emphasizes syntax-aware completions for critical program structures such as code blocks and conditional expressions. FIM tasks are designed to work with syntactic units rather than filling in randomly masked lines. SAFIM includes three primary subtasks: algorithmic block completion, control-flow expression completion, and API function call completion.

Evaluations are conducted by applying unit tests (execution-based testing) or checking for syntax matching against the ground truth.
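To illustrate the task format, a FIM prompt supplies a prefix and a suffix and asks the model for the missing middle. The snippet below is a hand-made illustration, not an actual SAFIM record.

```python
# Illustrative fill-in-the-middle (FIM) task, not an actual SAFIM record.

prefix = """
def count_positive(numbers):
    total = 0
    for n in numbers:
"""

suffix = """
    return total
"""

# The model must produce the missing middle so that prefix + middle + suffix
# forms a correct function, e.g. for a block-completion style subtask:
middle = """
        if n > 0:
            total += 1
"""

program = prefix + middle + suffix
exec(program)                      # assemble and define count_positive
assert count_positive([1, -2, 3, 0]) == 2
```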

Dataset Structure

SAFIM includes 17,720 examples from multiple programming languages, sourced from platforms like Codeforces and GitHub.

Each dataset record pairs the surrounding code context (prefix and suffix) with the ground-truth middle segment and the unit tests or syntax reference used for evaluation.

Check out the full dataset on HuggingFace.

9. HumanEvalExplain

HumanEvalExplain is part of HumanEvalPack, which extends the HumanEval dataset to three scenarios (code synthesis, code repair, and code explanation) across six languages (Python, JavaScript, Java, Go, C++, and Rust).

What it Assesses

HumanEvalExplain assesses an LLM’s ability to not only understand code but also explain it and then regenerate the code from its own explanation. This task involves two runs: one to generate the explanation and another to regenerate the solution based on that explanation.

This benchmark can provide insights into how a model handles tasks that convert code into text, such as explaining code, generating docstrings, or adding comments, and can help improve code clarity. A minimal sketch of the two-pass protocol follows.
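In the sketch below, `generate` is a hypothetical stand-in for a model call, and the prompts are illustrative rather than the benchmark's official templates; the real benchmark scores the regenerated code with the original HumanEval-style unit tests.

```python
# Sketch of the two-pass HumanEvalExplain protocol. `generate` is a
# hypothetical placeholder for a model call.

def generate(prompt: str) -> str:
    raise NotImplementedError  # call your LLM here

def explain_then_regenerate(reference_solution: str, signature: str) -> str:
    # Pass 1: the model explains the reference solution in natural language
    # (it never sees the code again after this step).
    explanation = generate(
        "Explain what the following function does:\n" + reference_solution
    )
    # Pass 2: a fresh call must reimplement the function from the
    # explanation and the bare signature only.
    return generate(
        "Write the function below based on this description.\n"
        f"Description: {explanation}\nSignature: {signature}"
    )
```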

Check out the full dataset on HuggingFace.

Model Evaluations

Now that we’ve discussed some of the key benchmark datasets, let’s dive into some state-of-the-art (SOTA) proprietary and open-source LLMs, covering both models designed specifically for coding and general-purpose models that can be applied to coding tasks.

For further details on these models, refer to the Appendix. You can also explore the lm-evaluation-harness to evaluate popular metrics on both pre-trained and custom models yourself.
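The harness also exposes a Python API. The sketch below assumes a recent (v0.4.x-style) version of lm-evaluation-harness; the exact task names, safety opt-ins, and the example checkpoint are assumptions you should verify against the harness documentation for your installed version.

```python
# Hedged sketch using the lm-evaluation-harness Python API (v0.4.x-style);
# task names and the required code-execution opt-in can differ between versions.
import os
import lm_eval

# Executing model-generated code locally is unsafe by design; depending on the
# harness version, an explicit opt-in (e.g. this env var or a confirm flag) is required.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

results = lm_eval.simple_evaluate(
    model="hf",                                              # HuggingFace backend
    model_args="pretrained=Qwen/Qwen2.5-Coder-7B-Instruct",  # example checkpoint (assumption)
    tasks=["humaneval"],                                     # task name may vary by version
    batch_size=8,
)
print(results["results"])
```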

Evaluation Results

This section presents reported performance metrics across the previously introduced benchmarks.

For HumanEval, MBPP, SWE-Bench (Verified), and LiveCodeBench, we report the pass@1 metric, which represents the percentage of tasks where a correct solution is generated on the first attempt. A solution is considered correct if it passes all the provided test cases for the corresponding problem.
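For reference, pass@1 is typically computed with the unbiased pass@k estimator introduced in the HumanEval paper, where n solutions are sampled per problem and c of them pass all tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    n = samples generated per problem, c = number that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With one sample per problem (n = k = 1), pass@1 reduces to the plain fraction
# of problems whose single generated solution passes every test case.
print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.30
```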

Proprietary Models

Table 2: Performance of proprietary LLMs over various datasets, measured by pass@1.

| Model | HumanEval | SWE-Bench (Verified) | LiveCodeBench |
|---|---|---|---|
| OpenAI o1 | – | 48.9% | 63.4% |
| OpenAI o1-mini | 92.4% | 41.6% | 53.8% |
| OpenAI o3-mini (high) | 97.6% | 49.3% | 74.1% |
| OpenAI 4o | 90.2% | 38.8% | 34.2% |
| OpenAI 4o-mini | 87.2% | – | 23.0% |
| Claude 3 Opus | 84.9% | 11.7% | 34.6% |
| Claude 3.7 Sonnet | 97.8% | 70.3% | – |
| Claude 3 Haiku | 75.9% | – | – |
| Google Gemini 2.5 Pro | 98.5% | 63.8% | 70.4% |
| Codestral 22B | 81.1% | – | 31.0% |
| Mistral Large 2 | 89.8% | – | 29.3% |


Figure 1: Performance of proprietary LLMs over various datasets, measured by pass@1.

Open-source Models

Table 3: Performance of open-source LLMs over various datasets, measured by pass@1.

| Model | HumanEval | MBPP | SWE-Bench (Verified) | LiveCodeBench |
|---|---|---|---|---|
| Google Gemma 3 27B | 48.8% | 65.6% | – | – |
| Google CodeGemma 7B | 44.5% | 56.2% | – | – |
| Deepseek R1 | – | – | 49.2% | 65.9% |
| Deepseek V3 | 82.6% | – | 42.0% | – |
| Deepseek Coder-V2 Instruct | 90.2% | – | 12.7% | 43.4% |
| Qwen2.5 72B Instruct | 80.4% | – | 23.8% | – |
| Qwen2.5-Coder 32B Instruct | 92.7% | 90.2% | – | 31.4% |
| CodeGeeX4-All-9B | 82.3% | 75.7% | – | – |


Figure 2: Performance of open-source LLMs over various datasets, measured by pass@1.

Evaluation Analysis

Here are the top performing models for each benchmark:

  1. HumanEval: Google Gemini 2.5 Pro (98.5%) leads the proprietary models, while Qwen2.5-Coder 32B Instruct (92.7%) is the strongest open-source model.
  2. MBPP: Among the open-source models with reported scores, Qwen2.5-Coder 32B Instruct (90.2%) comes out on top.
  3. SWE-Bench (Verified): Claude 3.7 Sonnet (70.3%) is the strongest proprietary model and Deepseek R1 (49.2%) the strongest open-source model.
  4. LiveCodeBench: OpenAI o3-mini (high) (74.1%) leads the proprietary models, while Deepseek R1 (65.9%) leads the open-source models.

Summary: Reasoning-focused models such as o3-mini (high) and Gemini 2.5 Pro dominate the function-level generation benchmarks, but on the most realistic benchmark, SWE-Bench (Verified), Claude 3.7 Sonnet and Deepseek R1 stand out among proprietary and open-source models, respectively.

Conclusion

While benchmarks like HumanEval, SWE-Bench, and LiveCodeBench offer valuable insights into model strengths across a range of coding tasks, they capture only a slice of overall performance. Real-world software development is far more complex and can depend on other factors.

It’s also important to recognize that both models and benchmarks are evolving rapidly. New datasets are regularly introduced, and existing ones are continuously refined to better mirror real-world challenges.

To stay up to date on the newest models for coding tasks, check out popular leaderboards like the BigCodeBench Leaderboard, LiveCodeBench Leaderboard, and EvalPlus Leaderboard, which track model performance across a wide range of coding tasks.

Appendix

Here’s some additional information about the models selected for evaluation.

The appendix covers the OpenAI models, Claude, Google, Deepseek, CodeGeeX4-All, Mistral, and Qwen model families.

