
Evaluating LLM Proofreading Capabilities

R&D Team

Introduction

Large Language Models (LLMs) are increasingly used to generate content, from blog posts and emails to research summaries and technical reports. But writing with AI isn’t just about generating first drafts. A responsible and effective use of LLMs also includes reviewing and editing that output for accuracy, clarity, and tone.

Increasingly, users are turning to LLMs themselves to act as proofreaders, asking one model to review what another has written. But how reliable are these models at spotting real issues and offering meaningful, high-quality suggestions?

In this post, we evaluate the performance of popular LLMs specifically as proofreaders. We share our testing methodology, compare model outputs, and highlight where each excels, or falls short, when it comes to reviewing and refining AI-generated content.

TL;DR

Most of the evaluated models reliably identified and described writing issues, with description recall@1 scores in the 80% range. The open-source Llama-4-Maverick scored highest (85.46% description recall@1), with Claude 3.7 Sonnet, gpt-4.1 and gpt-4o close behind, while the Meta llama 3.3 turbo variants trailed the field.

Models Chosen

For the evaluations, we selected a mix of proprietary and open-source language models.

Proprietary models:

  - gpt-4.1, gpt-4.1-mini, gpt-4.1-nano
  - gpt-4o, gpt-4o-mini
  - Claude 3.7 Sonnet, Claude 3.5 Haiku
  - Gemini 2.0 Flash, Gemini 2.0 Flash Lite, Gemini 1.5 Pro

Open-source models:

  - Llama-4-Maverick, Llama-4-Scout
  - Deepseek V3, Deepseek R1
  - Meta llama 3.3 turbo, Meta llama 3.3 turbo free

Evaluating Models

The Dataset

Dataset Creation

To evaluate the models, we generated a synthetic dataset of 400 sentence entries using GPT-4o, each paired with identified writing issues.

To ensure the issue identifications were accurate, the entries were proofread and refined by GPT-4.1.

We also made a few manual adjustments, most notably limiting each record to at most one issue (an example of why this was necessary is discussed under “One issue, multiple issue types” below). A rough sketch of the generation step follows.
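The post doesn’t include the generation code itself, but a minimal sketch of this step might look like the code below. It assumes the OpenAI Python SDK (openai>=1.0) with an API key in the environment; the prompt wording, the generate_entry helper, and the entry count are illustrative assumptions, not the actual setup, and the GPT-4.1 refinement pass would follow the same pattern with a review-style prompt.

# Minimal sketch of the dataset-generation step (assumptions noted above).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the actual prompt used for the dataset is not published.
GENERATION_PROMPT = (
    "Write one short business-style sentence that contains at most one writing "
    "issue (e.g. Clarity, Grammar & Spelling, Punctuation), or no issue at all. "
    "Return JSON with the fields: text, issues (type, description, start, end, issue_text)."
)

def generate_entry(entry_id: int) -> dict:
    """Ask GPT-4o for one synthetic sentence plus its annotated issue(s)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
        response_format={"type": "json_object"},
    )
    entry = json.loads(response.choices[0].message.content)
    entry["id"] = entry_id
    return entry

if __name__ == "__main__":
    # The post generated 400 entries; 10 here to keep the sketch cheap to run.
    dataset = {"dataset": [generate_entry(i) for i in range(1, 11)]}
    print(json.dumps(dataset, indent=2))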

Dataset Format

Each dataset entry includes the following fields:

  - id: a unique identifier for the entry
  - text: the sentence(s) to be proofread
  - issues: a list containing at most one issue, where each issue records a type, a description, start and end character indices, and the issue_text span it refers to

Note: For the evaluations, we focus only on type, description, and issue_text, as LLMs struggled to reliably predict character indices.

Issue Types

Each sentence contains either one issue or none.

When present, the issue falls into one of six categories:

Example Dataset Record

{
  "dataset": [
    {
      "id": 1,
      "text": "The team is working on finishing the project by end of next week. This is critical for our requirement.",
      "issues": [
        {
          "type": "Clarity",
          "description": "The phrase 'by end of next week' is awkward and can be clarified, for example as 'by the end of next week'.",
          "start": 42,
          "end": 59,
          "issue_text": "by end of next week"
        }
      ]
    },
		...
    {
      "id": 399,
      "text": "Leveraging big data can lead to better decision making. This is important.",
      "issues": [
        {
          "type": "Clarity",
          "description": "The phrase 'This is important' is vague; it should specify what exactly is important for improved clarity.",
          "start": 45,
          "end": 62,
          "issue_text": "This is important"
        }
      ]
    }
  ]
}
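To work with records in this format, a small loader is enough. The sketch below assumes the JSON above is saved as proofreading_dataset.json (an assumed file name) and simply counts how often each issue type appears.

# Sketch: load the dataset file and summarize issue types per record.
import json
from collections import Counter

with open("proofreading_dataset.json", encoding="utf-8") as f:
    records = json.load(f)["dataset"]

# Each record has at most one issue, so this doubles as a per-category record count.
type_counts = Counter(
    issue["type"] for record in records for issue in record["issues"]
)

print(f"{len(records)} records loaded")
for issue_type, count in type_counts.most_common():
    print(f"  {issue_type}: {count}")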

Evaluation

Criteria

Three fields were assessed during the evaluation:

  - type: the category assigned to the issue
  - description: the explanation of what the issue is
  - issue_text: the span of text the issue refers to

Among these criteria, description was given the most emphasis in the results analysis, as it most directly captures what the issue actually is.
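The exact prompt given to the evaluated models isn’t reproduced in the post. The sketch below shows one way such a prompt could request exactly these three fields; the wording, the template name, and the requested JSON shape are assumptions for illustration.

# Sketch of a proofreading prompt asking a model for the three assessed fields.
PROOFREAD_PROMPT_TEMPLATE = """You are a proofreader. Review the sentence below.
If it contains writing issues, return a JSON list with one object per issue,
ordered from most to least important, where each object has:
  - "type": the issue category (e.g. Clarity, Grammar & Spelling, Punctuation)
  - "description": a short explanation of the issue
  - "issue_text": the exact span of text the issue refers to
If there is no issue, return an empty list.

Sentence: {sentence}
"""

def build_prompt(sentence: str) -> str:
    """Fill the template with the record's text before sending it to a model."""
    return PROOFREAD_PROMPT_TEMPLATE.format(sentence=sentence)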

Metric used: recall@1

We use the recall@1 metric to report the evaluation results: only the first issue identified by the LLM is compared against the issue in the evaluation dataset record (each record contains at most one issue).
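As a rough illustration, a minimal sketch of this computation is shown below. It assumes model predictions are keyed by record id; how a “match” on free-text fields such as description was actually judged isn’t detailed in the post, so the match function here is a simple placeholder you would swap for your own comparison (for example, an LLM judge).

# Sketch of recall@1: only the first issue a model returns is compared against
# the (at most one) reference issue per record. Assumptions noted above.
from typing import Callable

def fields_match(predicted: str, reference: str) -> bool:
    """Placeholder match: case-insensitive exact comparison."""
    return predicted.strip().lower() == reference.strip().lower()

def recall_at_1(records: list, predictions: dict, field: str,
                match: Callable[[str, str], bool] = fields_match) -> float:
    """predictions[record_id] is the list of issues a model returned for that record."""
    hits, total = 0, 0
    for record in records:
        if not record["issues"]:
            # Records with no reference issue are skipped in this sketch; the post
            # does not specify how they were scored.
            continue
        total += 1
        predicted = predictions.get(record["id"], [])
        if predicted and match(predicted[0][field], record["issues"][0][field]):
            hits += 1
    return hits / total if total else 0.0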

Results

Table 1: Recall@1 rates of LLM models evaluated

Model                      | Proprietary / Open-source | Description recall | Issue text recall | Issue type recall | Average recall
Llama-4-Maverick           | Open-source               | 0.8546             | 0.6291            | 0.7469            | 0.7435
Claude 3.7 Sonnet          | Proprietary               | 0.8471             | 0.5714            | 0.7444            | 0.7210
gpt-4.1                    | Proprietary               | 0.8421             | 0.6566            | 0.7343            | 0.7443
gpt-4o                     | Proprietary               | 0.8421             | 0.5789            | 0.7268            | 0.7159
gpt-4.1-mini               | Proprietary               | 0.8371             | 0.6165            | 0.7368            | 0.7301
Deepseek V3                | Open-source               | 0.8271             | 0.5714            | 0.7494            | 0.7160
Claude 3.5 Haiku           | Proprietary               | 0.8271             | 0.5940            | 0.7218            | 0.7143
Deepseek R1                | Open-source               | 0.8195             | 0.5639            | 0.7293            | 0.7042
gpt-4o-mini                | Proprietary               | 0.8070             | 0.6115            | 0.7143            | 0.7109
gpt-4.1-nano               | Proprietary               | 0.7920             | 0.6190            | 0.6215            | 0.6775
Llama-4-Scout              | Open-source               | 0.7594             | 0.6216            | 0.6366            | 0.6725
Gemini 2.0 Flash Lite      | Proprietary               | 0.7268             | 0.5338            | 0.6892            | 0.6499
Gemini 2.0 Flash           | Proprietary               | 0.7168             | 0.5138            | 0.6867            | 0.6391
Gemini 1.5 Pro             | Proprietary               | 0.7018             | 0.5388            | 0.6842            | 0.6416
Meta llama 3.3 turbo       | Open-source               | 0.5840             | 0.5815            | 0.5238            | 0.5631
Meta llama 3.3 turbo free  | Open-source               | 0.5789             | 0.5815            | 0.5188            | 0.5597

Figure 1: Recall@1 rates of LLM models evaluated


Analysis

Observed results

We placed the most emphasis on description recall, followed by issue_text recall, and finally issue type recall. We ranked the criteria this way because models sometimes report the full sentence as the issue text rather than pinpointing the most important word(s) within it.

Additionally, certain issue types may overlap: some Punctuation errors, for example, could also be classified as Grammar & Spelling issues. Because of this flexibility in what counts as a “correct” issue type, we prioritized the most relevant and specific issue for each record.

From the reported metrics, it’s clear that the open-source LLM, Llama-4-Maverick, performed the strongest at describing identified issues. Scoring closely behind were the proprietary models Claude 3.7 Sonnet, gpt-4.1 and gpt-4o.

The worst performers in this task were Meta Llama 3.3 turbo and Meta Llama 3.3 turbo free.

Evaluation Considerations & Limitations

  1. Factual Accuracy Metric

While this issue doesn’t necessarily affect the evaluation dataset we created (which consists primarily of general natural language), it’s important to note that the factual accuracy metric can be flawed, especially for highly specialized subjects or information that requires additional research.

For example, if you were to ask an LLM like gpt-4o or the latest gpt-4.1 to identify issues in the following sentence:

“In the realm of evaluating large language models (LLMs) for coding tasks, the MBPP (Multi-step Benchmark for Programming Problems) stands out due to its unique focus on multi-step programming challenges.”

These models might return no issues. At first glance the sentence seems fine, but a quick Google search reveals that MBPP stands for “Mostly Basic Python Programming,” not “Multi-step Benchmark for Programming Problems.”

This underscores the challenge LLMs face in evaluating information that depends on up-to-date or highly technical knowledge. Without access to external tools like search engines or specialized databases, these models can miss or misidentify factual inaccuracies.

  2. One issue, multiple issue types

During the process of limiting entries to one issue per record, an interesting case was observed. Consider the following record:

{
      "id": 386,
      "text": "Utilizing cutting edge technology it’s imperative for success.",
      "issues": [
        {
          "type": "Clarity",
          "description": "The sentence structure is unclear; it seems to be missing a verb linking 'Utilizing cutting edge technology' to the predicate. Consider rephrasing for clarity, such as: 'Utilizing cutting-edge technology is imperative for success.'",
          "start": 0,
          "end": 48,
          "issue_text": "Utilizing cutting edge technology it’s imperative for success."
        },
        {
          "type": "Grammar & Spelling",
          "description": "The sentence lacks proper grammatical structure; it is a fused sentence without a main verb in the first clause. A verb like 'is' is needed to make it grammatically correct.",
          "start": 0,
          "end": 48,
          "issue_text": "Utilizing cutting edge technology it’s imperative for success."
        }
      ]
    },

In this case, both issues were relevant to the text. Both observations point to the lack of the verb “is,” which affects both the clarity and grammatical structure of the sentence.

This example highlights that the identification of issue types isn’t always completely clear-cut. Multiple issue types may be applicable to a single issue, and in such cases, scoring the descriptions of the issues becomes particularly important.

Conclusion

The evaluations show that most of the tested LLMs performed well, with description recall@1 scores in the 80% range. Llama-4-Maverick stands out as the top performer, achieving a description recall@1 of 85.46%, with Claude 3.7 Sonnet, gpt-4.1 and gpt-4o following closely behind.
