
Evaluating LLM Proofreading Capabilities

R&D Team

Introduction

Large Language Models (LLMs) are increasingly used to generate content, from blog posts and emails to research summaries and technical reports. But writing with AI isn’t just about generating first drafts. A responsible and effective use of LLMs also includes reviewing and editing that output for accuracy, clarity, and tone.

Increasingly, users are turning to LLMs themselves to act as proofreaders, asking one model to review what another has written. But how reliable are these models at spotting real issues and offering meaningful, high-quality suggestions?

In this post, we evaluate the performance of popular LLMs specifically as proofreaders. We share our testing methodology, compare model outputs, and highlight where each excels, or falls short, when it comes to reviewing and refining AI-generated content.

TL;DR

Most of the evaluated models reliably identified and described writing issues, with description recall@1 scores in the 80% range. The open-source Llama-4-Maverick scored highest (85.46% description recall@1), with Claude 3.7 Sonnet, gpt-4.1 and gpt-4o close behind, while the Meta llama 3.3 turbo variants trailed the field.

Models Chosen

For the evaluations, we selected a mix of proprietary and open-source language models.

Proprietary models:

  - gpt-4.1, gpt-4.1-mini, gpt-4.1-nano
  - gpt-4o, gpt-4o-mini
  - Claude 3.7 Sonnet, Claude 3.5 Haiku
  - Gemini 2.0 Flash, Gemini 2.0 Flash Lite, Gemini 1.5 Pro

Open-source models:

  - Llama-4-Maverick, Llama-4-Scout
  - Deepseek V3, Deepseek R1
  - Meta llama 3.3 turbo, Meta llama 3.3 turbo free

Evaluating Models

The Dataset

Dataset Creation

To evaluate the models, we generated a synthetic dataset of 400 sentence entries using GPT-4o, each paired with identified writing issues.

To ensure the issue identifications were accurate, the entries were proofread and refined by GPT-4.1.

We also made a few manual adjustments, most notably limiting each record to at most one issue (an example of why this was necessary is discussed under “One issue, multiple issue types” below). A rough sketch of the generation step follows.
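The post doesn’t include the generation code itself, but a minimal sketch of this step might look like the code below. It assumes the OpenAI Python SDK (openai>=1.0) with an API key in the environment; the prompt wording, the generate_entry helper, and the entry count are illustrative assumptions, not the actual setup, and the GPT-4.1 refinement pass would follow the same pattern with a review-style prompt.

# Minimal sketch of the dataset-generation step (assumptions noted above).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the actual prompt used for the dataset is not published.
GENERATION_PROMPT = (
    "Write one short business-style sentence that contains at most one writing "
    "issue (e.g. Clarity, Grammar & Spelling, Punctuation), or no issue at all. "
    "Return JSON with the fields: text, issues (type, description, start, end, issue_text)."
)

def generate_entry(entry_id: int) -> dict:
    """Ask GPT-4o for one synthetic sentence plus its annotated issue(s)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": GENERATION_PROMPT}],
        response_format={"type": "json_object"},
    )
    entry = json.loads(response.choices[0].message.content)
    entry["id"] = entry_id
    return entry

if __name__ == "__main__":
    # The post generated 400 entries; 10 here to keep the sketch cheap to run.
    dataset = {"dataset": [generate_entry(i) for i in range(1, 11)]}
    print(json.dumps(dataset, indent=2))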

Dataset Format

Each dataset entry includes the following fields:

  - id: a unique identifier for the entry
  - text: the sentence(s) to be proofread
  - issues: a list containing at most one issue, where each issue records a type, a description, start and end character indices, and the issue_text span it refers to

Note: For the evaluations, we focus only on type, description, and issue_text, as LLMs struggled to reliably predict character indices.

Issue Types

Each sentence contains either one issue or none.

When present, the issue falls into one of six categories:

Example Dataset Record

{
  "dataset": [
    {
      "id": 1,
      "text": "The team is working on finishing the project by end of next week. This is critical for our requirement.",
      "issues": [
        {
          "type": "Clarity",
          "description": "The phrase 'by end of next week' is awkward and can be clarified, for example as 'by the end of next week'.",
          "start": 42,
          "end": 59,
          "issue_text": "by end of next week"
        }
      ]
    },
		...
    {
      "id": 399,
      "text": "Leveraging big data can lead to better decision making. This is important.",
      "issues": [
        {
          "type": "Clarity",
          "description": "The phrase 'This is important' is vague; it should specify what exactly is important for improved clarity.",
          "start": 45,
          "end": 62,
          "issue_text": "This is important"
        }
      ]
    }
  ]
}
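To work with records in this format, a small loader is enough. The sketch below assumes the JSON above is saved as proofreading_dataset.json (an assumed file name) and simply counts how often each issue type appears.

# Sketch: load the dataset file and summarize issue types per record.
import json
from collections import Counter

with open("proofreading_dataset.json", encoding="utf-8") as f:
    records = json.load(f)["dataset"]

# Each record has at most one issue, so this doubles as a per-category record count.
type_counts = Counter(
    issue["type"] for record in records for issue in record["issues"]
)

print(f"{len(records)} records loaded")
for issue_type, count in type_counts.most_common():
    print(f"  {issue_type}: {count}")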

Evaluation

Criteria

Three fields were assessed during the evaluation:

  - type: the category assigned to the issue
  - description: the explanation of what the issue is
  - issue_text: the span of text the issue refers to

Among these criteria, description was given the most emphasis in the results analysis, as it most directly captures what the issue actually is.
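The exact prompt given to the evaluated models isn’t reproduced in the post. The sketch below shows one way such a prompt could request exactly these three fields; the wording, the template name, and the requested JSON shape are assumptions for illustration.

# Sketch of a proofreading prompt asking a model for the three assessed fields.
PROOFREAD_PROMPT_TEMPLATE = """You are a proofreader. Review the sentence below.
If it contains writing issues, return a JSON list with one object per issue,
ordered from most to least important, where each object has:
  - "type": the issue category (e.g. Clarity, Grammar & Spelling, Punctuation)
  - "description": a short explanation of the issue
  - "issue_text": the exact span of text the issue refers to
If there is no issue, return an empty list.

Sentence: {sentence}
"""

def build_prompt(sentence: str) -> str:
    """Fill the template with the record's text before sending it to a model."""
    return PROOFREAD_PROMPT_TEMPLATE.format(sentence=sentence)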

Metric used: recall@1

We use the recall@1 metric to report the evaluation results: only the first issue identified by the LLM is compared against the issue in the evaluation dataset record (each record contains at most one issue).
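As a rough illustration, a minimal sketch of this computation is shown below. It assumes model predictions are keyed by record id; how a “match” on free-text fields such as description was actually judged isn’t detailed in the post, so the match function here is a simple placeholder you would swap for your own comparison (for example, an LLM judge).

# Sketch of recall@1: only the first issue a model returns is compared against
# the (at most one) reference issue per record. Assumptions noted above.
from typing import Callable

def fields_match(predicted: str, reference: str) -> bool:
    """Placeholder match: case-insensitive exact comparison."""
    return predicted.strip().lower() == reference.strip().lower()

def recall_at_1(records: list, predictions: dict, field: str,
                match: Callable[[str, str], bool] = fields_match) -> float:
    """predictions[record_id] is the list of issues a model returned for that record."""
    hits, total = 0, 0
    for record in records:
        if not record["issues"]:
            # Records with no reference issue are skipped in this sketch; the post
            # does not specify how they were scored.
            continue
        total += 1
        predicted = predictions.get(record["id"], [])
        if predicted and match(predicted[0][field], record["issues"][0][field]):
            hits += 1
    return hits / total if total else 0.0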

Results

Table 1: Recall@1 rates of LLM models evaluated

Model                      | Proprietary / Open-source | Description recall | Issue text recall | Issue type recall | Average recall
Llama-4-Maverick           | Open-source               | 0.8546             | 0.6291            | 0.7469            | 0.7435
Claude 3.7 Sonnet          | Proprietary               | 0.8471             | 0.5714            | 0.7444            | 0.7210
gpt-4.1                    | Proprietary               | 0.8421             | 0.6566            | 0.7343            | 0.7443
gpt-4o                     | Proprietary               | 0.8421             | 0.5789            | 0.7268            | 0.7159
gpt-4.1-mini               | Proprietary               | 0.8371             | 0.6165            | 0.7368            | 0.7301
Deepseek V3                | Open-source               | 0.8271             | 0.5714            | 0.7494            | 0.7160
Claude 3.5 Haiku           | Proprietary               | 0.8271             | 0.5940            | 0.7218            | 0.7143
Deepseek R1                | Open-source               | 0.8195             | 0.5639            | 0.7293            | 0.7042
gpt-4o-mini                | Proprietary               | 0.8070             | 0.6115            | 0.7143            | 0.7109
gpt-4.1-nano               | Proprietary               | 0.7920             | 0.6190            | 0.6215            | 0.6775
Llama-4-Scout              | Open-source               | 0.7594             | 0.6216            | 0.6366            | 0.6725
Gemini 2.0 Flash Lite      | Proprietary               | 0.7268             | 0.5338            | 0.6892            | 0.6499
Gemini 2.0 Flash           | Proprietary               | 0.7168             | 0.5138            | 0.6867            | 0.6391
Gemini 1.5 Pro             | Proprietary               | 0.7018             | 0.5388            | 0.6842            | 0.6416
Meta llama 3.3 turbo       | Open-source               | 0.5840             | 0.5815            | 0.5238            | 0.5631
Meta llama 3.3 turbo free  | Open-source               | 0.5789             | 0.5815            | 0.5188            | 0.5597

Figure 1: Recall@1 rates of LLM models evaluated


Analysis

Observed results

We placed the most emphasis on description recall, followed by issue_text recall, and finally issue type recall. We ranked the criteria this way because models sometimes report the full sentence as the issue text rather than pinpointing the most important word(s) within it.

Additionally, certain issue types may overlap: some Punctuation errors, for example, could also be classified as Grammar & Spelling issues. Because of this flexibility in what counts as a “correct” issue type, we prioritized the most relevant and specific issue for each record.

From the reported metrics, it’s clear that the open-source LLM, Llama-4-Maverick, performed the strongest at describing identified issues. Scoring closely behind were the proprietary models Claude 3.7 Sonnet, gpt-4.1 and gpt-4o.

The worst performers in this task were Meta Llama 3.3 turbo and Meta Llama 3.3 turbo free.

Evaluation Considerations & Limitations

  1. Factual Accuracy Metric

While this issue doesn’t necessarily affect the evaluation dataset we created (which consists primarily of general natural language), it’s important to note that the factual accuracy metric can be flawed, especially for highly specialized subjects or information that requires additional research.

For example, if you were to ask an LLM like gpt-4o or the latest gpt-4.1 to identify issues in the following sentence:

“In the realm of evaluating large language models (LLMs) for coding tasks, the MBPP (Multi-step Benchmark for Programming Problems) stands out due to its unique focus on multi-step programming challenges.”

These models might return no issues. At first glance the sentence seems fine, but a quick Google search reveals that MBPP stands for “Mostly Basic Python Programming,” not “Multi-step Benchmark for Programming Problems.”

This underscores the challenge LLMs face in evaluating information that depends on up-to-date or highly technical knowledge. Without access to external tools like search engines or specialized databases, these models can miss or misidentify factual inaccuracies.

  2. One issue, multiple issue types

During the process of limiting entries to one issue per record, an interesting case was observed. Consider the following record:

{
      "id": 386,
      "text": "Utilizing cutting edge technology it’s imperative for success.",
      "issues": [
        {
          "type": "Clarity",
          "description": "The sentence structure is unclear; it seems to be missing a verb linking 'Utilizing cutting edge technology' to the predicate. Consider rephrasing for clarity, such as: 'Utilizing cutting-edge technology is imperative for success.'",
          "start": 0,
          "end": 48,
          "issue_text": "Utilizing cutting edge technology it’s imperative for success."
        },
        {
          "type": "Grammar & Spelling",
          "description": "The sentence lacks proper grammatical structure; it is a fused sentence without a main verb in the first clause. A verb like 'is' is needed to make it grammatically correct.",
          "start": 0,
          "end": 48,
          "issue_text": "Utilizing cutting edge technology it’s imperative for success."
        }
      ]
    },

In this case, both issues were relevant to the text. Both observations point to the lack of the verb “is,” which affects both the clarity and grammatical structure of the sentence.

This example highlights that the identification of issue types isn’t always completely clear-cut. Multiple issue types may be applicable to a single issue, and in such cases, scoring the descriptions of the issues becomes particularly important.

Conclusion

The evaluations show that most of the tested LLMs performed well, with description recall@1 scores in the 80% range. Llama-4-Maverick stands out as the top performer, achieving a description recall@1 of 85.46%, with Claude 3.7 Sonnet, gpt-4.1 and gpt-4o following closely behind.
