CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Researchers from Carnegie Mellon University and Microsoft Research introduced CodeTaste, a benchmark evaluating LLMs' ability to perform human-like code refactoring. The study found that while models like GPT-4 and Claude 3 can execute explicit refactoring instructions, they struggle to independently identify the specific improvements human developers would choose. The benchmark uses real-world refactoring tasks from open-source repositories and combines test suites with static analysis to verify behavior-preserving improvements.

Researchers from Carnegie Mellon University and Microsoft Research have introduced CodeTaste, a novel benchmark designed to evaluate how well large language model (LLM) coding agents can understand and execute the nuanced, structural improvements known as code refactoring. This work moves beyond simple code generation to assess a model's ability to improve code quality in ways that align with human developer intuition, a critical step toward building truly collaborative AI pair programmers.

Key Takeaways

  • Researchers created the CodeTaste benchmark, comprising real-world refactoring tasks mined from multi-file changes in open-source repositories.
  • Experiments show a significant gap in LLM performance: agents execute detailed refactoring instructions reliably but struggle to independently discover the specific improvements a human developer would choose.
  • A "propose-then-implement" strategy, where the model first suggests multiple refactoring plans, improves alignment with human choices, especially when the best-aligned proposal is selected before implementation.
  • The benchmark uses a combination of repository test suites and custom static analysis to verify that refactorings are behavior-preserving and successfully introduce desired structural patterns.

Benchmarking AI's Sense of Code Quality

The CodeTaste benchmark is constructed from real, large-scale refactoring commits in open-source projects, transforming them into tasks where an AI agent is given a code snippet and a "focus area" for improvement. The agent's goal is not just to write functional code but to produce a refactoring that matches the precise, structural change a human developer made. Scoring is rigorous, combining the project's own test suites to ensure behavior is preserved with custom static checks that use dataflow reasoning to verify the removal of bad patterns (like duplication) and the introduction of good ones (like improved abstraction).

The core finding reveals a distinct capability split in current frontier models, including GPT-4 and Claude 3. When provided with explicit, step-by-step refactoring instructions (e.g., "extract this method and rename that variable"), agents perform competently. However, when given only a high-level prompt about a code section needing improvement, their ability to independently identify the exact refactoring a human selected drops significantly. This indicates that while LLMs possess vast syntactic knowledge, their "taste" for code quality—the intuitive judgment of which structural change is most beneficial—is not yet fully aligned with human expertise.

Industry Context & Analysis

This research directly addresses a growing pain point in the AI-assisted development landscape. While benchmarks like HumanEval and MBPP measure raw code generation accuracy, and tools like GitHub Copilot excel at autocompletion, the long-term maintainability of AI-generated code is a major concern. A 2023 study from GitClear analyzed 153 million lines of code and found that AI-assisted code was more likely to be reverted or updated shortly after being written, suggesting it may introduce more "churn" and technical debt. CodeTaste introduces a vital new metric: architectural alignment.

The proposed "propose-then-implement" decomposition is a pragmatic engineering solution that mirrors best practices in human software design. It effectively separates the creative, design-oriented task of identifying refactorings from the mechanical task of implementing them. This approach shows how LLM performance can be boosted not just by scaling parameters—as seen in the race from 70B-parameter dense models to sparse mixture-of-experts models with over 1T parameters—but by structuring the task itself. It contrasts with the more monolithic, single-pass code generation that characterizes many current AI coding tools.
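The two-stage decomposition can be sketched as a small pipeline. This is an assumed shape, not the paper's exact prompts: `llm` stands in for any chat-completion call, and `rank` for whatever heuristic or model selects the best-aligned plan before implementation.

```python
"""Minimal sketch of the propose-then-implement pattern: first elicit several
high-level refactoring plans, then implement only the selected one. The
prompts and the selection step are illustrative assumptions."""
from typing import Callable, List


def propose_then_implement(
    code: str,
    focus: str,
    llm: Callable[[str], str],
    rank: Callable[[List[str]], str],
    n_proposals: int = 3,
) -> str:
    # Stage 1 (design): ask for plans, explicitly not code, so the model
    # commits to a structural decision before writing anything.
    plans = [
        llm(
            f"Suggest one refactoring plan for the {focus} issue below. "
            f"Describe the plan only; write no code.\n{code}"
        )
        for _ in range(n_proposals)
    ]
    # Stage 2 (implementation): mechanically apply the chosen plan.
    best = rank(plans)
    return llm(
        f"Apply exactly this refactoring plan to the code.\n"
        f"Plan: {best}\nCode:\n{code}"
    )
```

Separating the stages also creates a natural human-in-the-loop hook: `rank` can be replaced by a developer picking from the proposed plans.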

Furthermore, the use of static analysis for verification is a significant technical differentiator. Unlike simply relying on test passage, which only confirms functional correctness, the dataflow-based checks in CodeTaste can validate structural correctness. This is akin to the difference between a linter that catches style errors and a tool like SonarQube or CodeClimate that identifies deeper code smells and security vulnerabilities. It provides a more nuanced signal for training and evaluating models on quality, not just functionality.

What This Means Going Forward

The immediate beneficiaries of this work are organizations building the next generation of AI coding assistants, such as Replit, Sourcegraph Cody, and the teams behind GitHub Copilot and Amazon CodeWhisperer. CodeTaste provides a concrete benchmark and a potential preference signal—the "human-chosen" refactoring—that can be used to fine-tune models to have better architectural judgment. This could lead to assistants that not only suggest lines of code but proactively flag areas for simplification and offer high-quality refactoring options.

For development teams, the research underscores that current AI is best leveraged as a junior engineer capable of executing clear instructions, not as a senior architect. The workflow implication is clear: developers will need to provide more specific, high-level design direction to get the best structural improvements from AI tools. The "propose-then-implement" pattern could itself be productized, creating a new class of interactive refactoring tools where the AI suggests several improvement paths for a developer to select from before it makes any changes.

Watch for two key developments next. First, the integration of benchmarks like CodeTaste into the model evaluation suites of major AI labs, alongside traditional metrics like HumanEval scores. Second, the emergence of specialized "code quality" models or fine-tunes that trade some raw coding breadth for deeper understanding of software design principles, potentially trained on curated datasets of exemplary refactoring commits. The race is no longer just about writing code that works, but about writing code that lasts.

This article is an in-depth analysis and rewrite based on coverage from arXiv cs.AI.