CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Researchers introduced CodeTaste, a novel benchmark evaluating Large Language Models' ability to discover human-level code refactorings. The study found frontier LLMs perform reliably with explicit instructions but struggle to autonomously identify structural improvements aligning with human judgment. CodeTaste provides both an evaluation target and potential preference signal for training future coding agents to better match human refactoring intuition.

Researchers have introduced CodeTaste, a novel benchmark designed to evaluate the ability of Large Language Model (LLM) coding agents to perform and, more critically, to *discover* the types of code refactorings that human developers implement in real-world projects. This work highlights a significant, previously unmeasured gap in AI-assisted software development: while models can follow explicit refactoring instructions, they struggle to autonomously identify the specific structural improvements that align with human judgment, a core competency for truly collaborative AI pair programmers.

Key Takeaways

  • Researchers created the CodeTaste benchmark by mining real, multi-file refactoring changes from open-source repositories to test LLM agents.
  • Frontier LLMs perform reliably when given detailed refactoring specifications but often fail to discover the human-chosen refactoring when given only a general "focus area" for code improvement.
  • A "propose-then-implement" strategy, where the agent first suggests multiple refactoring options, improves alignment with human decisions.
  • Selecting the best-aligned proposal from multiple candidates before implementation yields further performance gains on the benchmark.
  • CodeTaste provides both an evaluation target and a potential preference signal for training future coding agents to better match human refactoring intuition.

Benchmarking AI's Architectural Intuition

The CodeTaste benchmark addresses a fundamental limitation in current evaluations of coding LLMs. While benchmarks like HumanEval and MBPP test the ability to generate functionally correct code from scratch, they do not assess an agent's skill in improving *existing* codebases—a task that constitutes a massive portion of professional software engineering. CodeTaste is constructed by analyzing actual commit histories from open-source projects to extract instances where developers performed behavior-preserving transformations to reduce complexity, eliminate duplication, or pay down architectural debt.
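The paper does not spell out the exact mining pipeline, but the idea is straightforward to sketch. The snippet below, using GitPython, shows one way such candidates could be harvested: scan a repository's history for multi-file commits whose messages suggest a behavior-preserving cleanup. The keyword list and the multi-file threshold are illustrative assumptions, not the benchmark's actual criteria.

```python
# Illustrative sketch of mining candidate refactoring commits with GitPython.
# The keyword filter and the multi-file requirement are assumptions for
# demonstration; CodeTaste's real curation pipeline is not reproduced here.
from git import Repo

REFACTOR_KEYWORDS = ("refactor", "extract method", "deduplicate", "simplify")

def candidate_refactoring_commits(repo_path: str, rev: str = "HEAD"):
    repo = Repo(repo_path)
    for commit in repo.iter_commits(rev):
        message = commit.message.lower()
        if not any(kw in message for kw in REFACTOR_KEYWORDS):
            continue
        touched_files = list(commit.stats.files)   # paths changed in this commit
        if len(touched_files) < 2:                 # keep multi-file changes only
            continue
        yield {
            "sha": commit.hexsha,
            "message": commit.message.strip(),
            "files": touched_files,
        }

if __name__ == "__main__":
    # Hypothetical local checkout of an open-source project.
    for task in candidate_refactoring_commits("./some-oss-repo"):
        print(task["sha"][:8], task["files"])
```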

To score an LLM agent's performance, CodeTaste employs a two-pronged verification system. First, it runs the repository's own test suites to ensure the refactoring does not break existing functionality. Second, and more innovatively, it uses custom static analysis checks. These checks employ dataflow reasoning to verify both the removal of undesirable code patterns (like code duplication or overly complex methods) and the introduction of desired patterns (like extracted methods or simplified conditionals). This moves evaluation beyond mere syntactic correctness to assess structural quality.
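As a rough illustration of that two-pronged setup, the sketch below runs a project's test suite and then compares a toy structural check before and after a change. The real benchmark uses custom dataflow-based checkers; the ast-based "long function" detector here is a deliberately simplified stand-in, and the statement budget is an arbitrary assumption.

```python
# Simplified sketch of two-pronged verification: (1) run the repository's own
# tests, (2) compare structural flags before and after the refactoring.
# CodeTaste's real checks use dataflow reasoning; this ast-based detector of
# overly long functions is only an illustrative stand-in.
import ast
import subprocess
from pathlib import Path

def tests_pass(repo_dir: str) -> bool:
    """Prong 1: behavior preservation, approximated by the existing test suite."""
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0

def long_functions(repo_dir: str, max_statements: int = 30) -> set[str]:
    """Prong 2 (toy): flag functions with too many top-level statements."""
    flagged = set()
    for path in Path(repo_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and len(node.body) > max_statements:
                flagged.add(f"{path.relative_to(repo_dir)}:{node.name}")
    return flagged

def refactoring_accepted(before_dir: str, after_dir: str) -> bool:
    before, after = long_functions(before_dir), long_functions(after_dir)
    # Accept only if tests still pass, some flagged pattern was removed,
    # and no new flagged pattern was introduced (proper-subset check).
    return tests_pass(after_dir) and after < before
```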

Industry Context & Analysis

The introduction of CodeTaste arrives at a pivotal moment in the AI coding assistant market. Tools like GitHub Copilot, Amazon Q Developer, and JetBrains AI Assistant have achieved widespread adoption by excelling at code completion and simple generation. However, their capability for higher-order architectural reasoning remains a frontier. This benchmark quantifies that gap, showing that even frontier models like GPT-4 and Claude 3, which achieve pass@1 scores above 80% on HumanEval, struggle with the open-ended discovery task in CodeTaste.

This challenge is distinct from, and arguably more complex than, traditional code generation. Unlike generating a function from a docstring, refactoring discovery requires a deep understanding of code semantics, design patterns, and idiomatic style within a specific codebase's context. It's an exercise in software engineering taste—hence the benchmark's name. The research findings suggest that current LLMs, trained on vast corpora of code, can recognize many bad patterns but lack a refined, contextual model of which specific refactoring is most appropriate and aligned with team norms.

The proposed "propose-then-implement" decomposition is a pragmatic engineering insight that mirrors successful human workflows. It effectively separates the creative/design phase (generating candidate solutions) from the execution phase (implementing the chosen one). This approach can be enhanced by integrating external linters or static analysis tools during the proposal phase to ground the LLM's suggestions in established best practices. The performance gain from selecting the best-aligned proposal also points toward a future where reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO) using datasets like CodeTaste could fine-tune models specifically for architectural judgment.

What This Means Going Forward

For enterprise development teams, this research underscores that current AI assistants are powerful junior engineers for writing new code but are not yet reliable senior architects for system redesign. The immediate implication is that workflows should be designed to provide these agents with specific, constrained refactoring instructions rather than expecting them to autonomously identify the optimal large-scale improvement. The "propose-then-implement" pattern offers a practical framework for integrating AI into code review and tech debt reduction processes.

For AI model developers and tool vendors, CodeTaste establishes a crucial new north-star metric. Moving beyond raw correctness to measuring alignment with human design preferences is the next competitive battleground for coding agents. We can expect rapid iteration on models fine-tuned or trained with similar preference data. This could lead to specialized "architect" models or fine-tunes distinct from general-purpose coding models. Furthermore, the integration of CodeTaste-like evaluation into continuous integration pipelines could automatically flag AI-suggested changes that deviate from a team's established refactoring patterns.

The key trend to watch is the emergence of context-aware coding agents. The next leap will come from systems that don't just see the current file but understand the project's commit history, coding conventions, and architectural documents. The ability to answer "What refactoring would the senior lead on this project approve?" is the ultimate test CodeTaste points toward. Success here would transform AI from a code suggestion tool into a genuine collaborative partner in software design and maintenance, fundamentally changing the economics of managing large, legacy codebases.

This article is an in-depth analysis and adaptation based on coverage from arXiv cs.AI.