Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

The RealPref benchmark evaluates how well large language models follow complex user preferences over extended interactions. It features 100 user profiles, 1,300 personalized preferences, and four types of preference expression, ranging from explicit to implicit. Key findings show that LLM performance declines significantly as context length increases and as preferences become more implicit.

RealPref, a new benchmark for evaluating how well large language models (LLMs) follow complex user preferences over long-term interactions, addresses a critical gap in the development of truly personalized AI assistants. As models move from general-purpose chatbots to dedicated personal agents, their ability to understand and adhere to nuanced, evolving user needs becomes paramount, making this research foundational for the next wave of AI applications.

Key Takeaways

  • The RealPref benchmark is designed to evaluate realistic, long-term preference-following in personalized user-LLM interactions.
  • It features 100 user profiles, 1,300 personalized preferences, four types of preference expression (from explicit to implicit), and long-horizon interaction histories.
  • Evaluation uses three question types (multiple-choice, true-or-false, open-ended) with detailed rubrics for LLM-as-a-judge assessment.
  • Key findings show that LLM performance drops significantly as context length increases and as preferences become more implicit, and that models struggle to generalize learned preferences to unseen scenarios.
  • The benchmark and code are publicly available to support future research into user-aware AI assistants.

Benchmarking the Personal AI Assistant

The core innovation of RealPref is its focus on the longitudinal and multifaceted nature of real-world human preferences. Unlike static question-answer datasets, it simulates extended interactions in which a user's stated and unstated preferences must be tracked and applied. The benchmark's construction is comprehensive: 100 distinct user profiles, each with an average of 13 associated preferences, for a total of 1,300 unique preference statements.
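To make this structure concrete, here is a minimal sketch of how one such record could be represented in Python. The class and field names (`PreferenceType`, `UserProfile`, `turn_introduced`, and so on) are illustrative assumptions based on the description above, not RealPref's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class PreferenceType(Enum):
    """The four expression types described by the benchmark."""
    EXPLICIT_STATEMENT = "explicit_statement"
    IMPLICATION = "implication"
    BEHAVIOR_DEMONSTRATION = "behavior_demonstration"
    CONTRADICTION_CORRECTION = "contradiction_correction"


@dataclass
class Preference:
    text: str                   # e.g. "prefers concise, bullet-point answers"
    expression: PreferenceType  # how the preference surfaces in dialogue
    turn_introduced: int        # where in the long-horizon history it appears


@dataclass
class UserProfile:
    user_id: str
    preferences: list[Preference] = field(default_factory=list)  # ~13 per user
    history: list[dict] = field(default_factory=list)            # chat turns
```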

These preferences are expressed across a spectrum of four types: Explicit Statement, Implication, Behavior Demonstration, and Contradiction Correction. This range is crucial for testing an LLM's ability to move beyond simple command-following to inferring intent from subtler cues, much as a human assistant would. The evaluation framework employs multiple-choice, true-or-false, and open-ended questions to test comprehension, recall, and application. Detailed rubrics enable automated scoring via an LLM-as-a-judge methodology, a technique popularized by frameworks like MT-Bench and AlpacaEval.
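A rubric-driven LLM-as-a-judge pass could look something like the sketch below. The prompt template and the `judge_client.complete` call are hypothetical placeholders; the paper's actual judge prompts and rubrics are not reproduced here.

```python
JUDGE_TEMPLATE = """You are grading a personalized assistant's answer.

Rubric:
{rubric}

User preference(s) in play:
{preferences}

Question: {question}
Model answer: {answer}

Reply with PASS or FAIL, then a one-sentence justification."""


def judge_answer(judge_client, rubric: str, preferences: str,
                 question: str, answer: str) -> bool:
    """Grade one open-ended answer against its rubric with a judge model.

    `judge_client` is an assumed interface exposing complete(prompt) -> str;
    substitute your provider's real API.
    """
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, preferences=preferences,
                                   question=question, answer=answer)
    verdict = judge_client.complete(prompt)
    return verdict.strip().upper().startswith("PASS")
```

Returning a binary PASS/FAIL keeps aggregation simple; a graded rubric would instead parse a numeric score out of the verdict.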

Industry Context & Analysis

The development of RealPref arrives at a pivotal moment in the AI industry's shift from generalist chatbots to specialized, persistent agents. Companies like OpenAI, Anthropic, and Google are heavily investing in agentic frameworks where an LLM can perform multi-step tasks. However, most public benchmarks, such as MMLU (Massive Multitask Language Understanding) or HumanEval (code generation), test broad knowledge or specific skills, not sustained personalization. RealPref fills this void by providing a standardized test for a core competency of future AI products: personalized memory and adaptation.

The paper's key finding, that performance degrades with longer context and more implicit preferences, has direct technical and product implications. It underscores a fundamental limitation of current transformer-based architectures: while context windows have grown to 200K tokens in models like Claude 3 and as much as a million in Gemini 1.5, the ability to reliably attend to and reason over distant, subtle details remains a challenge. This is not just a scaling issue but a reasoning one. Furthermore, the struggle to generalize preferences to unseen scenarios suggests that current fine-tuning and prompting techniques may create narrow "user simulators" rather than models with a deep, transferable theory of mind about user intent.
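One way to observe this degradation outside the benchmark is a simple stress test: state a preference early in a synthetic conversation, bury it under unrelated filler turns, and check whether the model still honors it at the end. The sketch below assumes a generic `chat(messages) -> str` client and a crude keyword check; it illustrates the idea rather than RealPref's actual protocol.

```python
def build_probe(preference: str, filler_turns: int) -> list[dict]:
    """Plant a preference at turn 1, then bury it under unrelated filler."""
    messages = [
        {"role": "user", "content": f"For future reference: {preference}"},
        {"role": "assistant", "content": "Noted, I'll keep that in mind."},
    ]
    for i in range(filler_turns):
        messages.append({"role": "user", "content": f"Quick question: what is {i} + {i}?"})
        messages.append({"role": "assistant", "content": f"{i} + {i} = {i + i}."})
    messages.append({"role": "user", "content": "Now, recommend a dinner spot for tonight."})
    return messages


def recall_rate(chat, preference: str, keyword: str, depths=(10, 100, 1000)) -> dict:
    """Check whether the reply still reflects the preference at growing depths."""
    return {d: keyword.lower() in chat(build_probe(preference, d)).lower()
            for d in depths}
```

For example, `recall_rate(chat, "I am vegetarian; never suggest meat.", "vegetarian")` shows whether recall survives at each depth; a judge model would be a more robust grader than keyword matching.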

This research connects to a broader trend of "human-like" interaction benchmarks. For instance, Chatbot Arena ranks models based on human preference votes, but these are often single-turn interactions. RealPref adds the critical dimension of continuity. Its release as an open-source project on GitHub allows for immediate integration into the development cycles of both academic labs and industry teams, enabling direct comparison of different architectural approaches, such as retrieval-augmented generation (RAG) for memory versus fine-tuning, on a common, rigorous footing.
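For teams comparing those approaches, the retrieval side can start as simply as the sketch below: store each preference as an embedding and retrieve the most similar ones at query time. The `embed` function is an assumed `text -> np.ndarray` callable; a production system would use a real embedding model and a vector store.

```python
import numpy as np


class PreferenceMemory:
    """Toy retrieval-augmented memory: store preferences, retrieve by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed              # assumed callable: str -> np.ndarray
        self.texts: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, preference: str) -> None:
        """Remember a preference surfaced anywhere in the interaction history."""
        self.texts.append(preference)
        self.vecs.append(self.embed(preference))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored preferences most relevant to the current query."""
        q = self.embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vecs]
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

    def augment_prompt(self, user_message: str) -> str:
        """Prepend retrieved preferences so the model sees them in-context."""
        relevant = "\n".join(f"- {p}" for p in self.retrieve(user_message))
        return f"Known user preferences:\n{relevant}\n\nUser: {user_message}"
```

The trade-off this makes explicit: retrieval keeps the context short but depends on the query surfacing the right preferences, whereas fine-tuning bakes preferences in at the cost of per-user training.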

What This Means Going Forward

For AI developers and product teams, RealPref provides a crucial tool for the road ahead. The benchmark will immediately benefit companies building persistent AI companions, such as Inflection AI's Pi or Meta's ongoing AI agent work, by giving them a quantifiable metric for personalization beyond user satisfaction surveys. It creates a clear research target: improving scores on RealPref's implicit and long-horizon tasks should translate directly into more capable and "stickier" personal AI products.

The market implication is significant. As AI features become commoditized, the quality of personalization will be a key differentiator. A model that scores highly on RealPref could command a premium in consumer and enterprise markets, similar to how models with top HumanEval scores are favored for developer tools. We should expect rapid iteration, with leading labs likely to publish RealPref scores alongside traditional benchmarks in their next model releases.

Watch for several key developments in the wake of this benchmark. First, how quickly major closed-source and open-source models (e.g., GPT-4, Claude 3, Llama 3) are evaluated on it, establishing a new performance hierarchy. Second, the emergence of novel techniques specifically designed to tackle the long-horizon implicit preference problem, potentially involving new memory architectures or reinforcement learning from long-term feedback. Finally, RealPref may spawn a subfield of similar benchmarks for specific domains like healthcare adherence or personalized education, cementing the evaluation of longitudinal user understanding as a cornerstone of applied AI research.

This article is an in-depth analysis and rewrite based on coverage from arXiv cs.AI.