Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

Researchers have introduced RealPref, a new benchmark designed to rigorously test how well large language models (LLMs) understand and adhere to complex, long-term user preferences in realistic assistant scenarios. This work addresses a critical gap in AI evaluation, moving beyond single-turn tasks to assess the nuanced challenge of personalization over extended conversations, which is essential for the next generation of truly helpful AI assistants.

Key Takeaways

  • RealPref is a new benchmark for evaluating realistic, long-term preference-following in LLM-based personal assistants.
  • It features 100 user profiles, 1,300 personalized preferences, four types of preference expression (explicit to implicit), and long-horizon interaction histories.
  • Evaluation includes multiple-choice, true-or-false, and open-ended questions, with detailed rubrics for LLM-as-a-judge scoring.
  • Key findings show LLM performance significantly degrades with longer context and more implicit preferences, and models struggle to generalize preferences to unseen scenarios.
  • The benchmark's code and data are publicly available to spur further research into user-aware AI systems.

Benchmarking Real-World Personalization

The RealPref benchmark is constructed to mirror the complexity of real-world human-AI interaction. It is built around 100 distinct user profiles, which together contain 1,300 personalized preferences covering diverse aspects of daily life, from communication style to dietary restrictions. Crucially, the benchmark tests four progressively challenging types of preference expression: Explicit (direct user statements), Descriptive (detailed narratives), Implicit (inferred from behavior), and Latent (underlying values not directly stated).
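To make the structure concrete, here is a minimal sketch of how such a profile might be modeled in code. This is an illustrative data model only, not RealPref's actual schema; the class names, fields, and example values are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum

class ExpressionType(Enum):
    """The four preference-expression types described for RealPref,
    ordered roughly from most to least direct."""
    EXPLICIT = "explicit"        # direct user statement
    DESCRIPTIVE = "descriptive"  # detailed narrative about the user
    IMPLICIT = "implicit"        # inferred from behavior in the history
    LATENT = "latent"            # underlying value, never stated outright

@dataclass
class Preference:
    topic: str                   # e.g. "diet", "communication style"
    expression: ExpressionType   # how the preference surfaces
    evidence: str                # where it appears in the interaction history

@dataclass
class UserProfile:
    user_id: str
    preferences: list[Preference]

# Hypothetical example profile:
profile = UserProfile(
    user_id="user_042",
    preferences=[
        Preference("diet", ExpressionType.EXPLICIT, "I'm vegetarian."),
        Preference("communication style", ExpressionType.IMPLICIT,
                   "User repeatedly asks the assistant to shorten long replies."),
    ],
)
```

A model under evaluation never sees this structured form for implicit and latent preferences; it must recover them from the raw interaction history, which is precisely what makes those categories harder.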

To simulate long-term use, RealPref incorporates extended interaction histories, testing an LLM's ability to maintain a coherent user model over time. Evaluation is conducted through three question formats—multiple-choice, true-or-false, and open-ended—with performance judged using detailed, structured rubrics designed for reliable LLM-as-a-judge assessment. The initial results are revealing: model accuracy notably decreases as the conversational context lengthens and as preference cues shift from explicit to implicit. Furthermore, LLMs show a marked difficulty in correctly applying learned user preferences to novel, unseen situations.
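For the open-ended format, rubric-based LLM-as-a-judge scoring typically reduces to a weighted aggregate of per-criterion judgments. The sketch below illustrates that pattern; the criteria, weights, and `judge` interface are hypothetical stand-ins, not RealPref's actual rubric or judging pipeline.

```python
from typing import Callable

# Illustrative rubric: criterion name -> weight (weights are assumptions).
RUBRIC = {
    "preference_adherence": 0.5,       # respects the stated/implied preference?
    "helpfulness": 0.3,                # actually answers the user's request?
    "consistency_with_history": 0.2,   # coheres with the prior conversation?
}

def score_answer(answer: str, context: str,
                 judge: Callable[[str, str, str], float]) -> float:
    """Weighted average of per-criterion judge scores in [0, 1].

    `judge(criterion, answer, context)` stands in for an LLM-as-a-judge
    call that rates one criterion at a time.
    """
    total_weight = sum(RUBRIC.values())
    weighted = sum(weight * judge(criterion, answer, context)
                   for criterion, weight in RUBRIC.items())
    return weighted / total_weight

# Example with a stub judge that awards full marks on every criterion:
full = score_answer("sample answer", "sample context",
                    lambda criterion, answer, context: 1.0)
print(full)  # 1.0
```

Scoring one criterion per judge call, rather than asking for a single holistic grade, is a common way to make LLM-as-a-judge assessments more reliable and auditable.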

Industry Context & Analysis

The introduction of RealPref arrives at a pivotal moment in the AI industry's shift from general-purpose chatbots to personalized assistants. While current models like OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini excel at broad knowledge tasks, their ability to maintain a persistent, nuanced understanding of an individual user remains a significant frontier. This benchmark provides the missing tool to quantify progress in that domain.

Unlike existing evaluation suites that focus on short-term reasoning (e.g., MMLU for knowledge or HumanEval for coding), RealPref specifically targets the longitudinal and inferential challenges of personalization. Its findings on context-length degradation highlight a practical limitation of current transformer-based architectures, which can struggle with information retrieval from the middle of very long sequences—a problem that techniques like Google's Gemini 1.5 with its 1M+ token context window are actively trying to solve.

The benchmark's structure also implicitly critiques the prevailing "stateless" interaction model of most chatbots. The poor performance on implicit and latent preferences suggests that simply feeding a prior conversation into a context window is insufficient. This underscores the industry's growing investment in long-term memory and user-modeling systems, as seen in startups like Personal.ai and research into "LLM agents" that can maintain and update a persistent user profile. RealPref provides the standardized metrics needed to compare these emerging approaches.

What This Means Going Forward

For AI developers and researchers, RealPref establishes a crucial new axis for model evaluation and competition. Going forward, leaderboards for assistant-style LLMs may need to include a "personalization" or "preference adherence" score alongside traditional benchmarks. This will drive R&D away from pure scale and toward more efficient architectures for reasoning over long-term memory and subtle user intent.

The primary beneficiaries of this research direction will be enterprises building dedicated AI assistants for customer service, coaching, or therapy, where understanding client history and preferences is paramount. It also creates a competitive opportunity for smaller, specialized models that can excel at deep personalization without requiring the massive parameter counts of frontier models.

Key developments to watch will be how quickly major model providers integrate RealPref into their evaluation reports and whether any model demonstrates a breakthrough in handling implicit preferences. Furthermore, the public availability of the benchmark will likely spur a wave of academic and open-source projects focused on techniques like preference distillation, continual learning for user models, and improved context management. RealPref doesn't just identify a problem; it provides the toolkit to start solving it, marking a significant step toward AI that doesn't just answer questions but truly understands the person asking them.


This article is an in-depth analysis and rewrite based on coverage from arXiv cs.AI.