Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

Researchers have introduced RealPref, a new benchmark designed to rigorously test how well large language models (LLMs) can understand and adhere to complex, evolving user preferences over long-term interactions. This work addresses a critical gap in AI evaluation, moving beyond single-turn tasks to assess the nuanced, personalized assistance required for real-world AI companions, with findings revealing significant performance drops as context grows and preferences become less explicit.

Key Takeaways

  • Researchers have developed the RealPref benchmark to evaluate LLMs on realistic, long-term preference-following, featuring 100 user profiles and 1,300 personalized preferences.
  • The benchmark tests four types of preference expression (explicit to implicit) and includes long-horizon interaction histories to simulate extended use.
  • Evaluation uses three question types (multiple-choice, true-or-false, open-ended) with detailed rubrics for LLM-as-a-judge assessment.
  • Key findings show LLM performance significantly declines as context length increases and preference expression becomes more implicit, and models struggle to generalize understanding to unseen scenarios.
  • The code and benchmark are publicly available, providing a foundation for developing more user-aware AI assistants.

Introducing the RealPref Benchmark

The newly proposed RealPref benchmark is a systematic framework for evaluating personalized preference-following in AI assistants. It is constructed around 100 detailed user profiles, each containing 13 distinct preferences across categories like food, entertainment, and scheduling, totaling 1,300 personalized preference statements. This structure moves far beyond typical single-query benchmarks to model the complexity of a real user relationship.

To simulate realistic interaction dynamics, RealPref incorporates four progressively challenging types of preference expression. These range from Explicit Preferences (direct user statements) to Implicit Preferences (inferred from behavior or indirect cues), testing an LLM's ability to parse subtlety. The benchmark's core innovation is its use of long-horizon interaction histories, building context over multiple turns to assess if models can maintain a coherent, evolving understanding of a user.
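
To make this structure concrete, the sketch below shows one way such a profile could be represented in code. This is an illustrative assumption, not the paper's released schema: the class names, the two intermediate expression-type labels, and the category strings are invented for the example; only the explicit-to-implicit spectrum, the 13-preferences-per-profile count, and the long-horizon history reflect the description above.

```python
# Hypothetical sketch of a RealPref-style user profile record.
# Field names and the two middle expression-type labels are assumptions;
# the benchmark's actual schema may differ.
from dataclasses import dataclass, field
from enum import Enum


class ExpressionType(Enum):
    """Progressively harder ways a preference can surface in dialogue."""
    EXPLICIT_STATEMENT = "explicit"      # "I'm vegetarian, please avoid meat."
    STATED_CHOICE = "stated_choice"      # hypothetical label: user picks the vegetarian option
    BEHAVIORAL_CUE = "behavioral_cue"    # hypothetical label: user repeatedly skips meat dishes
    IMPLICIT = "implicit"                # must be inferred from indirect context


@dataclass
class Preference:
    category: str            # e.g. "food", "entertainment", "scheduling"
    statement: str           # canonical preference, e.g. "avoids meetings before 10am"
    expression: ExpressionType


@dataclass
class UserProfile:
    user_id: str
    preferences: list[Preference] = field(default_factory=list)    # 13 per profile in RealPref
    interaction_history: list[dict] = field(default_factory=list)  # long-horizon chat turns


# 100 profiles x 13 preferences = 1,300 preference statements overall.
profile = UserProfile(
    user_id="user_001",
    preferences=[
        Preference("food", "prefers vegetarian restaurants", ExpressionType.EXPLICIT_STATEMENT),
        Preference("scheduling", "avoids meetings before 10am", ExpressionType.IMPLICIT),
    ],
)
```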

Evaluation is conducted through three question types: multiple-choice, true-or-false, and open-ended. For consistent scoring, especially on open-ended responses, the benchmark provides detailed rubrics for an LLM-as-a-judge evaluation setup. The initial results are revealing: LLM performance shows a marked decline as the conversational context lengthens and as the required preference inference shifts from explicit to implicit. Furthermore, models exhibit difficulty in generalizing their understanding of a user's preferences to completely new, unseen scenarios.
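
The paper's judging prompts are not reproduced here, but a rubric-driven LLM-as-a-judge pipeline typically looks like the following minimal sketch. The prompt wording, the 0 to 2 score scale, and the `call_judge` wrapper are assumptions for illustration; RealPref's released rubrics may be structured differently.

```python
# Minimal sketch of rubric-based LLM-as-a-judge scoring for open-ended answers.
# The rubric text and score scale are illustrative assumptions.
from typing import Callable

JUDGE_PROMPT = """You are grading an assistant's reply for preference adherence.

User preference: {preference}
Question asked: {question}
Assistant reply: {reply}

Rubric:
- 2: reply fully respects the stated preference
- 1: reply partially respects it or is ambiguous
- 0: reply ignores or contradicts the preference

Answer with a single digit (0, 1, or 2)."""


def judge_open_ended(
    preference: str,
    question: str,
    reply: str,
    call_judge: Callable[[str], str],  # any chat-completion wrapper that returns text
) -> int:
    """Score one open-ended answer against one preference using a judge model."""
    prompt = JUDGE_PROMPT.format(preference=preference, question=question, reply=reply)
    raw = call_judge(prompt).strip()
    digits = [c for c in raw if c.isdigit()]
    return int(digits[0]) if digits else 0  # default to 0 if the judge output is malformed


# Example with a stub judge; replace the lambda with a real model call in practice.
score = judge_open_ended(
    preference="avoids meetings before 10am",
    question="Schedule my weekly sync.",
    reply="I booked it for 8:30am Monday.",
    call_judge=lambda prompt: "0",
)
print(score)  # -> 0
```

Multiple-choice and true-or-false questions can be scored by exact match against a gold label, so the judge model is mainly needed for the open-ended responses.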

Industry Context & Analysis

The introduction of RealPref arrives at a pivotal moment in the AI industry's shift from general-purpose chatbots to personalized agents. While leading models like OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini excel on broad knowledge benchmarks like MMLU (Massive Multitask Language Understanding) or coding tasks like HumanEval, their ability to maintain persistent, nuanced user alignment remains a largely unmeasured frontier. RealPref directly targets this gap, similar to how MT-Bench and AlpacaEval advanced conversation quality evaluation, but with a dedicated focus on long-term personalization.

The benchmark's findings on context-length degradation highlight a fundamental architectural and economic challenge. While models may perform well in short snippets, the documented performance drop over long interactions underscores limitations in current context window utilization and long-term memory mechanisms. This has direct implications for products like Microsoft's Copilot, Google's Gemini Advanced, and startups like Personal.ai or Inflection AI's Pi, whose value propositions hinge on building a deep, persistent understanding of the user. The struggle with implicit preferences further suggests that today's LLMs, often trained on explicit web text, lack robust theory-of-mind capabilities necessary for true personal assistance.

From a technical perspective, RealPref provides a much-needed metric for an emerging research area. The pursuit of persistent personalization is driving innovation in methods like Retrieval-Augmented Generation (RAG) with user-specific vector stores, fine-tuning on user data, and the development of lightweight, continuously updated LLM personas. RealPref offers a standardized way to compare these approaches. Its release as an open-source project on GitHub (a common practice for AI benchmarks like Big-Bench or HELM) will accelerate community adoption and provide a crucial feedback loop for model developers aiming to build the next generation of AI companions.
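
As a rough illustration of the retrieval-augmented approach mentioned above, the sketch below stores preference notes in a toy memory and prepends the most relevant ones to each request. The word-overlap retrieval, function names, and prompt format are assumptions chosen to keep the example self-contained; a real system would use an embedding model and a vector database instead of the lexical heuristic.

```python
# Toy sketch of retrieval-augmented personalization: stored user preferences
# are retrieved by similarity to the incoming query and prepended to the prompt.
from collections import Counter


def _bow(text: str) -> Counter:
    """Bag-of-words counts, a stand-in for a real embedding."""
    return Counter(text.lower().split())


def _overlap(a: Counter, b: Counter) -> int:
    """Number of shared word occurrences between two bags of words."""
    return sum((a & b).values())


def retrieve_preferences(query: str, memory: list[str], k: int = 3) -> list[str]:
    """Return the k stored preference notes most lexically similar to the query."""
    q = _bow(query)
    ranked = sorted(memory, key=lambda m: _overlap(q, _bow(m)), reverse=True)
    return ranked[:k]


def build_personalized_prompt(query: str, memory: list[str]) -> str:
    """Prepend retrieved preferences so the model can condition on them."""
    relevant = retrieve_preferences(query, memory)
    context = "\n".join(f"- {p}" for p in relevant)
    return f"Known user preferences:\n{context}\n\nUser request: {query}"


memory = [
    "prefers vegetarian restaurants",
    "avoids meetings before 10am",
    "likes science fiction novels",
]
print(build_personalized_prompt("book a dinner restaurant for Friday", memory))
```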

What This Means Going Forward

The RealPref benchmark establishes a new performance baseline that will likely shape the development roadmap for AI assistant companies. In the near term, we can expect leading AI labs to report RealPref scores alongside traditional benchmarks, using it to showcase advances in personalization. This could create a competitive sub-category similar to the race for higher HumanEval scores among coding models. Startups focusing on niche, personalized AI will benefit from a clear evaluation framework to demonstrate their specialized value against general-purpose giants.

For users and enterprises, the benchmark underscores a critical evolution in what to expect from AI tools. Effective personalization is not a given; it is a measurable capability that varies between models. As these assistants become more integrated into daily workflows and decision-making, their ability to reliably remember and adapt to individual preferences—a core finding RealPref helps quantify—will become a primary differentiator for adoption and trust.

Key developments to watch will include how quickly model performance on RealPref improves, which architectural innovations (e.g., better long-context processing, recurrent memory units) drive those gains, and whether a leading "RealPref score" emerges as a key marketing metric. Furthermore, the benchmark may spur related work on privacy-preserving personalization techniques, as the user profiles required for such deep adaptation raise significant data security considerations. RealPref has effectively turned the abstract goal of a "personal AI" into a concrete, measurable engineering challenge.

This article is an in-depth analysis and rewrite based on reporting from arXiv cs.AI.