Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

A new research blueprint provides a production-ready framework for evaluating and optimizing conversational shopping assistants (CSAs), particularly for grocery shopping. It introduces a multi-faceted evaluation rubric paired with a calibrated LLM-as-judge pipeline aligned with human annotations, and explores two GEPA-based optimization strategies: Sub-agent GEPA for individual agent nodes and the novel MAMuT GEPA for system-level, multi-turn optimization.

The emergence of conversational shopping assistants (CSAs) as a viable AI application highlights a critical industry gap: the lack of robust, production-ready frameworks for evaluating and optimizing the complex, multi-turn interactions these systems require. A new research blueprint directly addresses this by providing a structured methodology for moving CSAs from prototype to reliable product, with a particular focus on the high-stakes, preference-sensitive domain of grocery shopping.

Key Takeaways

  • A new research paper presents a practical blueprint for evaluating and optimizing production-scale conversational shopping assistants (CSAs), with a focus on grocery shopping.
  • The core contribution is a multi-faceted evaluation rubric that decomposes shopping quality into structured dimensions, paired with a calibrated LLM-as-judge pipeline aligned with human annotations.
  • The research investigates two prompt-optimization strategies using the GEPA optimizer: Sub-agent GEPA for individual agent nodes and a novel MAMuT GEPA for system-level, multi-turn optimization.
  • The authors are releasing rubric templates and evaluation design guidance to support industry practitioners.

A Blueprint for Production-Ready Shopping Assistants

The paper identifies two primary, underexplored challenges in deploying CSAs: evaluating multi-turn interactions and optimizing tightly coupled multi-agent systems. The grocery shopping domain amplifies these difficulties, as user requests are often underspecified (e.g., "something healthy for dinner"), highly sensitive to personal preferences (e.g., dietary restrictions, brand loyalty), and constrained by practical factors like budget and real-time inventory.

To tackle evaluation, the authors introduce a multi-faceted rubric that decomposes the nebulous concept of "end-to-end shopping quality" into structured, measurable dimensions. This is paired with a calibrated LLM-as-judge pipeline, a technique where a large language model is used to score interactions. Crucially, this automated scoring is aligned with human annotations to ensure reliability, addressing a common pitfall where LLM judges diverge from human judgment.
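The calibration step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's pipeline: the rubric dimensions, the 1–5 scale, and the sample scores are all hypothetical stand-ins, and chance-corrected agreement (Cohen's kappa) is one common way to check whether an LLM judge tracks human annotators on a given dimension.

```python
# Hypothetical rubric dimensions; the paper's exact rubric may differ.
RUBRIC_DIMENSIONS = ["relevance", "preference_adherence", "budget_fit", "coherence"]

def cohen_kappa(a, b, labels=range(1, 6)):
    """Agreement between two raters on a 1-5 scale, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum(
        (sum(x == l for x in a) / n) * (sum(y == l for y in b) / n)
        for l in labels
    )
    return (observed - expected) / (1 - expected)

# Toy calibration check on one dimension: LLM-judge vs. human scores
# for eight interactions (invented numbers, for illustration only).
human = [5, 4, 4, 2, 5, 3, 4, 1]
judge = [5, 4, 3, 2, 5, 3, 4, 2]
print(f"kappa = {cohen_kappa(judge, human):.2f}")  # prints: kappa = 0.68
```

In practice a team would run this check per rubric dimension and only trust (or deploy) the automated judge on dimensions where agreement with human annotators clears an agreed threshold.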

Building on this evaluation foundation, the research explores optimization using the state-of-the-art prompt optimizer GEPA (Shao et al., 2025). It tests two complementary strategies. The first, Sub-agent GEPA, optimizes prompts for individual agent nodes (e.g., a product retrieval agent, a budget checker) against localized performance rubrics. The second is a novel, system-level approach called MAMuT GEPA (Multi-Agent Multi-Turn GEPA) (Herrera et al., 2026). This method jointly optimizes prompts across the entire multi-agent system by using multi-turn simulation and trajectory-level scoring, aiming to improve holistic performance rather than isolated components.
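The structural difference between the two strategies can be illustrated with a toy search loop. This sketch is not GEPA itself (GEPA evolves prompts via reflective mutation; here a simple random search over a prompt pool stands in for it), and the agent names, prompt pool, simulator, and scoring function are all invented. What it shows is the MAMuT-style shape: pick one prompt per agent jointly, roll out a multi-turn simulation, and score the whole trajectory rather than each node in isolation.

```python
import random

# Illustrative prompt candidates per agent node (names are hypothetical).
PROMPT_POOL = {
    "retriever": ["terse retrieval prompt", "detailed retrieval prompt"],
    "budget_checker": ["strict budget prompt", "lenient budget prompt"],
}

def simulate_dialogue(prompts, turns=3, rng=None):
    """Toy multi-turn rollout: returns a list of per-turn quality scores."""
    rng = rng or random.Random(0)
    # Stand-in dynamics: pretend more detailed prompts help, plus noise.
    quality = sum(len(p) for p in prompts.values()) / 100
    return [quality + rng.gauss(0, 0.1) for _ in range(turns)]

def trajectory_score(trajectory):
    """Score the conversation as a whole, not individual turns in isolation."""
    return sum(trajectory) / len(trajectory)

def joint_search(pool, n_candidates=8, seed=0):
    """Jointly assign prompts across agents; keep the best full-system score."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = {agent: rng.choice(ps) for agent, ps in pool.items()}
        score = trajectory_score(simulate_dialogue(candidate, rng=rng))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

best_prompts, score = joint_search(PROMPT_POOL)
print(best_prompts)
```

Sub-agent GEPA, by contrast, would run an analogous loop per agent against a localized rubric; the system-level variant trades that simplicity for the ability to catch cross-agent interactions that only surface over a full conversation.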

Industry Context & Analysis

This work arrives as major tech and retail players aggressively invest in AI shopping agents, yet struggle with consistent quality. For instance, Amazon's Rufus and Walmart's generative AI search represent high-profile deployments, but user feedback often cites issues with relevance, personalization, and handling complex, multi-faceted requests—precisely the gaps this research aims to close. Unlike a purely academic benchmark, this blueprint is framed for production, focusing on the messy realities of inventory constraints and subjective user preferences.

The technical approach contrasts with common industry practices. Many companies rely on simplistic metrics like click-through rate or single-turn task completion, which fail to capture the nuanced success of a conversational journey. The proposed multi-dimensional rubric and LLM-as-judge pipeline offer a more sophisticated alternative. Furthermore, while fine-tuning models on shopping data is a common optimization path, the paper's focus on prompt optimization via GEPA is significant. Prompt optimization is generally faster and less resource-intensive than full model retraining, a practical advantage for rapidly iterating in a live retail environment where product catalogs and promotions change daily.

The introduction of MAMuT GEPA for system-level optimization is a notable advancement. In complex agentic systems, optimizing components in isolation can lead to sub-optimal or even conflicting behaviors—a phenomenon known as the "composition problem." By jointly optimizing across agents and turns, MAMuT GEPA seeks to create a more coherent and effective assistant, analogous to how end-to-end training improved the performance of earlier AI systems like autonomous driving models. This system-level thinking is often missing from first-generation AI agent deployments.

What This Means Going Forward

For enterprise retailers and technology providers, this blueprint provides a much-needed methodological foundation. The release of rubric templates and design guidance lowers the barrier to entry for teams seeking to build evaluation suites that go beyond basic accuracy, measuring critical factors like adherence to budget, preference sensitivity, and conversational coherence. This could accelerate the path to reliable CSA deployment in grocery and adjacent verticals like fashion or electronics retail.
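To make concrete what such a rubric template might look like, here is a hypothetical sketch; the dimension names, scale, and wording are assumptions for illustration and may differ from the templates the authors release.

```python
# Hypothetical rubric template; the released templates may differ.
RUBRIC_TEMPLATE = {
    "scale": {"min": 1, "max": 5},
    "dimensions": {
        "budget_adherence": "Do recommended items keep the basket within the stated budget?",
        "preference_sensitivity": "Are dietary restrictions and brand preferences respected?",
        "conversational_coherence": "Does each turn build sensibly on prior context?",
    },
}

def render_judge_prompt(template, dialogue):
    """Turn a rubric template into a scoring prompt for an LLM judge."""
    lines = [
        f"Score the dialogue on each dimension "
        f"({template['scale']['min']}-{template['scale']['max']}):"
    ]
    for name, criterion in template["dimensions"].items():
        lines.append(f"- {name}: {criterion}")
    lines.append(f"Dialogue:\n{dialogue}")
    return "\n".join(lines)

print(render_judge_prompt(RUBRIC_TEMPLATE, "User: something healthy for dinner..."))
```

Keeping the rubric as data rather than hard-coded prompt text is what makes the template reusable: a fashion or electronics team can swap dimensions without touching the judging pipeline.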

The primary beneficiaries will be product managers and ML engineers tasked with deploying and maintaining these AI systems. They gain a structured playbook for continuous improvement. The research also suggests a shift in competitive advantage: as foundational LLM capabilities become more commoditized, the edge will go to companies that excel at evaluation and optimization of multi-agent workflows—the intricate "orchestration layer" that turns a powerful LLM into a reliable product.

Looking ahead, key developments to watch will be the open-source release and community adoption of the proposed rubrics, and independent benchmarks comparing the performance of Sub-agent versus MAMuT GEPA optimization in real-world settings. Furthermore, as these systems scale, integrating real-time data streams—from dynamic pricing and inventory to individualized purchase history—will be the next frontier. The blueprint provided here establishes the evaluation framework upon which those more advanced, data-rich shopping agents can be rigorously built and improved.
