Building the AI quality flywheel: How TheyDo turns user feedback into better AI

Chris Swart · Senior Machine Learning Engineer

Businesses are buzzing about AI and its potential to revolutionize everything from operations and decision-making to the bottom line. At TheyDo, we’re on board — but we didn’t just drink the Kool-Aid. We put AI to work where it matters.

Our Journey AI transforms unstructured customer data into actionable insights, revealing hidden opportunities and pain points across user journeys. We process multiple input sources to generate valuable outputs that enterprises rely on for strategic decisions. But those insights are only as good as the quality of the AI behind them. A journey riddled with duplicate steps, vague insights, or unnecessary complexity doesn’t just frustrate users — it erodes trust. And in AI, trust is everything.

That’s why we built a structured, self-improving system that turns AI evaluation from a black box into a quality flywheel, ensuring every insight helps you make better decisions, faster — and with confidence.

Here’s how we did it:

The quality challenge: Cracking the measurement code

AI-generated content must be accurate, structured, and meaningful — but ensuring consistent quality at scale is anything but straightforward. AI outputs can vary unpredictably, and what looks good on the surface may fail to provide real value.

So, how do you measure and maintain the quality of AI-generated content at scale?

To tackle this challenge, we combined multiple approaches to ensure AI-generated content meets the highest standards. By leveraging user feedback, automated guardrails, and AI-driven evaluation, we created a system that continuously refines and improves itself. Here’s how each method plays a critical role:

1. User feedback: The gold standard (but too slow)

The best feedback comes directly from our users. When they tell us, “This journey is amazing” or “These steps are confusing”, we get powerful qualitative insights. The problem? Feedback is sporadic, subjective, and arrives too late to prevent poor experiences.

2. Automated guardrails: Catching obvious errors

We implemented automated tests to flag common AI missteps:

  • Duplicate steps or phases

  • Excessive step counts

  • Placeholder text in outputs 

These guardrails acted as a first line of defense, preventing glaring errors before they reached users. But they couldn’t evaluate more subjective elements like clarity, coherence, or value.
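
To make this concrete, here's a minimal sketch of what rule-based guardrails like these can look like. The `JourneyDraft` shape, thresholds, and placeholder markers are illustrative assumptions, not our production code:

```python
from dataclasses import dataclass

# Illustrative thresholds and markers; the real guardrails cover more cases.
MAX_TOTAL_STEPS = 40
PLACEHOLDER_MARKERS = ("lorem ipsum", "tbd", "[placeholder]", "<insert")


@dataclass
class JourneyDraft:
    phases: list[str]  # phase titles, in order
    steps: list[str]   # step titles, in order


def guardrail_violations(draft: JourneyDraft) -> list[str]:
    """Return human-readable reasons to reject the draft before users see it."""
    violations: list[str] = []

    # 1. Duplicate steps or phases
    if len({s.strip().lower() for s in draft.steps}) < len(draft.steps):
        violations.append("duplicate steps detected")
    if len({p.strip().lower() for p in draft.phases}) < len(draft.phases):
        violations.append("duplicate phases detected")

    # 2. Excessive step counts
    if len(draft.steps) > MAX_TOTAL_STEPS:
        violations.append(f"too many steps ({len(draft.steps)} > {MAX_TOTAL_STEPS})")

    # 3. Placeholder text left in outputs
    for step in draft.steps:
        if any(marker in step.lower() for marker in PLACEHOLDER_MARKERS):
            violations.append(f"placeholder text in step: {step!r}")

    return violations
```

Checks like these are cheap enough to run on every generation, which is what makes them useful as a first line of defense before heavier evaluation kicks in.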

3. LLM-as-Judge: Scaling AI evaluation

To address the shortfalls of our first two methods, we introduced an innovative approach: using AI to evaluate AI. We built an LLM-as-Judge system, leveraging large language models (LLMs) to assess the quality of AI-generated journey maps. This allowed us to:

  • Correlate AI-generated scores with real user feedback

  • Scale evaluation beyond what human reviewers could handle

  • Understand how model tweaks impacted performance
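
To illustrate the idea (a hedged sketch, not our actual prompts, schema, or judge model), an LLM-as-Judge evaluator can be as simple as asking a strong model to score a generated journey against a rubric and return structured scores:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating an AI-generated customer journey map.
Score each criterion from 1 (poor) to 5 (excellent) and answer with JSON only:
{{"clarity": int, "coherence": int, "actionability": int, "rationale": str}}

Journey map:
{journey}
"""


def judge_journey(journey_text: str) -> dict:
    """Ask a judge LLM to score a generated journey against a simple rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(journey=journey_text)}],
    )
    return json.loads(response.choices[0].message.content)
```

Because the judge returns numeric scores, its output can be logged next to real user feedback and checked for agreement, which is what keeps the judge itself honest.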

The technical approach

To effectively evaluate and improve our AI-generated outputs, we needed an LLM evaluation framework that met five key requirements:

  • Self-hostable for data privacy

  • Robust production monitoring to track performance

  • Comprehensive API/SDK for seamless integration

  • Visual prompt comparison tools for testing and refinement

  • Experiment management for prompt versioning and iteration

Evaluating our options

We assessed five potential solutions:

  • Braintrust (Closed-source LLM engineering platform)

  • MLflow (Open-source MLOps platform)

  • Chainforge (Visual prompt evaluator)

  • Langfuse (Open-source LLM engineering platform)

  • A custom-built solution on our Honeycomb infrastructure

Feature comparison

| Key feature | Braintrust | MLflow | Chainforge | Langfuse | Custom solution |
| --- | --- | --- | --- | --- | --- |
| Self-hosting | | | | | |
| Ease of setup | ⚠️ Moderate | 🔴 Complex | 🟢 Simple | ⚠️ Moderate | 🔴 Complex |
| Comprehensive API | | | ⚠️ Limited | | |
| Visual prompt tools | | | | | ⚠️ Requires dev |
| Evaluation capabilities | ✅ Advanced | ✅ Advanced | ⚠️ Basic | ✅ Advanced | ⚠️ Custom dev |
| Prompt management | | | | | ⚠️ Requires dev |

Detailed comparison (full 24-criteria evaluation):

| LLM Eval Tools | Braintrust | MLflow | Chainforge | Langfuse | Custom Solution (Honeycomb) |
| --- | --- | --- | --- | --- | --- |
| Core Capabilities | | | | | |
| Self-hosting | | | | | |
| UI for visualizations | | | | | ⚠️ Requires development |
| Programmatic API/SDK | | | ⚠️ Limited | | |
| Production monitoring | | | ⚠️ Limited | | |
| Open source | ⚠️ Only auto-evals | | | | |
| Evaluation Methods | | | | | |
| LLM-as-a-Judge | | | ⚠️ Basic | | ⚠️ Requires implementation |
| Heuristic metrics | | | ⚠️ Limited | | ⚠️ Requires implementation |
| Statistical metrics | | | ⚠️ Limited | | ⚠️ Requires implementation |
| Experiment Management | | | | | |
| Prompt versioning | | | | | ⚠️ Requires implementation |
| A/B testing | | | | | ⚠️ Requires implementation |
| Experiment tracking | | | ⚠️ Limited | | ⚠️ Via Honeycomb |
| Dataset management | | | | | ⚠️ Requires implementation |
| Observability | | | | | |
| Tracing | | | | | |
| Metrics collection | | | ⚠️ Limited | | |
| Cost tracking | | | ⚠️ Limited | | |
| Latency monitoring | | | | | |
| Integration Capabilities | | | | | |
| Multiple LLM providers | | | | | |
| Integration with RAG | | | ⚠️ Limited | | ⚠️ Requires implementation |
| User Experience | | | | | |
| Setup complexity | ⚠️ Moderate | 🔴 High | 🟢 Low | ⚠️ Moderate | 🔴 High |
| Visual prompt comparison | | | | | ⚠️ Requires development |
| Team collaboration | | | ⚠️ Limited | | ⚠️ Via other tools |
| Learning curve | ⚠️ Moderate | 🔴 Steep | 🟢 Gentle | ⚠️ Moderate | 🔴 Steep |

And the winner is — Langfuse

We chose Langfuse because it hit the sweet spot for our needs. It lets us keep our data in-house (unlike Braintrust), has good monitoring tools, and makes managing experiments straightforward. It’s easier to set up than MLflow and more fully featured than Chainforge.
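
To give a taste of the integration (a hedged sketch using the v2-style Langfuse Python SDK; exact method names differ between SDK versions, and the trace names, model, and score values here are made up), each Journey AI generation becomes a trace, with guardrail and judge results attached to it as scores:

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the
# environment, so a self-hosted deployment is transparent to application code.
langfuse = Langfuse()

# One trace per Journey AI generation
trace = langfuse.trace(name="journey-generation", input={"source": "customer-interviews"})

trace.generation(
    name="generate-journey",
    model="gpt-4o",                    # illustrative
    input="<prompt sent to the model>",
    output="<generated journey map>",
)

# Attach guardrail and LLM-as-Judge results to the same trace
trace.score(name="guardrail-pass", value=1.0)
trace.score(name="judge-clarity", value=4.0)

langfuse.flush()  # make sure events are sent before the process exits
```

Because every score hangs off a trace, it becomes straightforward to ask how a prompt change shifted judge scores across thousands of real generations.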

The ML flywheel: A continuous improvement loop

The most impactful result of our LLM evaluation system is the creation of a continuous improvement loop — what we call our ML Flywheel. This self-reinforcing system ensures that every iteration enhances AI quality and reliability. At its core, the flywheel is built on four key components:

  1. AI-generated content: Journey AI generates insights based on current prompts and models.

  2. User feedback: Users interact with the generated content, offering explicit (comments, ratings) and implicit (engagement patterns) feedback.

  3. AI evaluation: Automated guardrails and LLM-as-Judge tools assess quality, correlating results with real user feedback.

  4. Better prompts: Evaluation insights feed directly into prompt improvements, refining future AI outputs.

This creates a virtuous cycle: better prompts → better AI outputs → better user feedback → stronger evaluation data → better prompts.
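
To make the loop a little more tangible, here's a deliberately simplified sketch of how the evaluation signals might be aggregated to decide whether a prompt version needs another turn of the flywheel. Every field name and threshold below is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class GenerationRecord:
    """Signals collected for one AI-generated journey (illustrative fields only)."""
    guardrails_passed: bool    # automated rule checks
    judge_score: float         # LLM-as-Judge rating, e.g. 1-5
    user_rating: float | None  # explicit user feedback, when available


def should_revise_prompt(records: list[GenerationRecord]) -> bool:
    """Hypothetical policy: revise the prompt when any quality signal drops."""
    if not records:
        return False

    guardrail_rate = sum(r.guardrails_passed for r in records) / len(records)
    avg_judge = sum(r.judge_score for r in records) / len(records)
    ratings = [r.user_rating for r in records if r.user_rating is not None]
    avg_user = sum(ratings) / len(ratings) if ratings else None

    return (
        guardrail_rate < 0.95
        or avg_judge < 3.5
        or (avg_user is not None and avg_user < 3.5)
    )
```

The point is not the thresholds but the wiring: evaluation results become an input to prompt work rather than a report that gets read once.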

Lessons learned: What it takes to get AI right

Through this process, we uncovered several key insights:

  1. Combine human + AI evaluation

    • Rule-based checks catch obvious errors, while LLM-as-Judge handles subjective quality assessments.

  2. Validate against real feedback

    • AI evaluations must align with actual user reactions to remain meaningful (a small correlation check is sketched after this list).

  3. Customize for your use case

    • Generic benchmarks aren’t enough — evaluators must target your specific failure modes.

  4. Test at multiple levels

    • Some problems only appear with real-world usage, requiring both development-time and production-time testing.

  5. Make evaluation an integral part of AI development

    • Continuous feedback loops, not just final validation, drive real AI quality improvement.
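
Picking up on lesson 2, the validation can be as simple as checking that judge scores and user ratings move together. Here's a small sketch using Spearman rank correlation; the numbers and the 0.5 threshold are purely illustrative:

```python
from scipy.stats import spearmanr

# Paired observations for the same generated journeys (illustrative data).
judge_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]
user_ratings = [5, 3, 4, 2, 5, 4]

correlation, p_value = spearmanr(judge_scores, user_ratings)
print(f"Spearman correlation: {correlation:.2f} (p = {p_value:.3f})")

# A weak or negative correlation means the judge's rubric needs rework
# before its scores can be trusted to steer prompt changes.
if correlation < 0.5:
    print("LLM-as-Judge does not track user feedback closely enough yet.")
```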

What’s next?

The AI flywheel doesn’t stop spinning. Our next steps include:

  • Automating prompt iteration based on historical user feedback and LLM-as-Judge scores.

  • Strengthening AI-to-user feedback loops to accelerate improvements.

  • Exploring new evaluation tools like DSPy for more sophisticated prompt tuning.

By committing to a systematic, AI-driven approach to quality, TheyDo ensures that every AI-generated journey insight meets the highest standards — helping businesses make better decisions, faster. Stay tuned for the next installment of our AI flywheel series, where we’ll take a deeper dive into Langfuse and how it powers our AI-driven insights.

Ready to see Journey AI in action?

Discover how TheyDo's Journey AI can transform your customer data into clear, actionable insights. Start your free trial now.