Building the AI quality flywheel: How TheyDo turns user feedback into better AI

Chris Swart · Senior Machine Learning Engineer

Businesses are buzzing about AI and its potential to revolutionize everything from operations and decision-making to the bottom line. At TheyDo, we’re on board — but we didn’t just drink the Kool-Aid. We put AI to work where it matters.

Our Journey AI transforms unstructured customer data into actionable insights, revealing hidden opportunities and pain points across user journeys. We process multiple input sources to generate valuable outputs that enterprises rely on for strategic decisions. But those insights are only as good as the quality of the AI behind them. A journey riddled with duplicate steps, vague insights, or unnecessary complexity doesn’t just frustrate users — it erodes trust. And in AI, trust is everything.

That’s why we built a structured, self-improving system that turns AI evaluation from a black box into a quality flywheel, ensuring every insight helps you make better decisions, faster — and with confidence.

Here’s how we did it:

The quality challenge: Cracking the measurement code

AI-generated content must be accurate, structured, and meaningful — but ensuring consistent quality at scale is anything but straightforward. AI outputs can vary unpredictably, and what looks good on the surface may fail to provide real value.

So, how do you measure and maintain the quality of AI-generated content at scale?

To tackle this challenge, we combined multiple approaches to ensure AI-generated content meets the highest standards. By leveraging user feedback, automated guardrails, and AI-driven evaluation, we created a system that continuously refines and improves itself. Here’s how each method plays a critical role:

1. User feedback: The gold standard (but too slow)

The best feedback comes directly from our users. When they tell us, “This journey is amazing” or “These steps are confusing”, we get powerful qualitative insights. The problem? Feedback is sporadic, subjective, and arrives too late to prevent poor experiences.

2. Automated guardrails: Catching obvious errors

We implemented automated tests to flag common AI missteps:

  • Duplicate steps or phases

  • Excessive step counts

  • Placeholder text in outputs 

These guardrails acted as a first line of defense, preventing glaring errors before they reached users. But they couldn’t evaluate more subjective elements like clarity, coherence, or value.
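
To make this concrete, here's a minimal sketch of what rule-based guardrails like these can look like. The `JourneyDraft` shape, thresholds, and placeholder markers are illustrative assumptions, not our production code:

```python
from dataclasses import dataclass

# Illustrative thresholds and markers; the real guardrails cover more cases.
MAX_TOTAL_STEPS = 40
PLACEHOLDER_MARKERS = ("lorem ipsum", "tbd", "[placeholder]", "<insert")


@dataclass
class JourneyDraft:
    phases: list[str]  # phase titles, in order
    steps: list[str]   # step titles, in order


def guardrail_violations(draft: JourneyDraft) -> list[str]:
    """Return human-readable reasons to reject the draft before users see it."""
    violations: list[str] = []

    # 1. Duplicate steps or phases
    if len({s.strip().lower() for s in draft.steps}) < len(draft.steps):
        violations.append("duplicate steps detected")
    if len({p.strip().lower() for p in draft.phases}) < len(draft.phases):
        violations.append("duplicate phases detected")

    # 2. Excessive step counts
    if len(draft.steps) > MAX_TOTAL_STEPS:
        violations.append(f"too many steps ({len(draft.steps)} > {MAX_TOTAL_STEPS})")

    # 3. Placeholder text left in outputs
    for step in draft.steps:
        if any(marker in step.lower() for marker in PLACEHOLDER_MARKERS):
            violations.append(f"placeholder text in step: {step!r}")

    return violations
```

Checks like these are cheap enough to run on every generation, which is what makes them useful as a first line of defense before heavier evaluation kicks in.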

3. LLM-as-Judge: Scaling AI evaluation

To address the shortfalls of our first two methods, we introduced an innovative approach: using AI to evaluate AI. We built an LLM-as-Judge system, leveraging large language models (LLMs) to assess the quality of AI-generated journey maps. This allowed us to:

  • Correlate AI-generated scores with real user feedback

  • Scale evaluation beyond what human reviewers could handle

  • Understand how model tweaks impacted performance
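
To illustrate the idea (a hedged sketch, not our actual prompts, schema, or judge model), an LLM-as-Judge evaluator can be as simple as asking a strong model to score a generated journey against a rubric and return structured scores:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating an AI-generated customer journey map.
Score each criterion from 1 (poor) to 5 (excellent) and answer with JSON only:
{{"clarity": int, "coherence": int, "actionability": int, "rationale": str}}

Journey map:
{journey}
"""


def judge_journey(journey_text: str) -> dict:
    """Ask a judge LLM to score a generated journey against a simple rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(journey=journey_text)}],
    )
    return json.loads(response.choices[0].message.content)
```

Because the judge returns numeric scores, its output can be logged next to real user feedback and checked for agreement, which is what keeps the judge itself honest.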

The technical approach

To effectively evaluate and improve our AI-generated outputs, we needed an LLM evaluation framework that met five key requirements:

  • Self-hostable for data privacy

  • Robust production monitoring to track performance

  • Comprehensive API/SDK for seamless integration

  • Visual prompt comparison tools for testing and refinement

  • Experiment management for prompt versioning and iteration

Evaluating our options

We assessed five potential solutions:

  • Braintrust (Closed-source LLM engineering platform)

  • MLflow (Open-source MLOps platform)

  • Chainforge (Visual prompt evaluator)

  • Langfuse (Open-source LLM engineering platform)

  • A custom-built solution on our Honeycomb infrastructure

Feature comparison

| Key feature | Braintrust | MLflow | Chainforge | Langfuse | Custom solution |
| --- | --- | --- | --- | --- | --- |
| Self-hosting | | | | | |
| Ease of setup | ⚠️ Moderate | 🔴 Complex | 🟢 Simple | ⚠️ Moderate | 🔴 Complex |
| Comprehensive API | | | ⚠️ Limited | | |
| Visual prompt tools | | | | | ⚠️ Requires dev |
| Evaluation capabilities | ✅ Advanced | ✅ Advanced | ⚠️ Basic | ✅ Advanced | ⚠️ Custom dev |
| Prompt management | | | | | ⚠️ Requires dev |

Detailed comparison (full 24-criteria evaluation):

| LLM Eval Tools | Braintrust | MLflow | Chainforge | Langfuse | Custom Solution (Honeycomb) |
| --- | --- | --- | --- | --- | --- |
| Core Capabilities | | | | | |
| Self-hosting | | | | | |
| UI for visualizations | | | | | ⚠️ Requires development |
| Programmatic API/SDK | | | ⚠️ Limited | | |
| Production monitoring | | | ⚠️ Limited | | |
| Open source | ⚠️ Only auto-evals | | | | |
| Evaluation Methods | | | | | |
| LLM-as-a-Judge | | | ⚠️ Basic | | ⚠️ Requires implementation |
| Heuristic metrics | | | ⚠️ Limited | | ⚠️ Requires implementation |
| Statistical metrics | | | ⚠️ Limited | | ⚠️ Requires implementation |
| Experiment Management | | | | | |
| Prompt versioning | | | | | ⚠️ Requires implementation |
| A/B testing | | | | | ⚠️ Requires implementation |
| Experiment tracking | | | ⚠️ Limited | | ⚠️ Via Honeycomb |
| Dataset management | | | | | ⚠️ Requires implementation |
| Observability | | | | | |
| Tracing | | | | | |
| Metrics collection | | | ⚠️ Limited | | |
| Cost tracking | | | ⚠️ Limited | | |
| Latency monitoring | | | | | |
| Integration Capabilities | | | | | |
| Multiple LLM providers | | | | | |
| Integration with RAG | | | ⚠️ Limited | | ⚠️ Requires implementation |
| User Experience | | | | | |
| Setup complexity | ⚠️ Moderate | 🔴 High | 🟢 Low | ⚠️ Moderate | 🔴 High |
| Visual prompt comparison | | | | | ⚠️ Requires development |
| Team collaboration | | | ⚠️ Limited | | ⚠️ Via other tools |
| Learning curve | ⚠️ Moderate | 🔴 Steep | 🟢 Gentle | ⚠️ Moderate | 🔴 Steep |

And the winner is — Langfuse

We chose Langfuse because it hit the sweet spot for our needs. It lets us keep our data in-house (unlike Braintrust), has good monitoring tools, and makes managing experiments straightforward. It’s easier to set up than MLflow and more fully featured than Chainforge.
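
To give a taste of the integration (a hedged sketch using the v2-style Langfuse Python SDK; exact method names differ between SDK versions, and the trace names, model, and score values here are made up), each Journey AI generation becomes a trace, with guardrail and judge results attached to it as scores:

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the
# environment, so a self-hosted deployment is transparent to application code.
langfuse = Langfuse()

# One trace per Journey AI generation
trace = langfuse.trace(name="journey-generation", input={"source": "customer-interviews"})

trace.generation(
    name="generate-journey",
    model="gpt-4o",                    # illustrative
    input="<prompt sent to the model>",
    output="<generated journey map>",
)

# Attach guardrail and LLM-as-Judge results to the same trace
trace.score(name="guardrail-pass", value=1.0)
trace.score(name="judge-clarity", value=4.0)

langfuse.flush()  # make sure events are sent before the process exits
```

Because every score hangs off a trace, it becomes straightforward to ask how a prompt change shifted judge scores across thousands of real generations.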

The ML flywheel: A continuous improvement loop

The most impactful result of our LLM evaluation system is the creation of a continuous improvement loop — what we call our ML Flywheel. This self-reinforcing system ensures that every iteration enhances AI quality and reliability. At its core, the flywheel is built on four key components:

  1. AI-generated content: Journey AI generates insights based on current prompts and models.

  2. User feedback: Users interact with the generated content, offering explicit (comments, ratings) and implicit (engagement patterns) feedback.

  3. AI evaluation: Automated guardrails and LLM-as-Judge tools assess quality, correlating results with real user feedback.

  4. Better prompts: Evaluation insights feed directly into prompt improvements, refining future AI outputs.

This creates a virtuous cycle: better prompts → better AI outputs → better user feedback → stronger evaluation data → better prompts.
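
To make the loop a little more tangible, here's a deliberately simplified sketch of how the evaluation signals might be aggregated to decide whether a prompt version needs another turn of the flywheel. Every field name and threshold below is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class GenerationRecord:
    """Signals collected for one AI-generated journey (illustrative fields only)."""
    guardrails_passed: bool    # automated rule checks
    judge_score: float         # LLM-as-Judge rating, e.g. 1-5
    user_rating: float | None  # explicit user feedback, when available


def should_revise_prompt(records: list[GenerationRecord]) -> bool:
    """Hypothetical policy: revise the prompt when any quality signal drops."""
    if not records:
        return False

    guardrail_rate = sum(r.guardrails_passed for r in records) / len(records)
    avg_judge = sum(r.judge_score for r in records) / len(records)
    ratings = [r.user_rating for r in records if r.user_rating is not None]
    avg_user = sum(ratings) / len(ratings) if ratings else None

    return (
        guardrail_rate < 0.95
        or avg_judge < 3.5
        or (avg_user is not None and avg_user < 3.5)
    )
```

The point is not the thresholds but the wiring: evaluation results become an input to prompt work rather than a report that gets read once.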

Lessons learned: What it takes to get AI right

Through this process, we uncovered several key insights:

  1. Combine human + AI evaluation

    • Rule-based checks catch obvious errors, while LLM-as-Judge handles subjective quality assessments.

  2. Validate against real feedback

    • AI evaluations must align with actual user reactions to remain meaningful (a small correlation check is sketched after this list).

  3. Customize for your use case

    • Generic benchmarks aren’t enough — evaluators must target your specific failure modes.

  4. Test at multiple levels

    • Some problems only appear with real-world usage, requiring both development-time and production-time testing.

  5. Make evaluation an integral part of AI development

    • Continuous feedback loops, not just final validation, drive real AI quality improvement.
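
Picking up on lesson 2, the validation can be as simple as checking that judge scores and user ratings move together. Here's a small sketch using Spearman rank correlation; the numbers and the 0.5 threshold are purely illustrative:

```python
from scipy.stats import spearmanr

# Paired observations for the same generated journeys (illustrative data).
judge_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]
user_ratings = [5, 3, 4, 2, 5, 4]

correlation, p_value = spearmanr(judge_scores, user_ratings)
print(f"Spearman correlation: {correlation:.2f} (p = {p_value:.3f})")

# A weak or negative correlation means the judge's rubric needs rework
# before its scores can be trusted to steer prompt changes.
if correlation < 0.5:
    print("LLM-as-Judge does not track user feedback closely enough yet.")
```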

What’s next?

The AI flywheel doesn’t stop spinning. Our next steps include:

  • Automating prompt iteration based on historical user feedback and LLM-as-Judge scores.

  • Strengthening AI-to-user feedback loops to accelerate improvements.

  • Exploring new evaluation tools like DSPy for more sophisticated prompt tuning.

By committing to a systematic, AI-driven approach to quality, TheyDo ensures that every AI-generated journey insight meets the highest standards — helping businesses make better decisions, faster. Stay tuned for the next installment of our AI flywheel series, where we’ll take a deeper dive into Langfuse and how it powers our AI-driven insights.

Ready to see Journey AI in action?

Discover how TheyDo's Journey AI can transform your customer data into clear, actionable insights. Start your free trial now.