Smarter prompts, better bots: Supercharging RAG with LMQL
At PyData Germany 2025 in Darmstadt, I shared some work I’ve been doing with one of the most underrated tools in the LLM space right now: LMQL. My talk explored how structured generation with LMQL can boost retrieval-augmented generation (RAG), especially when working with smaller, open-source models.
Wait, what’s LMQL?
LMQL (Language Model Query Language) is a domain-specific language that lets you take control of language model output in ways that go way beyond traditional prompting.
Instead of hoping your prompt gets the right result, LMQL lets you write structured “queries” that combine:
Prompt templates
Output constraints (like regex or token length)
Python logic (loops, conditionals, function calls)
It’s kind of like turning prompt engineering into real programming.
Some of the things LMQL makes easy:
Generating structured outputs like Pydantic or dataclass objects
Enforcing formatting rules with regex (see the short sketch after this list)
Adding logic and control flow to prompts
Calling APIs and using tools inside your query
Rapid iteration on prompt design
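To make the regex point concrete, here is a minimal sketch of my own (not from the talk), in the style of the LMQL documentation: the REGEX constraint restricts decoding so the DATE variable can only take a DD/MM shape, using LMQL's default model configuration.

import lmql

@lmql.query
async def turing_birthday():
    '''lmql
    "Q: On what day and month was Alan Turing born?\n"
    # REGEX constrains decoding so DATE can only match the given pattern
    "A (DD/MM): [DATE]\n" where REGEX(DATE, r"[0-9]{2}/[0-9]{2}")
    '''

result = await turing_birthday()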
A practical example: wiki + LMQL + API call
During my talk, I walked through a simple Wikipedia retriever example that shows how LMQL can guide generation and make an external API call, all in a single query:
@lmql.query(**lmql_kwargs)
async def norse_origins():
    '''lmql
    "Q: From which countries did the Norse originate?\n"
    "Action: Let us search Wikipedia for the term '[TERM]\n" where STOPS_AT(TERM, "'") and len(TOKENS(TERM)) < 50
    wiki_result = await fetch_wikipedia(TERM)
    "Result: {wiki_result}\n"
    "Final Answer:[ANSWER]" where len(TOKENS(ANSWER)) < 50 and STOPS_AT(ANSWER, ".")
    '''
result = await norse_origins()
This query gets the model to suggest a search term, grabs a result from Wikipedia, and makes sure the final answer stays short and properly formatted.
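The fetch_wikipedia helper isn't part of LMQL, and the talk's version isn't shown here; as a hedged sketch, it could hit the public Wikipedia search API with httpx and hand back a short snippet:

import httpx

async def fetch_wikipedia(term: str, max_chars: int = 500) -> str:
    # TERM stops at the closing quote, so strip it before searching.
    term = term.strip("'")
    params = {"action": "query", "list": "search", "srsearch": term, "format": "json"}
    async with httpx.AsyncClient() as client:
        resp = await client.get("https://en.wikipedia.org/w/api.php", params=params)
        resp.raise_for_status()
        hits = resp.json()["query"]["search"]
    # Return a trimmed snippet of the top hit to keep the prompt short.
    return hits[0]["snippet"][:max_chars] if hits else "No results found."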
LMQL gets even more powerful with structured output
You can also use LMQL to generate fully typed Python objects. Here's an example using Python dataclasses:
from dataclasses import dataclass

@dataclass
class Employer:
    employer_company: str
    location: str

@dataclass
class Person:
    name: str
    age: int
    employer: Employer
    job: str

@lmql.query(**lmql_kwargs)
async def pydantic_gen():
    '''lmql
    "Chris is a 31-year-old and works as an ML engineer at TheyDo in Nuremberg, Germany.\n"
    "Structured: [PERSON_DATA]\n" where type(PERSON_DATA) is Person and len(TOKENS(PERSON_DATA)) < 200
    "His name is {PERSON_DATA.name} and he works in {PERSON_DATA.employer.location}."
    '''
result = await pydantic_gen()
The result? You get structured, type-safe data that’s ready to plug into your app logic, with no post-processing hacks required.
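And because it's a plain dataclass instance (constructed by hand below purely for illustration), downstream code can treat it like any other Python object:

from dataclasses import asdict
import json

# Hypothetical follow-up: serialize the typed result for an API response or a database insert.
person = Person(name="Chris", age=31, employer=Employer("TheyDo", "Nuremberg, Germany"), job="ML engineer")
print(json.dumps(asdict(person), indent=2))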
How do we know it works? Meet Ragas.
To evaluate how well LMQL works in a basic RAG setup, I ran tests using Ragas, a toolkit for scoring RAG systems. I focused on three key metrics:
Context Recall – Did the retriever surface the right information?
Factual Correctness – Is the answer accurate compared to the ground truth?
Faithfulness – Does the answer actually come from the retrieved context?
Here’s a quick overview of what each metric looks at:
Factual Correctness: Measures overlap between the generated answer and ground truth. Hallucinations and omissions both reduce the score.
Context Recall: Checks whether the ground truth answer is supported by the retrieved context. If it isn’t, the retriever may be the weak link.
Faithfulness: Compares the generated answer to the context. If the model makes claims that are not grounded in the retrieved material, it’s hallucinating.
Both context recall and faithfulness use a similar technique (matching against retrieved context), but focus on different things:
Context recall = "Did we retrieve the right stuff to answer the question?"
Faithfulness = "Did the model stick to what we retrieved?"
And here’s why I especially like faithfulness: it doesn’t require labeled ground truth. You only need the model’s answer and the context to evaluate how truthful the output is.
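For orientation, here's a rough sketch of how that evaluation can be wired up with Ragas. Treat it as an outline rather than copy-paste code: metric class names and dataset fields shift between Ragas versions, and the metrics themselves need an evaluator LLM configured.

from ragas import evaluate, EvaluationDataset
from ragas.metrics import LLMContextRecall, FactualCorrectness, Faithfulness

# One row per question: the generated answer, the retrieved context, and a reference answer.
dataset = EvaluationDataset.from_list([
    {
        "user_input": "From which countries did the Norse originate?",
        "response": "The Norse originated from Scandinavia.",                    # pipeline output
        "retrieved_contexts": ["The Norsemen were a North Germanic people..."],  # retriever output
        "reference": "The Norse originated from Denmark, Norway and Sweden.",
    },
])

scores = evaluate(dataset, metrics=[LLMContextRecall(), FactualCorrectness(), Faithfulness()])
print(scores)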
Testing the performance gap
To keep things simple, I used a very basic setup: keyword search on Wikipedia, reranked by the LLM, and trimmed to the first 500 characters.
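As a rough illustration of that rerank step (my sketch, not the exact query from the talk), an LMQL query can be constrained to answer with nothing but an integer index into the candidate list; lmql_kwargs stands for the same model configuration as in the earlier snippets:

@lmql.query(**lmql_kwargs)
async def rerank(question: str, snippets: list):
    '''lmql
    "Question: {question}\n"
    "Candidate passages:\n"
    for i in range(len(snippets)):
        passage = snippets[i][:500]  # keep only the first 500 characters
        "({i}) {passage}\n"
    "The number of the most relevant passage is [IDX]" where INT(IDX)
    '''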
Back in late 2024, I used this same setup for the Amnesty International QA dataset with GPT-4. For this talk, I wanted to see how a fully open-source model would stack up, so I swapped in SmolLM-1.7B, a tiny, 4-bit quantized model.
Here’s how they did:
GPT-4:
Context Recall: 0.33
Factual Correctness: 0.32
Faithfulness: 0.72
SmolLM-1.7B:
Context Recall: 0.12
Factual Correctness: 0.27
Faithfulness: 0.09
The biggest gap? Faithfulness. SmolLM was far more likely to hallucinate or make stuff up, even when the retriever gave it decent context. And while the other scores were lower too, faithfulness is the one that really makes or breaks reliability in a RAG system.
Why this matters
RAG only works if the model stays grounded in the retrieved data. If the model ignores the context, or worse, fabricates a confident-sounding answer, then retrieval becomes pointless.
In sensitive domains like healthcare, finance, or legal, that’s not just a technical issue. It’s a trust and safety problem.
Can LMQL help?
That’s where structured generation comes in.
With LMQL, you can reduce hallucination by constraining what the model is allowed to generate. You can limit token length, enforce patterns, filter results, or require structured formats. You can even build in logic to reject low-confidence completions or retry when outputs fall outside your defined structure.
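As one concrete pattern (my own illustration, not something from the talk), a thin validate-and-retry wrapper around any LMQL query keeps malformed completions out of downstream logic:

async def generate_with_retry(query_fn, is_valid, max_attempts: int = 3):
    # Re-run the query until its output passes a caller-supplied check,
    # e.g. required fields being present or a simple confidence heuristic.
    for _ in range(max_attempts):
        result = await query_fn()
        if is_valid(result):
            return result
    raise ValueError("No valid completion within the retry budget")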
I also touched briefly on other ideas, like DSPy, which uses student–teacher architectures to guide models more effectively. But regardless of the tool, the main challenges with smaller models remain:
Weak reasoning abilities
Overconfident hallucinations (“fake confidence”)
Limited context windows
What this means for open-source LLMs
There’s no denying the performance gap, but open-source models still bring real advantages:
Cost efficiency
Full control and customization
No data sharing with third parties
Potential for on-device deployment
The trick is compensating for their limitations, and LMQL is a great step in that direction. By structuring generation more tightly, we can make small models act a lot more like their bigger, closed-source cousins.
Final thoughts
If you’re building with open-source LLMs, structured generation can be a game-changer. This experiment showed that even with a basic RAG setup and a tiny model, LMQL gave me ways to rein things in, test systematically, and extract more reliable output.
As open-source models improve and our toolkits get smarter, the gap will keep closing. And with the right techniques, even “Smol” models can punch above their weight.
Want to dive deeper? Check out the full PyData talk here.