Smarter prompts, better bots: Supercharging RAG with LMQL

Chris Swart · Senior Machine Learning Engineer

    At PyData Germany 2025 in Darmstadt, I shared some work I’ve been doing with one of the most underrated tools in the LLM space right now: LMQL. My talk explored how structured generation with LMQL can boost retrieval-augmented generation (RAG), especially when working with smaller, open-source models.

    Wait, what’s LMQL?

    LMQL (Language Model Query Language) is a domain-specific language that lets you take control of language model output in ways that go way beyond traditional prompting.

    Instead of hoping your prompt gets the right result, LMQL lets you write structured “queries” that combine:

    • Prompt templates

    • Output constraints (like regex or token length)

    • Python logic (loops, conditionals, function calls)

    It’s kind of like turning prompt engineering into real programming.

    Some of the things LMQL makes easy:

    • Generating structured outputs like Pydantic or dataclass objects

    • Enforcing formatting rules with regex (see the sketch after this list)

    • Adding logic and control flow to prompts

    • Calling APIs and using tools inside your query

    • Rapid iteration on prompt design
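
    To make the constraint idea concrete before the larger examples, here is a minimal sketch (not from the talk) of a query that forces the answer into a fixed date format using a regex constraint. It assumes the same `lmql_kwargs` model configuration used in the examples below:

    import lmql

    # Hypothetical example: REGEX pins the generated date to a YYYY-MM-DD shape,
    # so downstream code can parse it without any cleanup.
    @lmql.query(**lmql_kwargs)
    async def release_date():
        '''lmql
        "Q: On what date was Python 3.0 first released?\n"
        "A (YYYY-MM-DD): [DATE]" where REGEX(DATE, r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
        '''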

    A practical example: wiki + LMQL + API call

    During my talk, I walked through a simple Wikipedia retriever example that shows how LMQL can guide generation and make an external API call, all in a single query:

    import lmql

    # `lmql_kwargs` holds the model configuration (e.g. which model to decode with
    # and its sampling settings); `fetch_wikipedia` is an async helper that calls
    # the Wikipedia API (a sketch follows below).
    @lmql.query(**lmql_kwargs)
    async def norse_origins():
        '''lmql
        "Q: From which countries did the Norse originate?\n"
        "Action: Let us search Wikipedia for the term '[TERM]\n" where STOPS_AT(TERM, "'") and len(TOKENS(TERM)) < 50
        wiki_result = await fetch_wikipedia(TERM)
        "Result: {wiki_result}\n"
        "Final Answer:[ANSWER]" where len(TOKENS(ANSWER)) < 50 and STOPS_AT(ANSWER, ".")
        '''

    result = await norse_origins()

    This query gets the model to suggest a search term, grabs a result from Wikipedia, and makes sure the final answer stays short and properly formatted.
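
    The `fetch_wikipedia` helper isn't part of LMQL; in my setup it was simply an async function the query can await. A rough sketch against the public MediaWiki search API (the exact implementation from the talk may differ) could look like this:

    import httpx

    async def fetch_wikipedia(term: str, max_chars: int = 500) -> str:
        # Search Wikipedia for `term` and return the intro text of the top hit,
        # trimmed so it fits comfortably in a small model's context window.
        params = {
            "action": "query",
            "format": "json",
            "generator": "search",
            "gsrsearch": term,
            "gsrlimit": 1,
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
        }
        async with httpx.AsyncClient() as client:
            response = await client.get("https://en.wikipedia.org/w/api.php", params=params)
            response.raise_for_status()
            pages = response.json().get("query", {}).get("pages", {})
        if not pages:
            return "No Wikipedia result found."
        extract = next(iter(pages.values())).get("extract", "")
        return extract[:max_chars]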

    LMQL gets even more powerful with structured output

    You can also use LMQL to generate fully typed Python objects. Here's an example using Python dataclasses:

    import lmql
    from dataclasses import dataclass

    @dataclass
    class Employer:
        employer_company: str
        location: str

    @dataclass
    class Person:
        name: str
        age: int
        employer: Employer
        job: str

    # Uses the same `lmql_kwargs` model configuration as above.
    @lmql.query(**lmql_kwargs)
    async def pydantic_gen():
        '''lmql
        "Chris is a 31-year-old and works as an ML engineer at TheyDo in Nuremberg, Germany.\n"
        "Structured: [PERSON_DATA]\n" where type(PERSON_DATA) is Person and len(TOKENS(PERSON_DATA)) < 200
        "His name is {PERSON_DATA.name} and he works in {PERSON_DATA.employer.location}."
        '''

    result = await pydantic_gen()

    The result? You get structured, type-safe data that’s ready to plug into your app logic, with no post-processing hacks required.
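
    As a small aside (not from the talk): because PERSON_DATA is decoded into a real Person instance, ordinary attribute access and type hints work downstream without any parsing step. For example:

    def summarise(person: Person) -> str:
        # No JSON parsing, regex extraction, or schema validation is needed between
        # the LLM output and your application code.
        return (
            f"{person.name} ({person.age}) works as a {person.job} at "
            f"{person.employer.employer_company} in {person.employer.location}."
        )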

    How do we know it works? Meet Ragas.

    To evaluate how well LMQL works in a basic RAG setup, I ran tests using Ragas, a toolkit for scoring RAG systems. I focused on three key metrics:

    1. Context Recall – Did the retriever surface the right information?

    2. Factual Correctness – Is the answer accurate compared to the ground truth?

    3. Faithfulness – Does the answer actually come from the retrieved context?

    Here’s a quick overview of what each metric looks at:

    • Factual Correctness: Measures overlap between the generated answer and ground truth. Hallucinations and omissions both reduce the score.

    • Context Recall: Checks whether the ground truth answer is supported by the retrieved context. If it isn’t, the retriever may be the weak link.

    • Faithfulness: Compares the generated answer to the context. If the model makes claims that are not grounded in the retrieved material, it’s hallucinating.

    Both context recall and faithfulness use a similar technique (matching against retrieved context), but focus on different things:

    • Context recall = "Did we retrieve the right stuff to answer the question?"

    • Faithfulness = "Did the model stick to what we retrieved?"

    And here’s why I especially like faithfulness: it doesn’t require labeled ground truth. You only need the model’s answer and the context to evaluate how truthful the output is.
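
    For reference, wiring these three metrics up in Ragas is compact. A minimal sketch, assuming the Ragas 0.2-style API and an already-configured evaluator LLM wrapper (`evaluator_llm` below is such an assumption), looks roughly like this:

    from ragas import EvaluationDataset, evaluate
    from ragas.metrics import LLMContextRecall, FactualCorrectness, Faithfulness

    # One evaluation sample per question: what was asked, what was retrieved,
    # what the model answered, and a reference answer for recall/correctness.
    samples = [
        {
            "user_input": "From which countries did the Norse originate?",
            "retrieved_contexts": ["The Norse were a North Germanic people of Scandinavia ..."],
            "response": "The Norse originated in Scandinavia: Denmark, Norway and Sweden.",
            "reference": "The Norse originated in Scandinavia (modern Denmark, Norway and Sweden).",
        }
    ]

    dataset = EvaluationDataset.from_list(samples)
    scores = evaluate(
        dataset=dataset,
        metrics=[LLMContextRecall(), FactualCorrectness(), Faithfulness()],
        llm=evaluator_llm,  # judge LLM used by Ragas; configuration not shown here
    )
    print(scores)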

    Testing the performance gap

    To keep things simple, I used a very basic setup: keyword search on Wikipedia, reranked by the LLM, and trimmed to the first 500 characters.

    Back in late 2024, I used this same setup for the Amnesty International QA dataset with GPT-4. For this talk, I wanted to see how a fully open-source model would stack up, so I swapped in SmolLM-1.7B, a tiny, 4-bit quantized model.

    Here’s how they did:

    GPT-4:

    • Context Recall: 0.33

    • Factual Correctness: 0.32

    • Faithfulness: 0.72

    SmolLM-1.7B:

    • Context Recall: 0.12

    • Factual Correctness: 0.27

    • Faithfulness: 0.09

    The biggest gap? Faithfulness. SmolLM was far more likely to hallucinate, even when the retriever gave it decent context. And while the other scores were lower too, faithfulness is the one that really makes or breaks reliability in a RAG system.

    Why this matters

    RAG only works if the model stays grounded in the retrieved data. If the model ignores the context, or worse, fabricates a confident-sounding answer, then retrieval becomes pointless.

    In sensitive domains like healthcare, finance, or legal, that’s not just a technical issue. It’s a trust and safety problem.

    Can LMQL help?

    That’s where structured generation comes in.

    With LMQL, you can reduce hallucination by constraining what the model is allowed to generate. You can limit token length, enforce patterns, filter results, or require structured formats. You can even build in logic to reject low-confidence completions or retry when outputs fall outside your defined structure.
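
    As one concrete pattern (a hypothetical sketch, not code from the talk): wrap a constrained query in a small retry loop and fall back to an explicit refusal when nothing grounded comes back. Here `constrained_answer` stands in for an LMQL query like the ones above, and `is_grounded` for whatever grounding check you trust, such as simple keyword overlap with the retrieved context:

    async def answer_with_guardrails(question: str, context: str, max_retries: int = 3) -> str:
        # `constrained_answer` and `is_grounded` are placeholders for your own
        # LMQL query and grounding check: retry a few times and refuse rather
        # than return an answer that is not supported by the context.
        for _ in range(max_retries):
            answer = await constrained_answer(question, context)
            if is_grounded(answer, context):
                return answer
        return "I could not find an answer supported by the retrieved context."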

    I also touched briefly on other ideas, like DSPy, which uses student–teacher architectures to guide models more effectively. But regardless of the tool, the main challenges with smaller models remain:

    1. Weak reasoning abilities

    2. Overconfident hallucinations (“fake confidence”)

    3. Limited context windows

    What this means for open-source LLMs

    There’s no denying the performance gap, but open-source models still bring real advantages:

    • Cost efficiency

    • Full control and customization

    • No data sharing with third parties

    • Potential for on-device deployment

    The trick is compensating for their limitations, and LMQL is a great step in that direction. By structuring generation more tightly, we can make small models act a lot more like their bigger, closed-source cousins.

    Final thoughts

    If you’re building with open-source LLMs, structured generation can be a game-changer. This experiment showed that even with a basic RAG setup and a tiny model, LMQL gave me ways to rein things in, test systematically, and extract more reliable output.

    As open-source models improve and our toolkits get smarter, the gap will keep closing. And with the right techniques, even “Smol” models can punch above their weight.

    Want to dive deeper? Check out the full PyData talk here.
