When should you consider using RAG, and when is a plain LLM call enough?
One of the easiest mistakes in AI app development is adding too much complexity too early.
You have an idea for a chatbot. You read about embeddings, vector databases, retrieval pipelines, chunking strategies, reranking, evals, and suddenly the “simple chatbot” has turned into an information retrieval system with a personality bolted on top.
Sometimes that is exactly the right architecture.
Often, it is not.
In a hobby project or a small app, the right first question is usually not “How do I add RAG?” but “What does my chatbot actually need to know that the base model does not already know?”
That distinction matters more than most people think.
Why build your own chatbot at all?
Because it is one of the most educational and creatively rewarding things you can build.
A custom chatbot sits at a very interesting intersection: product design, systems engineering, UX writing, and applied AI. It is not just about getting a model to respond. It is about shaping an experience.
You get to decide what the assistant feels like, how it speaks, what it should never say, how much context it remembers, when it should be humble (or whatever other personality traits you want it to manifest), when it should ask a question, and what kind of product it becomes in someone’s hands.
That is the fun part.
The useful part is that a custom chatbot teaches you where the real complexity lives. It is rarely in calling the model API. It is in everything around it: tone, constraints, retrieval, grounding, failure modes, cost, and latency.
With Boon, the mascot developed for the app Sathu, the goal was not simply to build “a chatbot.” It was to build a very specific presence: a warm, humble, young monk-in-training who could speak about Thai Buddhist culture and practice without sounding like a search engine or an academic textbook.

That immediately made one thing clear:
Personality and retrieval are different systems.
The model’s voice comes from prompt design. The factual grounding comes from retrieval, if needed.
Those should not be confused.
Start with a plain LLM
If your app mostly needs:
- good conversation
- a strong tone or personality
- basic reasoning
- summarization
- rewriting
- lightweight coaching
- structured output from the user’s own latest message
…then you often do not need RAG yet.
A plain model call is dramatically simpler to build, debug, and tune. You can move faster, learn faster, and see whether the product is even interesting before you build a retrieval layer.
This is especially true if the value of the app is not “having proprietary facts,” but rather “how the assistant behaves.”
That was true for Boon too. A lot of the product feel came from the system prompt, not only from retrieval. The prompt defined language behavior, politeness, humility, answer length, and even what kinds of internal mechanics should never be exposed to the user.
That alone gets you surprisingly far.
Add RAG when the model needs external knowledge
RAG becomes worth it when the assistant must reliably answer with knowledge that is not safely resident in the base model. For the Sathu app with Boon, the particular blend of Buddhism and Thai culture could hardly be trusted to pure LLM responses alone.
We had to go a layer deeper in how to design a suitable chatbot for this product.
These are general things to evaluate for your own product when facing the same question:
First, you have private or domain-specific content. Maybe product docs, internal notes, or structured knowledge that is yours.
Second, you need freshness or specificity. The model may know about your field in general, but not your exact temple text, your exact wording, or your curated source material.
Third, you want consistency around a bounded knowledge set. Not just plausible answers. Answers that are anchored in material you selected.
That is where retrieval starts to earn its keep.
The wrong reason to add RAG is that it feels “more advanced.” RAG is not a status upgrade. It is a memory and knowledge mechanism. If you do not have a real knowledge problem, it quickly becomes just extra moving parts.

What a lightweight RAG implementation can look like
The nice thing is that “RAG” does not have to mean “huge architecture.”
In Boon, the retrieval path was intentionally lightweight.
We used:
- gpt-5-mini for chat
- text-embedding-3-small for embeddings
- Supabase plus pgvector for vector storage
- a simple SQL RPC for matching chunks
- cosine similarity with a small gating rule before retrieval context was used
The rag.chunks table looked something like:
- chunk text
- token count
- a vector(1536) embedding
- language
- topic
- optional temple id
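As a rough sketch, a row of that table maps to a TypeScript shape like the one below. The field names here are illustrative, not the actual Sathu schema:

```ts
// Illustrative shape of a row in rag.chunks.
// Field names are assumptions, not the real Sathu column names.
interface RagChunk {
  id: number;
  content: string;     // the chunk text itself
  tokenCount: number;  // useful for budgeting prompt context
  embedding: number[]; // vector(1536) from text-embedding-3-small
  lang: string;        // e.g. "en" or "th"
  topic: string;       // coarse topic label used for filtering
  templeId?: string;   // optional temple id for temple-specific chunks
}
```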
On lookup, we embedded the user message, queried the nearest chunks, filtered by fields like lang, topic, and temple_id, and pulled back the top matches. In our case, the default was topK = 6, capped at 8.
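In code, that lookup can stay small. Here is a minimal sketch, assuming a Supabase RPC named match_chunks that runs the cosine-similarity query; the RPC name and its parameters are my assumptions, not the actual Boon function:

```ts
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Embed the user message, then ask Postgres for the nearest chunks.
// "match_chunks" is an assumed RPC name; filters like lang, topic, and
// temple_id would be passed as additional RPC parameters.
async function retrieveChunks(message: string, topK = 6) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: message,
  });

  const { data, error } = await supabase.rpc("match_chunks", {
    query_embedding: res.data[0].embedding,
    match_count: Math.min(topK, 8), // default topK = 6, capped at 8
  });
  if (error) throw error;
  return data; // rows with chunk text plus a cosine similarity score
}
```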
Then came the important part: not blindly stuffing retrieved text into every answer.
We only used RAG if two conditions were true:
- the message was long enough to be worth embedding
- the best retrieved match crossed a minimum similarity threshold
In the Boon function, those gates were very simple:
- minimum message length for RAG: 8 characters
- minimum max similarity to use RAG: 0.40
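Expressed in code, the two gates are a handful of lines. A sketch with the same thresholds (the constant and function names are mine):

```ts
const MIN_MESSAGE_LENGTH = 8;    // gate 1: is the message worth embedding?
const MIN_MAX_SIMILARITY = 0.40; // gate 2: is the best match good enough?

// Checked before spending an embedding call.
function worthEmbedding(message: string): boolean {
  return message.trim().length >= MIN_MESSAGE_LENGTH;
}

// Checked after retrieval, before any chunk reaches the prompt.
function matchesAreUsable(scores: number[]): boolean {
  return scores.length > 0 && Math.max(...scores) >= MIN_MAX_SIMILARITY;
}
```

Note that the gates run at different moments: the length check saves an embedding call entirely, while the similarity check throws away retrievals that came back weak.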
That is not magic. It is just a pragmatic first version.
When should you embed the user’s message?
This is where hobby projects often overcomplicate things.
You do not need a perfect classifier on day one. You need a decent heuristic.
A good first rule is: do not retrieve for messages that are obviously too short or too generic.
“Hi.” “Thanks.” “How are you?” “Tell me more.”
These rarely justify an embedding lookup.
In Boon, the first gate was simply message length. If the message was under 8 characters, we skipped RAG entirely. That alone removes a lot of waste.
After that, I would think in terms of intent.
Messages that probably deserve retrieval:
- specific factual questions
- references to named places, entities, or concepts in your dataset
- “what is the history of…”
- domain-specific follow-ups
Messages that often do not:
- pure social back-and-forth
- generic conversational filler
- questions about “general information”
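If you want one step beyond a raw length check, even a crude keyword heuristic goes a long way. A sketch, where the keyword lists are invented for illustration and should be tuned to your own domain:

```ts
// A deliberately crude intent heuristic: retrieve only when the message
// looks like a factual or domain question. Keyword lists are illustrative.
const FACTUAL_HINTS = /\b(what|who|when|where|why|how|history|meaning|temple|ritual)\b/i;
const SMALL_TALK = /^(hi|hey|hello|thanks|thank you|ok|tell me more)\b/i;

function looksRetrievalWorthy(message: string): boolean {
  const text = message.trim();
  if (SMALL_TALK.test(text)) return false;
  return FACTUAL_HINTS.test(text) || text.length > 60;
}
```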
At hobby scale, embedding every eligible message is often fine. The bigger question is not whether you can afford it. It is whether retrieval is helping answer quality.
That is why debug metrics matter. In Boon, our function tracked things like:
- chunksUsed
- avgScore
- maxScore
- whether useRag was actually triggered
That is the beginning of an eval loop.
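Collecting those takes a few lines. A sketch of the kind of per-request debug object the function can emit, following the field names above:

```ts
// Per-request retrieval debug info, logged alongside the response.
interface RagDebug {
  useRag: boolean;    // did the gates actually fire?
  chunksUsed: number; // how many chunks made it into the prompt
  avgScore: number;   // average similarity of the used chunks
  maxScore: number;   // best similarity seen this turn
}

function buildRagDebug(useRag: boolean, scores: number[]): RagDebug {
  return {
    useRag,
    chunksUsed: scores.length,
    avgScore: scores.length ? scores.reduce((a, b) => a + b, 0) / scores.length : 0,
    maxScore: scores.length ? Math.max(...scores) : 0,
  };
}
```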
If low-score retrieval keeps firing, your threshold is too low. If good factual questions never get grounded, your threshold is too high. If short conversational turns keep getting embedded, your gating is too loose.
How to bake retrieved knowledge into the final answer
This part is subtle.
The goal is not to make the model quote chunks. The goal is to let the model absorb them and then answer naturally in character.
RAG should improve recall, not leak implementation.
A useful mental model is this:
The system prompt defines who the assistant is. The retrieved chunks define what the assistant can privately draw from. The final answer should sound like one coherent mind.
If those layers are not aligned, the result feels stitched together. You get responses that suddenly sound like pasted documentation, or worse, responses that mention “the source material says…” and break the illusion completely.
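In practice, alignment mostly comes down to how the chunks are framed in the prompt. Here is a minimal sketch of one way to do it; the instruction wording is mine, not the actual Boon system prompt:

```ts
// Fold retrieved chunks into the conversation as private background
// knowledge, with an explicit instruction not to expose the machinery.
// The framing text is illustrative, not the real Boon prompt.
function buildMessages(systemPrompt: string, chunks: string[], userMessage: string) {
  const context = chunks.length
    ? `Background knowledge you may quietly draw from. Never mention ` +
      `"sources", "documents", or "context" to the user:\n\n` +
      chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n")
    : "";

  return [
    { role: "system" as const, content: systemPrompt }, // who the assistant is
    ...(context ? [{ role: "system" as const, content: context }] : []),
    { role: "user" as const, content: userMessage },
  ];
}
```

Keeping the instruction next to the chunks, rather than buried in the main system prompt, makes it easier to tune how the assistant uses retrieved material without touching the personality.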
The practical rule
A chatbot is not better because it uses embeddings. It is better when it knows the right thing, says it in the right voice, and does so with the least amount of machinery needed.
Iterate on your personality.md and your system_prompt.md
Have fun!