The Two-Phase Query

A RAG query is two operations, glued together with a prompt:

  1. Retrieve. Embed the question (with task_type=RETRIEVAL_QUERY), then SELECT ... ORDER BY embedding <=> %s LIMIT K to get the K nearest chunks.
  2. Generate. Build a prompt that says: "Here is some context. Use only this context to answer the question. If the context doesn't contain the answer, say so." Send it to Gemini.

That's the whole game. Everything else (re-ranking, query rewriting, hybrid search) is optimization. Get the two-step version working first.

The Retrieval Function

# src/retrieve.py
"""Find the K most similar chunks to a question."""
 
from .db import get_conn
from .embeddings import embed
 
 
def retrieve(question: str, k: int = 5) -> list[dict]:
    """Return the K chunks closest to the question."""
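    # embed() takes a batch; unpack the single vector for our one question.
    [query_vec] = embed([question], task="RETRIEVAL_QUERY")
    # Serialize to pgvector's text literal, e.g. "[0.1234567,0.7654321,...]".
    query_str = "[" + ",".join(f"{x:.7f}" for x in query_vec) + "]"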
 
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(
            """
            SELECT
              id,
              source,
              chunk_index,
              content,
              embedding <=> %s AS distance
            FROM chunks
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_str, query_str, k),
        )
        rows = cur.fetchall()
 
    return [
        {
            "id": r[0],
            "source": r[1],
            "chunk_index": r[2],
            "content": r[3],
            "distance": float(r[4]),
        }
        for r in rows
    ]

A few things worth knowing:

  • <=> is cosine distance in pgvector. It pairs with the vector_cosine_ops operator class we built our HNSW index against. If you queried with <-> (L2) or <#> (negative inner product) instead, the planner could not use that index and would fall back to scanning every row.
  • The vector appears twice in the query — once in the SELECT (to return the distance) and once in the ORDER BY (to drive the index). Postgres won't dedupe that for you.
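  • Lower distance = more similar. Cosine distance ranges from 0 (identical direction) through 1 (orthogonal) to 2 (opposite). As a rough rule of thumb, anything under ~0.4 is usually relevant and anything over ~0.7 is usually noise; the exact cutoffs depend on the embedding model and your corpus.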
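To make the distance numbers concrete, here is a minimal pure-Python sketch of what <=> computes, plus the brute-force equivalent of ORDER BY ... LIMIT K. The names cosine_distance and top_k are illustrative and not part of pgvector or this project; the HNSW index produces approximately the same ordering without scoring every row.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours: score everything, sort ascending, keep k."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D vectors; real embeddings have hundreds of dimensions.
vectors = {
    "same": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
nearest = top_k([1.0, 0.0], vectors, k=2)
print(nearest)
```

Note that vector length does not matter: [2.0, 0.0] is at distance 0 from [1.0, 0.0] because only direction counts.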

Smoke Test the Retriever

# scripts/test_retrieve.py
from src.retrieve import retrieve
 
results = retrieve("What is retrieval-augmented generation?", k=3)
for r in results:
    print(f"[{r['distance']:.3f}] {r['source']}#{r['chunk_index']}")
    print(r['content'][:200], "\n")

Run it:

uv run python -m scripts.test_retrieve

You should see three results, each with a distance, source filename, chunk index, and a preview. Eyeball the previews — they should look topically relevant. If they don't, your ingest data probably doesn't cover the question yet. Embed more text.

The Generation Step

Now the LLM. We use google-genai because it works against Vertex AI without a separate API key: the same Application Default Credentials (ADC) that talk to Cloud SQL also authenticate Gemini.

# src/generate.py
"""Ask Gemini to answer a question using only the provided context."""
 
from google import genai
from google.genai.types import GenerateContentConfig
 
from .config import load_config
 
_cfg = load_config()
_client = genai.Client(vertexai=True, project=_cfg.project, location=_cfg.region)
 
MODEL = "gemini-flash-latest"
 
SYSTEM_PROMPT = """You answer questions from the provided context.
Rules:
- Use only the context. Do not use outside knowledge.
- If the context does not contain the answer, reply exactly: "I don't know based on the provided documents."
- Quote short phrases from the context when helpful.
- Cite sources at the end as: Sources: <filename>#<chunk_index>, ...
"""
 
 
def generate_answer(question: str, chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"[{c['source']}#{c['chunk_index']}]\n{c['content']}" for c in chunks
    )
 
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
 
    response = _client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.1,
        ),
    )
    return response.text or "I don't know based on the provided documents."

Why these choices:

  • gemini-flash-latest: fast and cheap, well suited to RAG, where the model is synthesizing retrieved text rather than reasoning deeply. Swap in gemini-pro-latest later if you want longer or sharper answers and don't mind paying roughly 10× more per query.
  • temperature=0.1: RAG answers should be consistent and literal, not creative. A low temperature makes the output nearly (not strictly) deterministic. Keep it just above 0 so the model doesn't get stuck.
  • Strict system prompt: each rule targets a specific failure mode: hallucinating from outside knowledge ("Use only the context"), giving a confident wrong answer when the context is silent ("If the context does not contain the answer..."), and making unsourced claims ("Cite sources at the end").

The [<source>#<chunk_index>] header placed above each chunk is what teaches the model to cite. Gemini is good at echoing the format it sees in the context.
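Because the citation format is predictable, it can also be checked. The following is a minimal sketch, assuming the model follows the "Sources: <filename>#<chunk_index>, ..." convention from the system prompt; extract_citations and unknown_citations are hypothetical helpers, not part of the project code above.

```python
import re

# Matches "filename#index" tokens such as "rag-wikipedia.txt#0".
CITATION_RE = re.compile(r"([^\s,]+)#(\d+)")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Parse the trailing 'Sources: file#idx, ...' line into (source, chunk_index) pairs."""
    _, marker, tail = answer.rpartition("Sources:")
    if not marker:  # the model did not emit a Sources line
        return []
    return [(src, int(idx)) for src, idx in CITATION_RE.findall(tail)]


def unknown_citations(answer: str, chunks: list[dict]) -> list[tuple[str, int]]:
    """Return cited (source, chunk_index) pairs that were never retrieved."""
    retrieved = {(c["source"], c["chunk_index"]) for c in chunks}
    return [cite for cite in extract_citations(answer) if cite not in retrieved]


# Illustrative data: the model cites chunk 7, which was not in the context.
example_answer = "RAG combines search with generation. Sources: rag-wikipedia.txt#0, rag-wikipedia.txt#7"
example_chunks = [
    {"source": "rag-wikipedia.txt", "chunk_index": 0},
    {"source": "rag-wikipedia.txt", "chunk_index": 1},
]
print(unknown_citations(example_answer, example_chunks))
```

A non-empty result suggests the model cited something it was never shown, which is worth logging even if you do not block the answer.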

The Top-Level ask

One function that ties the two together. This is what the API endpoint will call.

# src/ask.py
"""End-to-end: question → retrieved chunks → grounded answer."""
 
from .generate import generate_answer
from .retrieve import retrieve
 
 
def ask(question: str, k: int = 5) -> dict:
    chunks = retrieve(question, k=k)
    answer = generate_answer(question, chunks)
    return {
        "question": question,
        "answer": answer,
        "sources": [
            {"source": c["source"], "chunk_index": c["chunk_index"], "distance": c["distance"]}
            for c in chunks
        ],
    }
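One possible refinement, sketched here under assumptions: skip the Gemini call entirely when nothing retrieved looks relevant. The ask_guarded name, the injected retrieve_fn and generate_fn parameters, and the 0.7 cutoff (borrowed from the rule of thumb earlier) are all hypothetical; the stubs exist only so the sketch runs without a database or an LLM.

```python
from typing import Callable

FALLBACK = "I don't know based on the provided documents."


def ask_guarded(
    question: str,
    retrieve_fn: Callable[..., list[dict]],
    generate_fn: Callable[[str, list[dict]], str],
    k: int = 5,
    max_distance: float = 0.7,  # assumed cutoff; tune for your model and corpus
) -> dict:
    """Like ask(), but drop distant chunks and skip generation if none remain."""
    chunks = retrieve_fn(question, k=k)
    relevant = [c for c in chunks if c["distance"] <= max_distance]
    answer = generate_fn(question, relevant) if relevant else FALLBACK
    return {
        "question": question,
        "answer": answer,
        "sources": [
            {"source": c["source"], "chunk_index": c["chunk_index"], "distance": c["distance"]}
            for c in relevant
        ],
    }


# Stub retriever and generator standing in for retrieve() and generate_answer().
def fake_retrieve(question: str, k: int = 5) -> list[dict]:
    return [
        {"source": "a.txt", "chunk_index": 0, "content": "...", "distance": 0.2},
        {"source": "b.txt", "chunk_index": 3, "content": "...", "distance": 0.9},
    ][:k]


def fake_generate(question: str, chunks: list[dict]) -> str:
    return f"answered from {len(chunks)} chunk(s)"


result = ask_guarded("What is RAG?", fake_retrieve, fake_generate)
print(result["answer"])
```

Passing the two functions in as arguments also makes the pipeline easy to test, at the cost of a slightly longer signature.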

Try It

# scripts/test_ask.py
import json
from src.ask import ask
 
print(json.dumps(ask("What is retrieval-augmented generation?"), indent=2))

Run it:

uv run python -m scripts.test_ask

You should get something like:

{
  "question": "What is retrieval-augmented generation?",
  "answer": "Retrieval-augmented generation (RAG) is a technique that... [...] Sources: rag-wikipedia.txt#0, rag-wikipedia.txt#1",
  "sources": [
    {"source": "rag-wikipedia.txt", "chunk_index": 0, "distance": 0.142},
    {"source": "rag-wikipedia.txt", "chunk_index": 1, "distance": 0.218}
  ]
}

If you see "I don't know based on the provided documents." and the distances are above ~0.6, the retriever is failing to find good chunks — usually because the question is about something you didn't ingest. Try a question grounded in your sample text.
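The check described above can be written down as a small triage helper. This is a sketch, assuming the result dict has the shape returned by ask(); the diagnose function, its labels, and the 0.6 default are illustrative choices rather than anything the pipeline requires.

```python
FALLBACK = "I don't know based on the provided documents."


def diagnose(result: dict, miss_threshold: float = 0.6) -> str:
    """Classify an ask() result as 'empty', 'retrieval_miss', 'generation_miss', or 'ok'."""
    distances = [s["distance"] for s in result["sources"]]
    if not distances:
        return "empty"  # nothing came back: is the chunks table populated?
    declined = FALLBACK in result["answer"]
    if declined and min(distances) > miss_threshold:
        return "retrieval_miss"  # even the best chunk is far from the question
    if declined:
        return "generation_miss"  # chunks looked close, yet the model declined
    return "ok"


# Illustrative result: the model declined and the nearest chunk is far away.
example = {
    "answer": FALLBACK,
    "sources": [{"source": "a.txt", "chunk_index": 0, "distance": 0.81}],
}
print(diagnose(example))
```

A retrieval miss points at your ingested data or embeddings; a generation miss points at chunking or the prompt.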

When the Answer Looks Wrong

Three knobs cover ~90% of RAG quality issues:

Symptom                        | Knob
Model invents facts            | Lower temperature, tighten system prompt, add "if unsure, say I don't know"
Model misses obvious answers   | Increase k (try 8 or 10); check the chunk size in step 5, since small chunks fragment context
Top-K results aren't relevant  | Re-check that task_type matches between ingest and query; increase chunk overlap

Don't reach for re-rankers or hybrid search until those three are dialed in. The basic knobs almost always solve the problem on their own.
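Turning these knobs is easier with a number to watch. A minimal sketch, assuming you can hand-label a few questions with the source file that should answer them: hit_rate measures how often that file shows up in the top k. The function, the labelled-pair format, and the stub retriever are all hypothetical; in practice you would pass the real retrieve() instead.

```python
from typing import Callable


def hit_rate(
    retrieve_fn: Callable[..., list[dict]],
    labelled: list[tuple[str, str]],
    k: int,
) -> float:
    """Fraction of questions whose expected source appears in the top k results."""
    hits = 0
    for question, expected_source in labelled:
        sources = {c["source"] for c in retrieve_fn(question, k=k)}
        hits += expected_source in sources
    return hits / len(labelled)


# Stub retriever with a fixed ranking per question, standing in for retrieve().
RANKED = {
    "q1": ["a.txt", "b.txt", "c.txt"],
    "q2": ["c.txt", "c.txt", "a.txt"],
}


def fake_retrieve(question: str, k: int = 5) -> list[dict]:
    return [{"source": s} for s in RANKED[question][:k]]


labelled = [("q1", "a.txt"), ("q2", "a.txt")]
rates = {k: hit_rate(fake_retrieve, labelled, k) for k in (1, 3)}
print(rates)
```

If the hit rate climbs sharply as k grows, raising k is a cheap fix; if it stays low at every k, look at chunking and task_type first.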

What You Have Now

  • A retrieval function backed by an HNSW index, which typically keeps latency in the millisecond range even with hundreds of thousands of chunks
  • A generation function with a strict system prompt that forces grounding
  • A single ask() that does the whole thing in two API calls
  • A working end-to-end pipeline. Locally. As yourself.

Next: wrap it in HTTP and put it on the internet.


Reference: pgvector distance operators · google-genai SDK on Vertex AI · Gemini system instructions