Managed RAG in 15 minutes


A language model is confident about everything and certain about nothing. Ask it about your company's refund policy and it will happily invent one. RAG fixes that by doing something almost boringly simple: before the model answers, it goes and finds the actual passages from your own documents that look relevant, and pastes them into the question. The model isn't remembering your handbook. It's reading the three paragraphs that matter, right before it replies.

The clever part is that you never see the plumbing. You upload a PDF, ask a question in plain English, and get back an answer with the exact page it came from. In the next fifteen minutes you'll build that, and the only infrastructure you run is a file upload.
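To make the find-then-paste idea concrete, here is a toy sketch in plain Python. It is not how Ringside retrieves: word-overlap scoring stands in for embeddings, and the passages and the `retrieve` and `build_prompt` helpers are invented for illustration.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical handbook passages; a real store holds chunks parsed from your files.
PASSAGES = [
    "Refunds are issued within 14 days of purchase with a receipt.",
    "Expense reports are due by the fifth business day of each month.",
    "Remote employees receive a yearly equipment stipend.",
]

def tokens(text):
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, passages, k=1):
    """Return the k passages that score highest against the question."""
    q = tokens(question)
    ranked = sorted(passages, key=lambda p: cosine(q, tokens(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Paste the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these passages:\n{context}\n\nQuestion: {question}"

question = "When are expense reports due?"
prompt = build_prompt(question, retrieve(question, PASSAGES))
print(prompt)
```

The model only ever sees `prompt`, so its answer is grounded in the passage that was pasted in. The managed service replaces `tokens` and `cosine` with real embeddings and an index.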

Prerequisites

• A Ringside account (sign up at ringside.fightclub.pro/register)
• Python 3.9+ with openai >= 1.40 installed
• A PDF you don't mind uploading. A product handbook, a research paper, anything.
• 15 minutes

Step 1

Get an API key + an assistant

// one-time setup

Mint an API key at ringside.fightclub.pro/app/api-keys and export it as FC_API_KEY. While you're in the dashboard, create an Assistant under /app/assistants with the instructions 'Answer using the supplied files. Cite the file_id and chunk index for every claim.' Copy its asst_ ID; we'll use it in Step 5.

export FC_API_KEY=fc_sk_live_...
export FC_ASSISTANT_ID=asst_...
pip install --upgrade openai
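Before moving on, a quick preflight check can catch a missing export. This is a minimal sketch that only assumes the two variable names exported above; `missing_settings` is a hypothetical helper, not part of any SDK.

```python
import os

# The two variables exported in this step.
REQUIRED = ("FC_API_KEY", "FC_ASSISTANT_ID")

def missing_settings(env):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

missing = missing_settings(os.environ)
if missing:
    print("Set these before continuing:", ", ".join(missing))
else:
    print("Environment looks ready.")
```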

Step 2

Create a vector store

// one tenant per customer

One vector store per customer is the standard pattern for a multi-tenant app. The embedding_model is fixed when the store is created. To change it later, use the dashboard's migrate flow: the re-embed runs in the background from cached parses, so you pay only for embedding tokens.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fightclub.pro/v1",
    api_key=os.environ["FC_API_KEY"],
)

store = client.vector_stores.create(
    name="acme-handbook",
    embedding_model="text-embedding-3-small",
)
print("store id:", store.id)
# => store id: vs_a1b2c3d4...
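The one-store-per-customer pattern could be wrapped in a small lookup like the one below. It is a sketch under assumptions: `store_for_customer`, the `store_ids` dict (standing in for a table in your own database), and the `customer-<id>` naming scheme are all hypothetical; only `client.vector_stores.create` comes from the step above.

```python
def store_for_customer(client, customer_id, store_ids):
    """Return the customer's vector store id, creating the store on first use.

    store_ids maps customer id -> vector store id. It stands in for your own
    database table and is not a Ringside feature.
    """
    if customer_id not in store_ids:
        store = client.vector_stores.create(
            name=f"customer-{customer_id}",  # hypothetical naming scheme
            embedding_model="text-embedding-3-small",
        )
        store_ids[customer_id] = store.id
    return store_ids[customer_id]

# Usage, with the client built above:
# store_id = store_for_customer(client, "cus_42", store_ids={})
```

Persisting the mapping on your side means a returning customer reuses the same store instead of getting a fresh, empty one.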

Want the graph too? Set graphrag_enabled to true at create time (or PATCH it on later). With it on, ingest also builds a knowledge graph of the entities across your files; file_search then walks those links and returns a facts array alongside the usual chunk citations. That helps when the answer is spread across documents that a nearest-chunk match alone would miss. The graph bills as its own per-GB-day storage line (first 1 GB-day per store per day free); leave it off and you pay nothing extra. It's a Ringside extension, so pass it via the SDK's extra_body:

# Use this in place of the create call above, not in addition to it.
store = client.vector_stores.create(
    name="acme-handbook",
    embedding_model="text-embedding-3-small",
    extra_body={"graphrag_enabled": True},
)
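To get a feel for the graph storage line, the free allowance can be worked through as simple arithmetic. This sketch assumes the 1 GB-day allowance is subtracted from each day's stored graph size, which is one reading of the wording above. The per-GB-day rate is not stated here, so the sketch stops at billable GB-days; the sizes are made up.

```python
FREE_GB_DAYS_PER_STORE_PER_DAY = 1.0  # the allowance described above

def billable_gb_days(daily_sizes_gb):
    """Sum the billable GB-days for one store, given its graph size on each day."""
    return sum(
        max(0.0, size - FREE_GB_DAYS_PER_STORE_PER_DAY)
        for size in daily_sizes_gb
    )

# Hypothetical store: the graph is 0.4 GB for 10 days, then 2.5 GB for 20 days.
sizes = [0.4] * 10 + [2.5] * 20
print(billable_gb_days(sizes))  # 30.0
```

The first ten days fall under the allowance and cost nothing; the remaining twenty each contribute 1.5 billable GB-days.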

Step 3

Upload a file + attach it to the store

// async ingest starts here

Upload returns a file ID synchronously. Attaching the file to the vector store kicks off the async ingest pipeline (parse + chunk + embed + index). The attach call returns immediately with status='pending'.

with open("handbook.pdf", "rb") as fp:
    file = client.files.create(file=fp, purpose="attachments")
print("file id:", file.id)
# => file id: file_xyz789...

vsf = client.vector_stores.files.create(
    vector_store_id=store.id,
    file_id=file.id,
)
print("vsf status:", vsf.status)
# => vsf status: pending

Step 4

Wait for ingest to finish

// poll, or subscribe to a webhook

For a tutorial we poll. In production, register a vector_store.file.completed webhook so your worker fires when the file is searchable. Ingest for a 30-page PDF lands in seconds. A 300-page corpus runs in a couple of minutes.

import time

# Poll every two seconds until the file reaches a terminal state.
while True:
    f = client.vector_stores.files.retrieve(
        vector_store_id=store.id,
        file_id=file.id,
    )
    # Print the error alongside the status when there is one.
    print(f"  {f.status}", f.last_error or "")
    if f.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)
# Expected progression: pending -> in_progress -> completed
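The loop above waits forever if a file never leaves in_progress. One way to bound it is a helper with a deadline, sketched here. `wait_for_ingest` and its parameters are hypothetical; the terminal statuses are the three the loop already checks.

```python
import time

TERMINAL = ("completed", "failed", "cancelled")

def wait_for_ingest(fetch_status, timeout_s=300.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or the deadline passes.

    fetch_status is any zero-argument callable that returns the status string.
    sleep and clock are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"ingest still {status!r} after {timeout_s} seconds")
        sleep(interval_s)

# Usage, with the objects from the steps above:
# status = wait_for_ingest(
#     lambda: client.vector_stores.files.retrieve(
#         vector_store_id=store.id, file_id=file.id
#     ).status
# )
```

Passing the status lookup in as a callable keeps the helper independent of the client, so it can be tested without network calls.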

Step 5

Ask a question via file_search

// Assistants run with the tool config

The retrieval call is an Assistants run with the file_search tool config pointing at your store. The assistant's instructions tell the model what to do with the retrieved chunks; the run does the embed-the-query, retrieve, stuff-into-context dance for you.

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What's the company-wide expense reporting cut-off?",
)

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=os.environ["FC_ASSISTANT_ID"],
    tools=[{
        "type": "file_search",
        "file_search": {"vector_store_ids": [store.id]},
    }],
    # Optional but recommended: attribute the call to the end-customer who triggered it
    extra_headers={"FC-Customer": "cus_42"},
)

messages = client.beta.threads.messages.list(thread_id=thread.id, order="desc", limit=1)
answer = messages.data[0]
print(answer.content[0].text.value)
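The snippet above reads the first content part and assumes the run succeeded. A slightly more defensive version might look like the sketch below. It assumes Ringside mirrors the OpenAI SDK's message shape (content parts with a `type` and, for text parts, `text.value`) and a "completed" run status; `message_text` and `answer_or_error` are hypothetical helpers.

```python
def message_text(message):
    """Concatenate the text parts of a message, skipping non-text parts."""
    parts = []
    for part in message.content:
        if getattr(part, "type", None) == "text":
            parts.append(part.text.value)
    return "\n".join(parts)

def answer_or_error(run, message):
    """Return the answer text for a completed run, otherwise a short failure note."""
    if run.status != "completed":
        return f"run ended with status {run.status!r}"
    return message_text(message)

# Usage, with the objects from the step above:
# print(answer_or_error(run, messages.data[0]))
```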

Step 6

Read the citations

// from the query log, not message annotations

Retrieval records what it matched, and you read it back from the store's query log. Each logged query carries the chunks that were retrieved for it, with the file_id and chunk index behind every passage the model was shown. That is the audit trail: for any answer, you can show which passage of which file it came from.

# The store's query log records what retrieval actually matched.
queries = client.get(
    f"/vector_stores/{store.id}/queries",
    cast_to=dict,
)

for q in queries["data"][:1]:
    print("query:", q["query"])
    for chunk in q.get("chunks", []):
        print(f"  file_id={chunk['file_id']} chunk_index={chunk['chunk_index']}")
        src = client.files.retrieve(chunk["file_id"])
        print(f"    -> {src.filename}")
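The loop above looks up the filename once per chunk. When several chunks come from the same file, grouping them first means one lookup per file. This is a minimal sketch that relies only on the `chunks`, `file_id`, and `chunk_index` fields read above; the sample entry is made up, and `chunks_by_file` is a hypothetical helper.

```python
from collections import defaultdict

def chunks_by_file(query_entry):
    """Map file_id -> sorted chunk indexes for one logged query."""
    grouped = defaultdict(set)
    for chunk in query_entry.get("chunks", []):
        grouped[chunk["file_id"]].add(chunk["chunk_index"])
    return {file_id: sorted(indexes) for file_id, indexes in grouped.items()}

# Hypothetical log entry using only the fields read above.
entry = {
    "query": "expense reporting cut-off",
    "chunks": [
        {"file_id": "file_a", "chunk_index": 7},
        {"file_id": "file_b", "chunk_index": 2},
        {"file_id": "file_a", "chunk_index": 3},
    ],
}
print(chunks_by_file(entry))  # {'file_a': [3, 7], 'file_b': [2]}
```

From there, call client.files.retrieve once per key and render each filename with its chunk indexes.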

Heads up

Citations do not currently arrive as annotations on the assistant message. The text.annotations array on a returned message is always empty today; the retrieval citations live in the file_search tool output the model consumes, and in the per-store query log shown above. Message-level citation annotations are a tracked follow-up. Until that ships, read citations from the query log. Code that reads the query log will keep working after annotations arrive.

What you just shipped

A customer uploads a file, your app attaches it to that customer's vector store, and it can then answer questions about the file and show which passage each answer came from. The retrieval log, per-customer cost attribution, embedding-model migration, and the rest of the RAG plumbing live on our side; your code is the six steps above.

Next steps

· The RAG guide for why any of this works, one piece at a time.
· Citation-parsing recipe for the production-quality version of Step 6.
· Vector stores API reference for the full endpoint list (list/get/patch/delete, file batches, cancel/retry, queries, stats, migrate, rollback).
· RAG product page for the broader pitch and pricing.
· RAG pricing if you want to model your monthly cost before you scale.