Want a private research assistant that you control?
You can build a personal researcher that answers questions from your notes, PDFs, web clips, and lecture files. You do not need expensive cloud services. With a small VPS, open models, and a vector store, you can run a Retrieval-Augmented Generation (RAG) stack that is private and cheap.
This guide shows a practical path. I focus on setups that work on very low-cost VPS plans (under ₹300/month). I explain trade-offs, give step-by-step commands and code snippets, and show search-quality tips for Indian languages. Ready to make a private RAG on a budget? Let’s begin.
The basic idea — what a budget RAG stack needs
A RAG system has three parts:
- Storage and search — a vector database to keep embeddings and metadata (Qdrant or pgvector).
- Embeddings — convert text chunks into vectors using a small, open model.
- Generation — an LLM that consumes retrieved context and writes a final answer.
On a tiny VPS, you can run the vector DB and embedding jobs. For generation you have two options:
- Use a small local open LLM (slow but private), or
- Call a hosted open-model inference API for generation (cheap per call, keeps VPS small).
This hybrid approach keeps monthly VPS cost low while giving decent results.
Recommended VPS specs and cost trade-offs
Under ₹300/month you can usually get a VPS with:
- 1 vCPU
- 1–2 GB RAM
- 20–40 GB SSD
This is enough to host Qdrant or Postgres+pgvector, and to run embedding jobs in small batches. Running a modern LLM locally on such a machine is usually not practical. For generation, either use a quantized tiny model via llama.cpp (requires more RAM) or use a remote inference endpoint.
If you can spend a bit more later, a 2–4 GB machine helps with performance. Start cheap. Optimize.
Qdrant or pgvector: which should you choose?
Both work. Quick guide:
- Qdrant
  - Purpose-built as a vector database.
  - Easy to run in Docker.
  - Good for simple deployments and fast search.
- pgvector (Postgres + pgvector)
  - Uses Postgres with a vector extension.
  - Slightly lighter if you already use Postgres.
  - Good if you prefer SQL and want metadata queries in the same database.
For beginners on a cheap VPS, Qdrant in Docker is usually the fastest path.
Step-by-step setup (Qdrant + embeddings + RAG)
Below is a minimal, reproducible flow. I assume a basic Linux VPS with apt and Docker available.
1) Prepare the VPS
Run these commands on your VPS as root or sudo user:
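For example, on a Debian or Ubuntu image the following works (package names are for apt; adjust for your distro, and skip the Docker lines if it is already installed):

```bash
# Update packages and install Python tooling for the scripts
apt update && apt upgrade -y
apt install -y python3 python3-venv python3-pip

# Docker for Qdrant -- skip if Docker is already on the machine
apt install -y docker.io
systemctl enable --now docker
```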
Create a Python virtual environment for your scripts:
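Something like this, using ~/rag-env as the location (any path works):

```bash
# Create and activate an isolated environment for the ingestion and query scripts
python3 -m venv ~/rag-env
source ~/rag-env/bin/activate
```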
2) Run Qdrant (Docker)
Start a small Qdrant instance:
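The official Docker image is the simplest route; here I mount a host directory so the data survives container restarts (the container name and storage path are my choices):

```bash
# Run Qdrant in the background, persisting data to ./qdrant_storage on the host
docker run -d --name qdrant \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
```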
This exposes the Qdrant API on port 6333. It works fine on a low-memory VPS for small collections.
3) Install Python libraries
Install the minimal Python stack:
- qdrant-client to talk to Qdrant.
- sentence-transformers for embeddings (small models).
- transformers if you plan to run tiny local LLMs or encoder models.
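Inside the virtual environment, that is a single install (pin versions if you want reproducible builds):

```bash
pip install qdrant-client sentence-transformers transformers
```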
4) Choose an embedding model (multilingual for Indian languages)
For Indian languages, pick a multilingual embedding model. Use a small sentence-transformers model to save RAM and CPU. Example name: a compact multilingual MiniLM variant. These models give good retrieval for Hindi, Tamil, Bengali, and mixed text.
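As a concrete sketch, this is how loading such a model looks with sentence-transformers. I use paraphrase-multilingual-MiniLM-L12-v2 as one compact multilingual option; any small multilingual model works the same way:

```python
from sentence_transformers import SentenceTransformer

# A compact multilingual model; swap in any small multilingual variant you prefer
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = [
    "भारत की राजधानी नई दिल्ली है।",       # Hindi
    "The capital of India is New Delhi.",   # English
]
vectors = model.encode(texts, normalize_embeddings=True)
print(vectors.shape)  # (2, 384): both sentences land in the same 384-dim space
```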
5) Ingest documents, chunk, and embed
Chunking tips: use 300–500 token chunks with 50–100 token overlap. Smaller chunks help retrieval precision for short queries and for Indian languages with mixed scripts.
Example ingestion script (outline):
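A minimal sketch of that outline, assuming plain-text files in a docs/ folder and the model from the previous step (the collection name, chunk sizes, and payload fields are my choices):

```python
import uuid
from pathlib import Path

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
COLLECTION = "notes"

# Create the collection once; the vector size must match the embedding model (384 here)
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def chunk_text(text, size=400, overlap=80):
    """Naive word-based chunking: roughly 300-500 token chunks with 50-100 token overlap."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

points = []
for path in Path("docs").glob("*.txt"):
    for chunk in chunk_text(path.read_text(encoding="utf-8")):
        vector = model.encode(chunk, normalize_embeddings=True).tolist()
        points.append(PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload={"text": chunk, "source": path.name, "lang": "hi"},
        ))

# One batch is fine for small corpora; send smaller batches if RAM is tight
client.upsert(collection_name=COLLECTION, points=points)
```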
Store useful metadata with each chunk: source, lang, date, and author. Use it for filtering and boosting later.
6) Build a retriever + generator pipeline
When a query arrives:
- Embed the query with the same embedding model.
- Use Qdrant search to get top K chunks (K=4 or 6).
- Concatenate the chunks into a context block.
- Send the user prompt + context to a generator (local small model or remote inference API).
Simple retrieval call:
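A minimal sketch, reusing the client, model, and collection from the ingestion script:

```python
query = "दिल्ली की जनसंख्या कितनी है?"  # example Hindi query

query_vector = model.encode(query, normalize_embeddings=True).tolist()

hits = client.search(
    collection_name=COLLECTION,
    query_vector=query_vector,
    limit=4,  # top K chunks
)

# Concatenate retrieved chunks into one context block for the generator
context = "\n\n".join(hit.payload["text"] for hit in hits)
```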
Generator prompt pattern:
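One pattern that works well is to pin the model to the retrieved context and allow an explicit "I don't know"; the exact wording below is just a starting point:

```python
prompt = f"""You are a personal research assistant.
Answer the question using ONLY the context below.
If the context does not contain the answer, say "I don't know."
Answer in the same language as the question.

Context:
{context}

Question: {query}
Answer:"""
```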
If you use a remote LLM API for generation, send this combined prompt. If you run a tiny LLM locally (llama.cpp based), keep prompts short.
Indian language tips — make search work for Hindi, Tamil, and more
Regional languages need extra care. Here are practical tips:
- Use a multilingual embedding model. It keeps Hindi, Tamil, Bengali, and English vectors in the same space.
- Normalize text: apply Unicode NFKC normalization, lowercasing (if appropriate), and remove zero-width characters (see the sketch after this list).
- Handle Romanized text: many users type Hindi in Latin script. Transliterate or run a quick transliteration pass to map Roman Hindi to Devanagari before embedding.
- Stemming and stopwords: avoid aggressive stemming for Indian languages; rely on embeddings instead. Use light stopword removal only if it helps for your dataset.
- Include metadata language tags: store lang in the payload and prefer same-language chunks when responding. If a query is in Hindi, boost Hindi chunks.
- Test mixed-language queries: South Asian queries often mix English and a regional language in one sentence. Keep a few such prompts in your test set.
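Here is a minimal pre-embedding normalization sketch using only the Python standard library. I leave transliteration of Romanized Hindi out because it needs an extra library (for example the indic-transliteration package) and some care with informal spellings:

```python
import re
import unicodedata

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")  # ZWSP, ZWNJ, ZWJ, BOM

def normalize(text: str) -> str:
    """NFKC-normalize, strip zero-width characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = ZERO_WIDTH.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("नमस्ते\u200b दुनिया"))  # -> नमस्ते दुनिया
```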
Search quality tips — chunking, overlap, and reranking
Good retrieval matters more than any fancy LLM. Try these:
- Chunk size: 200–500 tokens with overlap works well.
- Overlap: 50–100 tokens reduces missing context at chunk edges.
- K and rerank: fetch top 10 by vector similarity and rerank by a cheap lexical score (e.g., TF-IDF) or a cross-encoder if you have capacity. This improves precision for exact facts.
- Metadata filters: if query mentions a date or source, filter results to match.
- Recency boost: for news or event data, boost later timestamps.
Reranking can be done cheaply with a small cross-encoder on a separate machine or via a simple lexical filter on the VPS.
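As one cheap option, here is a TF-IDF rerank sketch over the top 10 vector hits. It assumes scikit-learn is installed; a cross-encoder would be more accurate but needs more CPU and RAM:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexical_rerank(query, hits, keep=4):
    """Re-order Qdrant hits by TF-IDF similarity to the query and keep the best few."""
    docs = [hit.payload["text"] for hit in hits]
    tfidf = TfidfVectorizer().fit(docs + [query])
    scores = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in ranked[:keep]]

# Fetch a wider candidate set from Qdrant, then rerank lexically
candidates = client.search(collection_name=COLLECTION, query_vector=query_vector, limit=10)
top_chunks = lexical_rerank(query, candidates, keep=4)
```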
Cost-saving tricks
- Batch embeddings: create embeddings on your desktop or on the VPS during off-peak hours.
- Use small models for embeddings; embeddings cost less CPU.
- Cache frequent queries and answers to avoid repeated generation calls (a small sketch follows this list).
- Run generation remotely if the local CPU is too slow; pay per call rather than keeping a big VM running.
- Prune old vectors you rarely use to save storage and speed up search.
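For the caching point, a minimal on-disk cache sketch, keyed by a hash of the normalized query (the file name and the normalize() helper from the language section are my choices):

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("answer_cache.json")

def cached_answer(query, generate):
    """Return a cached answer if we have one; otherwise call generate(query) and store it."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = generate(query)
        CACHE_FILE.write_text(json.dumps(cache, ensure_ascii=False))
    return cache[key]
```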
Privacy and backups
- Keep sensitive data on your VPS only. Do not send private docs to public APIs if privacy is a must.
- Back up your collection (Qdrant supports snapshots/export; a small sketch follows this list). Keep daily exports if the data matters.
- Use simple access tokens and an SSH key. Do not open Qdrant publicly—use an SSH tunnel or reverse proxy with auth.
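For Qdrant, a snapshot can be triggered from the Python client; a minimal sketch you could run from a daily cron job (copying the snapshot file off the VPS is up to you):

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Create a point-in-time snapshot of the collection on the Qdrant server
snapshot = client.create_snapshot(collection_name="notes")
print(snapshot.name)  # the snapshot file name; copy it off the VPS for safekeeping
```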
A simple testing checklist
Before you call it “done”, check:
- Can the retriever find the exact paragraph for a target query?
- Are answers citing the correct context chunk?
- Do queries in Hindi return Hindi context?
- Is latency acceptable (<2s for retrieval, <6s for generation if remote)?
- Does the system refuse to invent facts for unknown queries? Add guardrails that show “I don’t know” when confidence is low.
Conclusion — Start small, grow wisely
You do not need a big budget to build a useful personal researcher. A cheap VPS under ₹300/month plus open models and Qdrant (or pgvector) gives you a private RAG system. Use small multilingual embedding models for Indian languages. Keep chunking tight. Use a hybrid approach: local vector DB + remote small generation for the best cost-performance trade-off.
Ready to try this week? Pick a small set of documents — lecture notes or a few PDFs — and run the ingestion script. Then ask your first question. You will be surprised how handy a private researcher can be.
Happy building. Keep it private. Keep it simple.