Sunday, May 25, 2025

Core Architecture Concepts in RAG, LLMs & GenAI

 

1. Embeddings

  • What it is: A dense vector representation of data (e.g., words, sentences, code).

  • Why it matters: Converts discrete data (like text) into continuous numerical space that models can process.

  • Example:

    • “Dog” → [0.25, -0.12, ..., 0.83]

    • Words with similar meanings have vectors close in space (semantic similarity).

  • Used in:

    • Semantic search in RAG

    • Input for LLMs

    • Vector databases (see the sketch below)
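
To make the “Dog” example concrete, here is a minimal sketch using the sentence-transformers library (one common choice; any embedding model works the same way):

    # Minimal embedding sketch; assumes sentence-transformers is installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each string becomes a dense vector (384 dimensions for this model).
    vectors = model.encode(["dog", "puppy", "car"])
    print(vectors.shape)  # (3, 384)
    # "dog" and "puppy" land closer together in vector space than "dog" and "car".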


2. Vector Spaces

  • What it is: A high-dimensional space where embeddings live.

  • Why it matters: Vectors allow fast similarity search using measures like cosine similarity or dot product.

  • Used in:

    • Finding relevant documents in RAG

    • Nearest neighbor searches in FAISS or similar vector DBs (see the sketch below)
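
A minimal nearest-neighbor sketch with FAISS, using random vectors as stand-ins for real embeddings:

    # Nearest-neighbor search over toy vectors with FAISS.
    import numpy as np
    import faiss

    d = 64                                  # embedding dimension
    docs = np.random.rand(1000, d).astype("float32")
    faiss.normalize_L2(docs)                # normalized dot product = cosine

    index = faiss.IndexFlatIP(d)            # exact inner-product index
    index.add(docs)

    query = np.random.rand(1, d).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)    # top-5 most similar vectors
    print(ids[0], scores[0])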


3. Attention Mechanism

  • What it is: A technique that allows the model to focus on relevant parts of the input sequence when producing output.

  • Types:

    • Self-attention: Used in Transformers; compares all tokens in a sequence to each other.

    • Cross-attention: Used in encoder-decoder models (including the original RAG architecture); the decoder’s queries attend to encoded inputs such as retrieved documents.

  • Why it matters:

    • Solves long-range dependency problems in sequences.

    • Enables parallelism (vs. RNNs).

  • Key math:

    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
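
The formula translates almost line-for-line into NumPy; a minimal single-head sketch:

    # Scaled dot-product attention, a direct translation of the formula above.
    import numpy as np

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)              # query-key similarity
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)        # softmax over keys
        return w @ V                                 # weighted sum of values

    Q = np.random.rand(4, 8)   # 4 query tokens, d_k = 8
    K = np.random.rand(6, 8)   # 6 key tokens
    V = np.random.rand(6, 8)   # one value vector per key
    print(attention(Q, K, V).shape)  # (4, 8)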


4. Transformers

  • What it is: The architecture underlying modern LLMs.

  • Components:

    • Input Embedding + Positional Encoding

    • Multi-head Attention

    • Feed-forward Neural Networks

    • Layer Normalization

    • Residual Connections

  • Why it matters: Allows LLMs to scale, understand context, and generate coherent text (see the sketch below).
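
A minimal sketch of one encoder block in PyTorch, wiring the components above together (the pre-norm arrangement is one common variant, an assumption here):

    # One Transformer encoder block built from the components listed above.
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            n = self.norm1(x)
            h, _ = self.attn(n, n, n)          # multi-head self-attention
            x = x + h                          # residual connection
            return x + self.ff(self.norm2(x))  # feed-forward + residual

    block = TransformerBlock()
    tokens = torch.rand(1, 10, 512)   # (batch, sequence, d_model)
    print(block(tokens).shape)        # torch.Size([1, 10, 512])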


5. Large Language Models (LLMs)

  • What it is: Neural networks (typically Transformers) trained on massive corpora to predict and generate human-like language.

  • Examples: GPT, BERT, Claude, Gemini

  • Key Traits:

    • Pretraining: On vast text data using next-token prediction or masked language modeling.

    • Fine-tuning: For specific tasks (e.g., chat, summarization).

    • Inference: Generates text one token at a time using learned probabilities (see the sketch below).
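
A small inference sketch with the Hugging Face transformers library, using GPT-2 as a stand-in for larger models:

    # Token-by-token generation with a small open model (GPT-2).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The Transformer architecture", return_tensors="pt")
    # generate() repeatedly picks the next token from learned probabilities.
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0]))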


6. Generative AI (GenAI)

  • What it is: Any AI model that can generate new content (text, images, code, etc.).

  • In NLP:

    • Models that produce novel text based on prompts or questions.

    • LLMs are a subset of GenAI.

  • Modalities:

    • Text (GPT, Claude)

    • Code (Codex)

    • Images (DALL·E, Midjourney)

    • Video (Sora)

    • Audio (MusicGen)


7. Retrieval-Augmented Generation (RAG)

  • What it is: A hybrid GenAI method that augments LLMs with retrieval from external knowledge.

  • Flow:

    1. Embed Query → vector space

    2. Retrieve Documents → from vector DB using similarity search

    3. Augment Prompt → LLM receives query + retrieved context

    4. Generate Answer → grounded, up-to-date, accurate

  • Why it matters:

    • Reduces hallucination

    • Enables up-to-date, domain-specific responses

    • Keeps LLMs smaller and more efficient than fine-tuning on entire domain corpora (see the end-to-end sketch below)
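
An end-to-end sketch of the four-step flow. Here embed, vector_db, and llm are hypothetical placeholders for whatever embedding model, vector store, and LLM client you actually use:

    # End-to-end RAG flow sketch; `embed`, `vector_db`, and `llm` are
    # hypothetical placeholders, not any specific library's API.
    def answer(question, embed, vector_db, llm, k=3):
        query_vec = embed(question)                      # 1. Embed Query
        docs = vector_db.search(query_vec, top_k=k)      # 2. Retrieve Documents
        context = "\n\n".join(d.text for d in docs)      # 3. Augment Prompt
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return llm.generate(prompt)                      # 4. Generate Answer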


8. Tokenization

  • What it is: Breaking text into tokens (smaller pieces) before inputting into a model.

  • Example:

    • “ChatGPT is smart.” → [‘Chat’, ‘G’, ‘PT’, ‘ is’, ‘ smart’, ‘.’]

  • Why it matters:

    • LLMs operate on tokens, not raw text.

    • Affects context length and cost (see the sketch below).
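
A quick sketch with tiktoken, OpenAI’s open-source tokenizer; the exact split varies by model vocabulary:

    # Tokenization sketch with tiktoken; token splits differ per model.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")    # encoding used by GPT-4-era models
    ids = enc.encode("ChatGPT is smart.")
    print([enc.decode([i]) for i in ids])         # the text piece behind each token ID
    print(len(ids), "tokens")                     # drives both cost and context usage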


9. Context Window

  • What it is: The maximum number of tokens a model can consider at once.

  • LLMs have limits (e.g., GPT-4 Turbo supports a 128k-token context; earlier GPT-4 variants were limited to 8k or 32k).

  • Why it matters: Limits how much data (prompt + docs) you can include during RAG (see the budgeting sketch below).
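
A minimal budgeting sketch: trim retrieved documents so prompt plus context stays inside the window, counting tokens with tiktoken as above:

    # Fit retrieved documents into a fixed token budget before prompting.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def fit_to_window(docs, budget_tokens):
        kept, used = [], 0
        for doc in docs:                 # docs assumed sorted by relevance
            n = len(enc.encode(doc))
            if used + n > budget_tokens:
                break                    # stop before overflowing the window
            kept.append(doc)
            used += n
        return kept

    print(fit_to_window(["short doc", "a much longer document ..."], 1000))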


10. Prompt Engineering

  • What it is: Crafting input prompts to guide the LLM’s behavior.

  • In RAG: Used to incorporate retrieved documents properly.

  • Example:

    You are a Java expert. Based on the following context, answer the user’s question. Context: [...]. Question: [...]
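
The same template expressed as a small Python helper, as one might wire it into a RAG pipeline (the function name is illustrative):

    # Build the RAG prompt above from retrieved context and a user question.
    def build_prompt(context: str, question: str) -> str:
        return (
            "You are a Java expert. Based on the following context, "
            "answer the user's question.\n"
            f"Context: {context}\n"
            f"Question: {question}"
        )

    print(build_prompt("Streams were added in Java 8 ...", "When were streams added?"))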



11. Vector Databases

  • What it is: Specialized databases that store and search high-dimensional vectors.

  • Popular tools: FAISS, Pinecone, Weaviate, Qdrant

  • Role in RAG:

    • Store document embeddings

    • Retrieve semantically relevant docs during generation (see the sketch below)
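
A sketch of the storage side: keeping the original texts alongside their row positions in a FAISS index so search results map back to documents (managed vector DBs handle this bookkeeping for you):

    # Store documents next to their embeddings so results map back to text.
    import numpy as np
    import faiss

    texts = ["RAG reduces hallucination.",
             "Transformers use attention.",
             "FAISS does similarity search."]
    vectors = np.random.rand(len(texts), 64).astype("float32")  # stand-in embeddings

    index = faiss.IndexFlatL2(64)
    index.add(vectors)                        # row i in the index = texts[i]

    query = np.random.rand(1, 64).astype("float32")
    _, ids = index.search(query, 2)
    print([texts[i] for i in ids[0]])         # the retrieved documents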


12. Similarity Search

  • What it is: Finding vectors in the database closest to the query vector.

  • Common Metrics (see the sketch below):

    • Cosine Similarity

    • Dot Product

    • Euclidean Distance
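
All three metrics in a few lines of NumPy:

    # The three common similarity/distance measures side by side.
    import numpy as np

    a = np.array([0.25, -0.12, 0.83])
    b = np.array([0.20, -0.10, 0.80])

    dot = a @ b                                             # Dot Product
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # Cosine Similarity
    euclidean = np.linalg.norm(a - b)                       # Euclidean Distance
    print(dot, cosine, euclidean)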


13. Fine-tuning vs. Prompting vs. RAG

  Technique    | When to Use
  -------------|--------------------------------------------------------------
  Fine-tuning  | Adapt the model's weights to a new task, domain, or style
  Prompting    | Give quick instructions that rely on the model's existing knowledge
  RAG          | Inject external, up-to-date knowledge the model has not memorized

The overall RAG pipeline at a glance:


          ┌─────────────┐
          │  User Query │
          └─────┬───────┘
                │
                ▼
        ┌──────────────┐
        │  Embed Query │
        └─────┬────────┘
              ▼
    ┌─────────────────────┐
    │   Vector DB Search  │  ← uses cosine similarity
    └─────┬───────────────┘
          ▼
  ┌───────────────────────┐
  │  Retrieved Documents  │
  └─────┬─────────────────┘
        ▼
┌────────────────────────────┐
│ Prompt + Retrieved Context │
└─────┬──────────────────────┘
      ▼
┌────────────────┐
│     LLM        │
│  (e.g. GPT-4)  │
└─────┬──────────┘
      ▼
┌─────────────┐
│   Answer    │
└─────────────┘
