1. Embeddings
What it is: A dense vector representation of data (e.g., words, sentences, code).
Why it matters: Converts discrete data (like text) into continuous numerical space that models can process.
Example:
“Dog” → [0.25, -0.12, ..., 0.83]
Words with similar meanings have vectors close in space (semantic similarity).
Used in:
Semantic search in RAG
Input for LLMs
Vector databases
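A minimal sketch of producing embeddings, assuming the sentence-transformers library (the model name here is one common choice, not the only option):

```python
# Sketch: turning text into dense vectors with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dim vectors
vectors = model.encode(["dog", "puppy", "airplane"])

print(vectors.shape)  # (3, 384): one dense vector per input string
```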
2. Vector Spaces
What it is: A high-dimensional space where embeddings live.
Why it matters: Vectors allow fast similarity search using measures like cosine similarity or dot product.
Used in:
Finding relevant documents in RAG
Nearest neighbor searches in FAISS or similar vector DBs
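A toy illustration of "close in space", with made-up 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors; values are invented purely for illustration.
dog      = np.array([0.8, 0.1, 0.0, 0.3])
puppy    = np.array([0.7, 0.2, 0.1, 0.4])
airplane = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(dog, puppy))     # high: semantically close
print(cosine_similarity(dog, airplane))  # low: semantically distant
```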
3. Attention Mechanism
What it is: A technique that allows the model to focus on relevant parts of the input sequence when producing output.
Types:
Self-attention: Used in Transformers; compares all tokens in a sequence to each other.
Cross-attention: Used in encoder–decoder models and the original RAG architecture; decoder queries attend to a different sequence, such as retrieved documents.
Why it matters:
Solves long-range dependency problems in sequences.
Enables parallelism (vs. RNNs).
Key math:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
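The formula above (d_k is the dimensionality of the key vectors), implemented directly in NumPy as a sketch:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                    # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores)            # rows sum to 1: how much to attend where
    return weights @ V                   # weighted sum of value vectors

# 3 tokens, d_k = d_v = 4; random values stand in for learned projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```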
4. Transformers
What it is: The architecture underlying modern LLMs.
Components:
Input Embedding + Positional Encoding
Multi-head Attention
Feed-forward Neural Networks
Layer Normalization
Residual Connections
Why it matters: Allows LLMs to scale, understand context, and generate coherent text.
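A sketch wiring these components together via PyTorch's built-in encoder layer; the hyperparameters are illustrative, not from any particular model:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 10

layer = nn.TransformerEncoderLayer(
    d_model=d_model,      # embedding size
    nhead=n_heads,        # multi-head attention
    dim_feedforward=256,  # feed-forward network width
    batch_first=True,     # (batch, seq, features)
)  # layer normalization and residual connections are applied internally

x = torch.randn(1, seq_len, d_model)  # stand-in for embedded + position-encoded input
print(layer(x).shape)                 # torch.Size([1, 10, 64])
```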
5. Large Language Models (LLMs)
What it is: Neural networks (typically Transformers) trained on massive corpora to predict and generate human-like language.
Examples: GPT, BERT, Claude, Gemini
Key Traits:
Pretraining: On vast text data using next-token prediction or masked language modeling.
Fine-tuning: For specific tasks (e.g., chat, summarization).
Inference: Generates text one token at a time using learned probabilities.
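A sketch of that token-by-token inference loop, using GPT-2 from Hugging Face transformers as a small stand-in for any causal LLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                      # generate 5 tokens greedily
        logits = model(ids).logits          # scores over the whole vocabulary
        next_id = logits[0, -1].argmax()    # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```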
6. Generative AI (GenAI)
What it is: Any AI model that can generate new content (text, images, code, etc.).
In NLP:
Models that produce novel text based on prompts or questions.
LLMs are a subset of GenAI.
Modalities:
Text (GPT, Claude)
Code (Codex)
Images (DALL·E, Midjourney)
Video (Sora)
Audio (MusicGen)
7. Retrieval-Augmented Generation (RAG)
What it is: A hybrid GenAI method that augments LLMs with retrieval from external knowledge.
Flow:
Embed Query → vector space
Retrieve Documents → from vector DB using similarity search
Augment Prompt → LLM receives query + retrieved context
Generate Answer → grounded, up-to-date, accurate
Why it matters:
Reduces hallucination
Enables up-to-date, domain-specific responses
Keeps LLMs smaller and more efficient (vs. training on entire domain data)
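A toy end-to-end sketch of the flow above; the corpus and embedding model are illustrative, and the final generation step is left to whichever LLM API you use:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Java 21 added virtual threads as a stable feature.",
    "Python uses indentation to define blocks.",
    "FAISS is a library for efficient similarity search.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# 1. Embed query  2. Retrieve by similarity  3. Augment prompt
query = "What is new in Java 21?"
q_vec = model.encode([query], normalize_embeddings=True)[0]
best = int(np.argmax(doc_vecs @ q_vec))  # cosine similarity via dot product

prompt = f"Answer using the context.\nContext: {docs[best]}\nQuestion: {query}"
print(prompt)  # 4. Generate: send this prompt to an LLM
```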
8. Tokenization
What it is: Breaking text into tokens (smaller pieces) before inputting into a model.
Example:
“ChatGPT is smart.” → [‘Chat’, ‘G’, ‘PT’, ‘ is’, ‘ smart’, ‘.’]
Why it matters:
LLMs operate on tokens, not raw text.
Affects context length and cost.
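A sketch using OpenAI's tiktoken library; token boundaries differ between models:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models
tokens = enc.encode("ChatGPT is smart.")
print(tokens)                               # list of integer token ids
print([enc.decode([t]) for t in tokens])    # the text piece behind each id
```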
9. Context Window
What it is: The maximum number of tokens a model can consider at once.
Every LLM has a fixed limit (e.g., GPT-4 Turbo supports a 128k-token context window).
Why it matters: Limits how much data (prompt + docs) you can include during RAG.
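A sketch of respecting that limit when packing retrieved chunks into a RAG prompt; the 8,000-token budget and helper name are arbitrary examples:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(chunks: list[str], budget: int = 8000) -> list[str]:
    """Keep retrieved chunks, in order, until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept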
10. Prompt Engineering
What it is: Crafting input prompts to guide the LLM’s behavior.
In RAG: Used to incorporate retrieved documents properly.
Example:
You are a Java expert. Based on the following context, answer the user’s question. Context: [...]. Question: [...]
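The same prompt as a reusable template, a minimal sketch (the field names are arbitrary):

```python
RAG_TEMPLATE = (
    "You are a Java expert. Based on the following context, "
    "answer the user's question.\n"
    "Context: {context}\n"
    "Question: {question}"
)

prompt = RAG_TEMPLATE.format(
    context="Java 21 added virtual threads.",
    question="What are virtual threads?",
)
print(prompt)
```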
11. Vector Databases
What it is: Specialized databases that store and search high-dimensional vectors.
Popular tools: FAISS, Pinecone, Weaviate, Qdrant
Role in RAG:
Store document embeddings
Retrieve semantically relevant docs during generation
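A minimal FAISS sketch, with random vectors standing in for real document embeddings:

```python
import faiss
import numpy as np

d = 384                                  # embedding dimensionality
index = faiss.IndexFlatIP(d)             # exact inner-product index

doc_vecs = np.random.rand(1000, d).astype("float32")
faiss.normalize_L2(doc_vecs)             # normalized IP == cosine similarity
index.add(doc_vecs)                      # store document embeddings

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)     # top-5 most similar documents
print(ids[0], scores[0])
```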
12. Similarity Search
What it is: Finding vectors in the database closest to the query vector.
Common Metrics:
Cosine Similarity
Dot Product
Euclidean Distance
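How the three metrics compare on two toy vectors (higher dot product/cosine means closer; lower Euclidean distance means closer):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

dot = a @ b                                              # dot product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
euclidean = np.linalg.norm(a - b)                        # Euclidean distance

print(dot, cosine, euclidean)
```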
13. Fine-tuning vs. Prompting vs. RAG
| Technique | When to Use |
| --- | --- |
| Fine-tuning | You want to adapt a pretrained model to a new task, style, or domain (it builds on pretraining, not training from scratch) |
| Prompting | Quick instructions that use the model's existing knowledge |
| RAG | Inject external, non-memorized knowledge at query time |