Sunday, May 25, 2025

Core Architecture Concepts in RAG, LLMs & GenAI

 

1. Embeddings

  • What it is: A dense vector representation of data (e.g., words, sentences, code).

  • Why it matters: Converts discrete data (like text) into continuous numerical space that models can process.

  • Example:

    • “Dog” → [0.25, -0.12, ..., 0.83]

    • Words with similar meanings have vectors close in space (semantic similarity).

  • Used in:

    • Semantic search in RAG

    • Input for LLMs

    • Vector databases (see the sketch below)
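
To make the “Dog” example concrete, here is a minimal sketch using the sentence-transformers library (one common choice; any embedding model works the same way):

    # Minimal embedding sketch; assumes sentence-transformers is installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each string becomes a dense vector (384 dimensions for this model).
    vectors = model.encode(["dog", "puppy", "car"])
    print(vectors.shape)  # (3, 384)
    # "dog" and "puppy" land closer together in vector space than "dog" and "car".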


2. Vector Spaces

  • What it is: A high-dimensional space where embeddings live.

  • Why it matters: Vectors allow fast similarity search using measures like cosine similarity or dot product.

  • Used in:

    • Finding relevant documents in RAG

    • Nearest neighbor searches in FAISS or similar vector DBs (see the sketch below)
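
A minimal nearest-neighbor sketch with FAISS, using random vectors as stand-ins for real embeddings:

    # Nearest-neighbor search over toy vectors with FAISS.
    import numpy as np
    import faiss

    d = 64                                  # embedding dimension
    docs = np.random.rand(1000, d).astype("float32")
    faiss.normalize_L2(docs)                # normalized dot product = cosine

    index = faiss.IndexFlatIP(d)            # exact inner-product index
    index.add(docs)

    query = np.random.rand(1, d).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)    # top-5 most similar vectors
    print(ids[0], scores[0])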


3. Attention Mechanism

  • What it is: A technique that allows the model to focus on relevant parts of the input sequence when producing output.

  • Types:

    • Self-attention: Used in Transformers; compares all tokens in a sequence to each other.

    • Cross-attention: Used in encoder-decoder models (including the original RAG architecture); the decoder’s queries attend to encoded inputs such as retrieved documents.

  • Why it matters:

    • Solves long-range dependency problems in sequences.

    • Enables parallelism (vs. RNNs).

  • Key math:

    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
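
The formula translates almost line-for-line into NumPy; a minimal single-head sketch:

    # Scaled dot-product attention, a direct translation of the formula above.
    import numpy as np

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)              # query-key similarity
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)        # softmax over keys
        return w @ V                                 # weighted sum of values

    Q = np.random.rand(4, 8)   # 4 query tokens, d_k = 8
    K = np.random.rand(6, 8)   # 6 key tokens
    V = np.random.rand(6, 8)   # one value vector per key
    print(attention(Q, K, V).shape)  # (4, 8)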


4. Transformers

  • What it is: The architecture underlying modern LLMs.

  • Components:

    • Input Embedding + Positional Encoding

    • Multi-head Attention

    • Feed-forward Neural Networks

    • Layer Normalization

    • Residual Connections

  • Why it matters: Allows LLMs to scale, understand context, and generate coherent text (see the sketch below).
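
A minimal sketch of one encoder block in PyTorch, wiring the components above together (the pre-norm arrangement is one common variant, an assumption here):

    # One Transformer encoder block built from the components listed above.
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            n = self.norm1(x)
            h, _ = self.attn(n, n, n)          # multi-head self-attention
            x = x + h                          # residual connection
            return x + self.ff(self.norm2(x))  # feed-forward + residual

    block = TransformerBlock()
    tokens = torch.rand(1, 10, 512)   # (batch, sequence, d_model)
    print(block(tokens).shape)        # torch.Size([1, 10, 512])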


5. Large Language Models (LLMs)

  • What it is: Neural networks (typically Transformers) trained on massive corpora to predict and generate human-like language.

  • Examples: GPT, BERT, Claude, Gemini

  • Key Traits:

    • Pretraining: On vast text data using next-token prediction or masked language modeling.

    • Fine-tuning: For specific tasks (e.g., chat, summarization).

    • Inference: Generates text one token at a time using learned probabilities (see the sketch below).
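
A small inference sketch with the Hugging Face transformers library, using GPT-2 as a stand-in for larger models:

    # Token-by-token generation with a small open model (GPT-2).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The Transformer architecture", return_tensors="pt")
    # generate() repeatedly picks the next token from learned probabilities.
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0]))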


6. Generative AI (GenAI)

  • What it is: Any AI model that can generate new content (text, images, code, etc.).

  • In NLP:

    • Models that produce novel text based on prompts or questions.

    • LLMs are a subset of GenAI.

  • Modalities:

    • Text (GPT, Claude)

    • Code (Codex)

    • Images (DALL·E, Midjourney)

    • Video (Sora)

    • Audio (MusicGen)


7. Retrieval-Augmented Generation (RAG)

  • What it is: A hybrid GenAI method that augments LLMs with retrieval from external knowledge.

  • Flow:

    1. Embed Query → vector space

    2. Retrieve Documents → from vector DB using similarity search

    3. Augment Prompt → LLM receives query + retrieved context

    4. Generate Answer → grounded, up-to-date, accurate

  • Why it matters:

    • Reduces hallucination

    • Enables up-to-date, domain-specific responses

    • Keeps LLMs smaller and more efficient than fine-tuning on entire domain corpora (see the end-to-end sketch below)
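
An end-to-end sketch of the four-step flow. Here embed, vector_db, and llm are hypothetical placeholders for whatever embedding model, vector store, and LLM client you actually use:

    # End-to-end RAG flow sketch; `embed`, `vector_db`, and `llm` are
    # hypothetical placeholders, not any specific library's API.
    def answer(question, embed, vector_db, llm, k=3):
        query_vec = embed(question)                      # 1. Embed Query
        docs = vector_db.search(query_vec, top_k=k)      # 2. Retrieve Documents
        context = "\n\n".join(d.text for d in docs)      # 3. Augment Prompt
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return llm.generate(prompt)                      # 4. Generate Answer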


8. Tokenization

  • What it is: Breaking text into tokens (smaller pieces) before inputting into a model.

  • Example:

    • “ChatGPT is smart.” → [‘Chat’, ‘G’, ‘PT’, ‘ is’, ‘ smart’, ‘.’]

  • Why it matters:

    • LLMs operate on tokens, not raw text.

    • Affects context length and cost (see the sketch below).
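
A quick sketch with tiktoken, OpenAI’s open-source tokenizer; the exact split varies by model vocabulary:

    # Tokenization sketch with tiktoken; token splits differ per model.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")    # encoding used by GPT-4-era models
    ids = enc.encode("ChatGPT is smart.")
    print([enc.decode([i]) for i in ids])         # the text piece behind each token ID
    print(len(ids), "tokens")                     # drives both cost and context usage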


9. Context Window

  • What it is: The maximum number of tokens a model can consider at once.

  • LLMs have limits (e.g., GPT-4 Turbo supports a 128k-token context; earlier GPT-4 variants were limited to 8k or 32k).

  • Why it matters: Limits how much data (prompt + docs) you can include during RAG (see the budgeting sketch below).
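
A minimal budgeting sketch: trim retrieved documents so prompt plus context stays inside the window, counting tokens with tiktoken as above:

    # Fit retrieved documents into a fixed token budget before prompting.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def fit_to_window(docs, budget_tokens):
        kept, used = [], 0
        for doc in docs:                 # docs assumed sorted by relevance
            n = len(enc.encode(doc))
            if used + n > budget_tokens:
                break                    # stop before overflowing the window
            kept.append(doc)
            used += n
        return kept

    print(fit_to_window(["short doc", "a much longer document ..."], 1000))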


10. Prompt Engineering

  • What it is: Crafting input prompts to guide the LLM’s behavior.

  • In RAG: Used to incorporate retrieved documents properly.

  • Example:

    You are a Java expert. Based on the following context, answer the user’s question. Context: [...]. Question: [...]
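
The same template expressed as a small Python helper, as one might wire it into a RAG pipeline (the function name is illustrative):

    # Build the RAG prompt above from retrieved context and a user question.
    def build_prompt(context: str, question: str) -> str:
        return (
            "You are a Java expert. Based on the following context, "
            "answer the user's question.\n"
            f"Context: {context}\n"
            f"Question: {question}"
        )

    print(build_prompt("Streams were added in Java 8 ...", "When were streams added?"))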



11. Vector Databases

  • What it is: Specialized databases that store and search high-dimensional vectors.

  • Popular tools: FAISS, Pinecone, Weaviate, Qdrant

  • Role in RAG:

    • Store document embeddings

    • Retrieve semantically relevant docs during generation (see the sketch below)
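
A sketch of the storage side: keeping the original texts alongside their row positions in a FAISS index so search results map back to documents (managed vector DBs handle this bookkeeping for you):

    # Store documents next to their embeddings so results map back to text.
    import numpy as np
    import faiss

    texts = ["RAG reduces hallucination.",
             "Transformers use attention.",
             "FAISS does similarity search."]
    vectors = np.random.rand(len(texts), 64).astype("float32")  # stand-in embeddings

    index = faiss.IndexFlatL2(64)
    index.add(vectors)                        # row i in the index = texts[i]

    query = np.random.rand(1, 64).astype("float32")
    _, ids = index.search(query, 2)
    print([texts[i] for i in ids[0]])         # the retrieved documents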


12. Similarity Search

  • What it is: Finding vectors in the database closest to the query vector.

  • Common Metrics (see the sketch below):

    • Cosine Similarity

    • Dot Product

    • Euclidean Distance
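
All three metrics in a few lines of NumPy:

    # The three common similarity/distance measures side by side.
    import numpy as np

    a = np.array([0.25, -0.12, 0.83])
    b = np.array([0.20, -0.10, 0.80])

    dot = a @ b                                             # Dot Product
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # Cosine Similarity
    euclidean = np.linalg.norm(a - b)                       # Euclidean Distance
    print(dot, cosine, euclidean)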


13. Fine-tuning vs. Prompting vs. RAG

  Technique    | When to Use
  -------------|--------------------------------------------------------------
  Fine-tuning  | Adapt the model's weights to a new task, domain, or style
  Prompting    | Give quick instructions that rely on the model's existing knowledge
  RAG          | Inject external, up-to-date knowledge the model has not memorized

The overall RAG pipeline at a glance:


          ┌─────────────┐
          │  User Query │
          └─────┬───────┘
                │
                ▼
        ┌──────────────┐
        │  Embed Query │
        └─────┬────────┘
              ▼
    ┌─────────────────────┐
    │   Vector DB Search  │  ← uses cosine similarity
    └─────┬───────────────┘
          ▼
  ┌───────────────────────┐
  │  Retrieved Documents  │
  └─────┬─────────────────┘
        ▼
┌────────────────────────────┐
│ Prompt + Retrieved Context │
└─────┬──────────────────────┘
      ▼
┌────────────────┐
│     LLM        │
│  (e.g. GPT-4)  │
└─────┬──────────┘
      ▼
┌─────────────┐
│   Answer    │
└─────────────┘
