
Sunday, May 25, 2025

Core Architecture Concepts in RAG, LLMs & GenAI

 

1. Embeddings

  • What it is: A dense vector representation of data (e.g., words, sentences, code).

  • Why it matters: Converts discrete data (like text) into continuous numerical space that models can process.

  • Example:

    • “Dog” → [0.25, -0.12, ..., 0.83]

    • Words with similar meanings have vectors close in space (semantic similarity).

  • Used in:

    • Semantic search in RAG

    • Input for LLMs

    • Vector databases

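A minimal sketch of producing embeddings in Python, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (both are illustrative choices, not prescribed above):

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # maps text to 384-dim vectors
    vectors = model.encode(["dog", "puppy", "car"])

    print(vectors.shape)    # (3, 384): one dense vector per input text
    print(vectors[0][:5])   # first few dimensions of the "dog" vector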

2. Vector Spaces

  • What it is: A high-dimensional space where embeddings live.

  • Why it matters: Vectors allow fast similarity search using measures like cosine similarity or dot product.

  • Used in:

    • Finding relevant documents in RAG

    • Nearest neighbor searches in FAISS or similar vector DBs

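To make "close in space" concrete, here is cosine similarity in plain NumPy; the three-dimensional vectors are toy values (real embeddings have hundreds of dimensions):

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) = (a . b) / (|a| |b|); 1.0 means same direction
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    dog   = np.array([0.9, 0.1, 0.0])   # toy vectors, not real embeddings
    puppy = np.array([0.8, 0.2, 0.1])
    car   = np.array([0.0, 0.1, 0.9])

    print(cosine_similarity(dog, puppy))  # high score: semantically close
    print(cosine_similarity(dog, car))    # low score: semantically distant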

3. Attention Mechanism

  • What it is: A technique that allows the model to focus on relevant parts of the input sequence when producing output.

  • Types:

    • Self-attention: Used in Transformers; compares all tokens in a sequence to each other.

    • Cross-attention: Used in encoder-decoder models, including the original RAG architecture; decoder queries attend to the encoded retrieved documents.

  • Why it matters:

    • Solves long-range dependency problems in sequences.

    • Enables parallelism (vs. RNNs).

  • Key math:

    Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

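The formula maps directly to a few lines of NumPy; a minimal sketch with random inputs (shapes are illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # how strongly each query matches each key
        return softmax(scores) @ V        # weighted sum of the values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, d_k = 8
    print(attention(Q, K, V).shape)  # (4, 8)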

4. Transformers

  • What it is: The architecture underlying modern LLMs.

  • Components:

    • Input Embedding + Positional Encoding

    • Multi-head Attention

    • Feed-forward Neural Networks

    • Layer Normalization

    • Residual Connections

  • Why it matters: Allows LLMs to scale, understand context, and generate coherent text.

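A rough sketch of how those components compose into a single block, in NumPy with untrained random weights and one attention head (real models use multi-head attention and learned parameters):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def softmax(x):
        e = np.exp(x - x.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    def transformer_block(x, Wq, Wk, Wv, W1, W2):
        # Self-attention sub-layer, then residual connection + layer norm
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
        x = layer_norm(x + attn)
        # Feed-forward sub-layer, then residual connection + layer norm
        ff = np.maximum(0, x @ W1) @ W2   # two-layer MLP with ReLU
        return layer_norm(x + ff)

    d = 16
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, d))  # 4 token embeddings (positional encoding omitted)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    W1, W2 = rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1
    print(transformer_block(x, Wq, Wk, Wv, W1, W2).shape)  # (4, 16)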

5. Large Language Models (LLMs)

  • What it is: Neural networks (typically Transformers) trained on massive corpora to predict and generate human-like language.

  • Examples: GPT, BERT, Claude, Gemini

  • Key Traits:

    • Pretraining: On vast text data using next-token prediction or masked language modeling.

    • Fine-tuning: For specific tasks (e.g., chat, summarization).

    • Inference: Generates text one token at a time using learned probabilities.

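Token-by-token inference can be seen with the Hugging Face transformers library; a sketch using GPT-2 purely as a small stand-in model:

    # pip install transformers torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=10)  # appends one token at a time
    print(tokenizer.decode(output[0], skip_special_tokens=True))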

6. Generative AI (GenAI)

  • What it is: Any AI model that can generate new content (text, images, code, etc.).

  • In NLP:

    • Models that produce novel text based on prompts or questions.

    • LLMs are a subset of GenAI.

  • Modalities:

    • Text (GPT, Claude)

    • Code (Codex)

    • Images (DALL·E, Midjourney)

    • Video (Sora)

    • Audio (MusicGen)


7. Retrieval-Augmented Generation (RAG)

  • What it is: A hybrid GenAI method that augments LLMs with retrieval from external knowledge.

  • Flow:

    1. Embed Query → vector space

    2. Retrieve Documents → from vector DB using similarity search

    3. Augment Prompt → LLM receives query + retrieved context

    4. Generate Answer → grounded, up-to-date, accurate

  • Why it matters:

    • Reduces hallucination

    • Enables up-to-date, domain-specific responses

    • Keeps LLMs smaller and more efficient (vs. training on entire domain data)

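The four-step flow above, as a minimal end-to-end sketch; it assumes sentence-transformers and faiss-cpu are installed, the three documents are placeholders, and the final LLM call is left to whichever model you use:

    # pip install sentence-transformers faiss-cpu
    import faiss
    from sentence_transformers import SentenceTransformer

    docs = [
        "RAG retrieves documents before generation.",
        "FAISS performs fast nearest-neighbor search.",
        "Transformers rely on self-attention.",
    ]  # stand-in corpus

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = encoder.encode(docs, normalize_embeddings=True)

    index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on unit vectors
    index.add(doc_vecs)

    # 1. Embed query
    query = "How does RAG reduce hallucination?"
    q_vec = encoder.encode([query], normalize_embeddings=True)

    # 2. Retrieve documents
    scores, ids = index.search(q_vec, 2)
    context = "\n".join(docs[i] for i in ids[0])

    # 3. Augment prompt (step 4 would send this prompt to the LLM)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    print(prompt)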

8. Tokenization

  • What it is: Breaking text into tokens (smaller pieces) before inputting into a model.

  • Example:

    • “ChatGPT is smart.” → [‘Chat’, ‘G’, ‘PT’, ‘ is’, ‘ smart’, ‘.’]

  • Why it matters:

    • LLMs operate on tokens, not raw text.

    • Affects context length and cost.

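A quick way to inspect tokens, assuming OpenAI's tiktoken library (the encoding name is one example; every model family ships its own tokenizer, so the exact splits vary):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("ChatGPT is smart.")
    print(ids)                               # integer token IDs
    print([enc.decode([i]) for i in ids])    # the token strings behind those IDs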

9. Context Window

  • What it is: The maximum number of tokens a model can consider at once.

  • LLMs have fixed limits (e.g., GPT-4 Turbo supports a 128k-token context window).

  • Why it matters: Limits how much data (prompt + docs) you can include during RAG.

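In RAG this usually means budgeting: count tokens and keep only as many retrieved documents as fit. A sketch using tiktoken for counting (the 8,000-token budget is an arbitrary example):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def fit_docs(prompt, docs, budget=8000):
        # Greedily keep documents until the token budget is exhausted.
        used = len(enc.encode(prompt))
        kept = []
        for doc in docs:
            cost = len(enc.encode(doc))
            if used + cost > budget:
                break
            kept.append(doc)
            used += cost
        return kept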

10. Prompt Engineering

  • What it is: Crafting input prompts to guide the LLM’s behavior.

  • In RAG: Used to incorporate retrieved documents properly.

  • Example:

    You are a Java expert. Based on the following context, answer the user’s question. Context: [...]. Question: [...]

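In code, that template is often plain string formatting; a sketch (the function name and inputs are illustrative):

    def build_prompt(context_docs, question):
        context = "\n\n".join(context_docs)
        return (
            "You are a Java expert. Based on the following context, "
            "answer the user's question.\n"
            f"Context: {context}\n"
            f"Question: {question}"
        )

    print(build_prompt(["Streams were added in Java 8."], "When were streams added?"))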


11. Vector Databases

  • What it is: Specialized databases that store and search high-dimensional vectors.

  • Popular tools: FAISS, Pinecone, Weaviate, Qdrant

  • Role in RAG:

    • Store document embeddings

    • Retrieve semantically relevant docs during generation

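A minimal FAISS sketch, with random vectors standing in for document embeddings:

    # pip install faiss-cpu
    import faiss
    import numpy as np

    d = 128                                             # embedding dimension
    rng = np.random.default_rng(0)
    doc_vecs = rng.random((1000, d), dtype=np.float32)  # stand-in embeddings

    index = faiss.IndexFlatL2(d)   # exact L2 search; FAISS also offers ANN indexes
    index.add(doc_vecs)

    query = rng.random((1, d), dtype=np.float32)
    distances, ids = index.search(query, 5)  # 5 nearest neighbors
    print(ids[0])                            # positions of the closest documents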

12. Similarity Search

  • What it is: Finding the vectors in the database closest to the query vector.

  • Common Metrics:

    • Cosine Similarity

    • Dot Product

    • Euclidean Distance

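The three metrics side by side in NumPy; note that on unit-length vectors, cosine similarity and dot product produce the same ranking:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 3.0, 4.0])

    cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # higher = closer
    dot       = np.dot(a, b)                                            # higher = closer
    euclidean = np.linalg.norm(a - b)                                   # lower  = closer

    print(cosine, dot, euclidean)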

13. Fine-tuning vs. Prompting vs. RAG

  Technique     When to Use
  -----------   --------------------------------------------------------------
  Fine-tuning   You want the model to deeply learn a new task, domain, or style
  Prompting     Quick instructions using the model's existing knowledge
  RAG           Inject external, non-memorized knowledge


          ┌─────────────┐
          │  User Query │
          └─────┬───────┘
                │
                ▼
        ┌──────────────┐
        │  Embed Query │
        └─────┬────────┘
              ▼
    ┌─────────────────────┐
    │   Vector DB Search  │  ←— uses cosine similarity
    └─────┬───────────────┘
          ▼
  ┌───────────────────────┐
  │  Retrieved Documents  │
  └─────┬─────────────────┘
        ▼
┌────────────────────────────┐
│ Prompt + Retrieved Context │
└─────┬──────────────────────┘
      ▼
┌────────────────┐
│     LLM        │
│  (e.g. GPT-4)  │
└─────┬──────────┘
      ▼
┌─────────────┐
│   Answer    │
└─────────────┘
