1. Embeddings
What it is: A dense vector representation of data (e.g., words, sentences, code).
Why it matters: Converts discrete data (like text) into continuous numerical space that models can process.
Example:
“Dog” → [0.25, -0.12, ..., 0.83]
Words with similar meanings have vectors close in space (semantic similarity).
Used in:
Semantic search in RAG
Input for LLMs
Vector databases
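A minimal sketch of producing embeddings, assuming the sentence-transformers library (the model name here is one common choice, not the only option):

```python
# Sketch: turning text into dense vectors with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dim vectors
vectors = model.encode(["dog", "puppy", "airplane"])

print(vectors.shape)  # (3, 384): one dense vector per input string
```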
2. Vector Spaces
What it is: A high-dimensional space where embeddings live.
Why it matters: Vectors allow fast similarity search using measures like cosine similarity or dot product.
Used in:
Finding relevant documents in RAG
Nearest neighbor searches in FAISS or similar vector DBs
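A toy illustration of "close in space", with made-up 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors; values are invented purely for illustration.
dog      = np.array([0.8, 0.1, 0.0, 0.3])
puppy    = np.array([0.7, 0.2, 0.1, 0.4])
airplane = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(dog, puppy))     # high: semantically close
print(cosine_similarity(dog, airplane))  # low: semantically distant
```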
3. Attention Mechanism
What it is: A technique that allows the model to focus on relevant parts of the input sequence when producing output.
Types:
Self-attention: Used in Transformers; compares all tokens in a sequence to each other.
Cross-attention: Used in encoder–decoder models and the original RAG architecture; decoder queries attend to a different sequence, such as retrieved documents.
Why it matters:
Solves long-range dependency problems in sequences.
Enables parallelism (vs. RNNs).
Key math:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
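The formula above (d_k is the dimensionality of the key vectors), implemented directly in NumPy as a sketch:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                    # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores)            # rows sum to 1: how much to attend where
    return weights @ V                   # weighted sum of value vectors

# 3 tokens, d_k = d_v = 4; random values stand in for learned projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```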
4. Transformers
What it is: The architecture underlying modern LLMs.
Components:
Input Embedding + Positional Encoding
Multi-head Attention
Feed-forward Neural Networks
Layer Normalization
Residual Connections
Why it matters: Allows LLMs to scale, understand context, and generate coherent text.
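A sketch wiring these components together via PyTorch's built-in encoder layer; the hyperparameters are illustrative, not from any particular model:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 10

layer = nn.TransformerEncoderLayer(
    d_model=d_model,      # embedding size
    nhead=n_heads,        # multi-head attention
    dim_feedforward=256,  # feed-forward network width
    batch_first=True,     # (batch, seq, features)
)  # layer normalization and residual connections are applied internally

x = torch.randn(1, seq_len, d_model)  # stand-in for embedded + position-encoded input
print(layer(x).shape)                 # torch.Size([1, 10, 64])
```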
5. Large Language Models (LLMs)
What it is: Neural networks (typically Transformers) trained on massive corpora to predict and generate human-like language.
Examples: GPT, BERT, Claude, Gemini
Key Traits:
Pretraining: On vast text data using next-token prediction or masked language modeling.
Fine-tuning: For specific tasks (e.g., chat, summarization).
Inference: Generates text one token at a time using learned probabilities.
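A sketch of that token-by-token inference loop, using GPT-2 from Hugging Face transformers as a small stand-in for any causal LLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                      # generate 5 tokens greedily
        logits = model(ids).logits          # scores over the whole vocabulary
        next_id = logits[0, -1].argmax()    # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```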
6. Generative AI (GenAI)
What it is: Any AI model that can generate new content (text, images, code, etc.).
In NLP:
Models that produce novel text based on prompts or questions.
LLMs are a subset of GenAI.
Modalities:
Text (GPT, Claude)
Code (Codex)
Images (DALL·E, Midjourney)
Video (Sora)
Audio (MusicGen)
7. Retrieval-Augmented Generation (RAG)
What it is: A hybrid GenAI method that augments LLMs with retrieval from external knowledge.
Flow:
Embed Query → vector space
Retrieve Documents → from vector DB using similarity search
Augment Prompt → LLM receives query + retrieved context
Generate Answer → grounded, up-to-date, accurate
Why it matters:
Reduces hallucination
Enables up-to-date, domain-specific responses
Keeps LLMs smaller and more efficient (vs. training on entire domain data)
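A toy end-to-end sketch of the flow above; the corpus and embedding model are illustrative, and the final generation step is left to whichever LLM API you use:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Java 21 added virtual threads as a stable feature.",
    "Python uses indentation to define blocks.",
    "FAISS is a library for efficient similarity search.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# 1. Embed query  2. Retrieve by similarity  3. Augment prompt
query = "What is new in Java 21?"
q_vec = model.encode([query], normalize_embeddings=True)[0]
best = int(np.argmax(doc_vecs @ q_vec))  # cosine similarity via dot product

prompt = f"Answer using the context.\nContext: {docs[best]}\nQuestion: {query}"
print(prompt)  # 4. Generate: send this prompt to an LLM
```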
8. Tokenization
What it is: Breaking text into tokens (smaller pieces) before inputting into a model.
Example:
“ChatGPT is smart.” → [‘Chat’, ‘G’, ‘PT’, ‘ is’, ‘ smart’, ‘.’]
Why it matters:
LLMs operate on tokens, not raw text.
Affects context length and cost.
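A sketch using OpenAI's tiktoken library; token boundaries differ between models:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models
tokens = enc.encode("ChatGPT is smart.")
print(tokens)                               # list of integer token ids
print([enc.decode([t]) for t in tokens])    # the text piece behind each id
```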
9. Context Window
What it is: The maximum number of tokens a model can consider at once.
Every LLM has a fixed limit (e.g., GPT-4 Turbo supports a 128k-token context window).
Why it matters: Limits how much data (prompt + docs) you can include during RAG.
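A sketch of respecting that limit when packing retrieved chunks into a RAG prompt; the 8,000-token budget and helper name are arbitrary examples:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(chunks: list[str], budget: int = 8000) -> list[str]:
    """Keep retrieved chunks, in order, until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept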
10. Prompt Engineering
What it is: Crafting input prompts to guide the LLM’s behavior.
In RAG: Used to incorporate retrieved documents properly.
Example:
You are a Java expert. Based on the following context, answer the user’s question. Context: [...]. Question: [...]
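The same prompt as a reusable template, a minimal sketch (the field names are arbitrary):

```python
RAG_TEMPLATE = (
    "You are a Java expert. Based on the following context, "
    "answer the user's question.\n"
    "Context: {context}\n"
    "Question: {question}"
)

prompt = RAG_TEMPLATE.format(
    context="Java 21 added virtual threads.",
    question="What are virtual threads?",
)
print(prompt)
```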
11. Vector Databases
What it is: Specialized databases that store and search high-dimensional vectors.
Popular tools: FAISS, Pinecone, Weaviate, Qdrant
Role in RAG:
Store document embeddings
Retrieve semantically relevant docs during generation
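A minimal FAISS sketch, with random vectors standing in for real document embeddings:

```python
import faiss
import numpy as np

d = 384                                  # embedding dimensionality
index = faiss.IndexFlatIP(d)             # exact inner-product index

doc_vecs = np.random.rand(1000, d).astype("float32")
faiss.normalize_L2(doc_vecs)             # normalized IP == cosine similarity
index.add(doc_vecs)                      # store document embeddings

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)     # top-5 most similar documents
print(ids[0], scores[0])
```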
12. Similarity Search
What it is: Finding vectors in the database closest to the query vector.
Common Metrics:
Cosine Similarity
Dot Product
Euclidean Distance
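How the three metrics compare on two toy vectors (higher dot product/cosine means closer; lower Euclidean distance means closer):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

dot = a @ b                                              # dot product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
euclidean = np.linalg.norm(a - b)                        # Euclidean distance

print(dot, cosine, euclidean)
```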
13. Fine-tuning vs. Prompting vs. RAG
| Technique | When to Use |
| --- | --- |
| Fine-tuning | You want to adapt a pretrained model to a new task, style, or domain (it builds on pretraining, not training from scratch) |
| Prompting | Quick instructions that use the model's existing knowledge |
| RAG | Inject external, non-memorized knowledge at query time |