Mistral AI has just released Mistral Small 4, a 119 billion parameter Mixture-of-Experts (MoE) model that unifies instruction, reasoning, and vision in a single set of weights. For developers and agencies building AI products for clients, this model raises a very practical question: is it the right technical choice, and how do you put it into production?
This guide covers the architecture in detail, deployment options (vLLM, llama.cpp, NVIDIA NIM), benchmark results against GPT-4o-mini, Qwen 3.5-122B, and Gemma 3, along with the pitfalls to avoid.

Mistral Small 4 MoE Architecture: What Developers Need to Know
128 Experts, 4 Active Per Token: The Inference Profile
Mistral Small 4 uses a sparse MoE architecture with the following characteristics:
128 experts in the FFN (feed-forward network) layers
4 experts activated per token during inference
119B total parameters, but only ~6.5B active per forward pass
Approximately 95% compute reduction compared to an equivalent dense model
This design follows the same philosophy as DeepSeek V3 and Mixtral 8x22B: massive capacity with contained inference cost. The key difference with Mixtral 8x22B (which activated ~39B parameters) is that Small 4 is significantly more efficient per token while having comparable total capacity.
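The active-parameter figures can be sanity-checked with quick arithmetic. This sketch treats per-token compute as roughly proportional to active parameters (an approximation that ignores attention and embedding costs):

```python
# Rough per-token compute comparison: sparse MoE vs. an equivalent dense model.
# Assumption: FLOPs per token scale roughly with the number of active parameters.
TOTAL_PARAMS = 119e9   # all 128 experts
ACTIVE_PARAMS = 6.5e9  # only the 4 routed experts fire per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
compute_reduction = 1 - active_fraction

print(f"Active fraction per token: {active_fraction:.1%}")      # ~5.5%
print(f"Compute reduction vs. dense: {compute_reduction:.0%}")  # ~95%
```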
Architectural Comparison With Previous Mistral Models
| Model | Architecture | Total Params | Active Params | Context |
|---|---|---|---|---|
| Mistral Small 3.2 | Dense | 24B | 24B | 128K |
| Magistral Small | Dense | 24B | 24B | 40K |
| Devstral Small | Dense | 24B | 24B | 128K |
| Mixtral 8x22B | MoE | 141B | ~39B | 65K |
| Mistral Small 4 | MoE | 119B | ~6.5B | 256K |
The shift from 24B dense to 119B MoE with only 6.5B active is a radical change. In FLOPs per token, Small 4 is cheaper to run than Small 3.2 while carrying roughly 5 times more total knowledge capacity.
Configurable Reasoning via reasoning_effort
The model exposes a reasoning_effort parameter for per-request behavior control:
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Fast mode (no chain-of-thought)
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Summarize this document"}],
    reasoning_effort="none",
)

# Deep reasoning mode
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Analyze this contract and identify risks"}],
    reasoning_effort="high",
)
```
For an agency, this means building a client product with a single model behind it, intelligently routing simple queries to fast mode and complex tasks to reasoning mode.
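One way to implement that routing is a thin dispatch layer in front of the API. The sketch below uses a deliberately naive keyword heuristic; the `route_effort` helper and its keyword list are illustrative assumptions, not part of Mistral's API:

```python
# Minimal sketch of per-request reasoning routing behind a single model.
# A production router might use a classifier or task metadata instead of keywords.
COMPLEX_HINTS = ("analyze", "prove", "debug", "contract", "identify risks")

def route_effort(prompt: str) -> str:
    """Map a user prompt to a reasoning_effort value ("none" or "high")."""
    lowered = prompt.lower()
    return "high" if any(hint in lowered for hint in COMPLEX_HINTS) else "none"

# The chosen effort is then passed straight to the chat completion call:
# client.chat.completions.create(model=..., messages=...,
#                                reasoning_effort=route_effort(prompt))

print(route_effort("Summarize this document"))                   # none
print(route_effort("Analyze this contract and identify risks"))  # high
```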
Mistral Small 4 Technical Specifications
| Specification | Value |
|---|---|
| Model ID | `mistralai/Mistral-Small-4-119B-2603` |
| Architecture | Transformer MoE |
| Total parameters | 119B |
| Active parameters per token | ~6.5B (~8B including embeddings + output layer) |
| Experts | 128 total, 4 active per token |
| Context | 256,000 tokens (262,144 exact) |
| Inputs | Text + Image (RGB) |
| Outputs | Text |
| Reasoning | Configurable (`reasoning_effort`) |
| Function calling | Native |
| Structured JSON | Native |
| Weight format | BF16 + F8_E4M3 (FP8) |
| License | Apache 2.0 |
Mistral Small 4 Benchmarks: Results Against GPT-4o-mini, Qwen 3.5-122B, and Phi-4
Scores on Standard Benchmarks
| Benchmark | Mistral Small 4 | GPT-4o-mini | Phi-4 (14B) |
|---|---|---|---|
| GPQA Diamond | 71.2% | 40.2% | N/A |
| MMLU-Pro | 78.0% | 64.8% | N/A |
On GPQA Diamond, the model scores 71.2%, a +31 point advantage over GPT-4o-mini. On MMLU-Pro, the gap is +13 points. These benchmarks measure scientific reasoning and deep comprehension, two areas where the MoE architecture with configurable reasoning makes a difference.
Output Efficiency: Fewer Tokens, Same Quality
This point is often overlooked in comparisons but is critical for production costs:
| Test | Mistral Small 4 | Qwen 3.5-122B | Ratio |
|---|---|---|---|
| AA LCR (output length) | ~1,600 characters | ~5,800-6,100 characters | 3.5-4x less |
| LiveCodeBench (length) | 20% shorter than GPT-OSS 120B | N/A | N/A |
Shorter outputs at equal quality mean fewer billed tokens, less latency, and higher throughput. For an agency billing AI projects, this efficiency directly translates to margin.
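Converted into billed output, the gap in the table compounds quickly. The sketch below assumes billed output tokens scale linearly with output characters (a simplification, since tokenizers differ):

```python
# Back-of-the-envelope cost impact of shorter outputs.
# Assumption: billed output tokens are proportional to output characters.
MS4_CHARS = 1_600
QWEN_CHARS = 5_950  # midpoint of the 5,800-6,100 range above

ratio = QWEN_CHARS / MS4_CHARS
savings = 1 - 1 / ratio

print(f"Qwen outputs are ~{ratio:.1f}x longer")        # ~3.7x
print(f"Output-token cost reduction: ~{savings:.0%}")  # ~73%
```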
Full Comparison Table
| Feature | Mistral Small 4 | GPT-4o-mini | Phi-4 (14B) | Gemma 3 (27B) | Qwen 3.5-122B |
|---|---|---|---|---|---|
| Total params | 119B (MoE) | Unknown | 14B (dense) | 27B (dense) | 122B (MoE) |
| Active params | ~6.5B | Unknown | 14B | 27B | ~22B |
| Context | 256K | 128K | 16K | 128K | 262K |
| Vision | Yes | Yes | No | Yes | Yes |
| Configurable reasoning | Yes | No | No | No | Yes |
| Function calling | Native | Native | Yes | Yes | Yes |
| License | Apache 2.0 | Proprietary | MIT | Apache 2.0 | Apache 2.0 |
| Local deployment | Multi-GPU | API only | Single GPU | Single GPU | Multi-GPU |
| GPQA Diamond | 71.2% | 40.2% | N/A | N/A | N/A |
| MMLU-Pro | 78.0% | 64.8% | N/A | N/A | N/A |
Performance Gains Over Mistral Small 3
| Metric | Improvement | Configuration |
|---|---|---|
| End-to-end latency | 40% faster | Latency-optimized |
| Requests per second | 3x more throughput | Throughput-optimized |
These gains are directly tied to the MoE architecture: 6.5B active parameters versus 24B for Small 3, roughly 3.7 times less compute per token. Despite a model 5 times heavier in total weights, each forward pass is faster to compute once the model is loaded in memory.
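The "roughly 3.7 times" figure follows directly from the active-parameter counts:

```python
# Per-token compute ratio between dense Small 3 and MoE Small 4.
SMALL3_ACTIVE = 24e9   # dense: every parameter fires on every token
SMALL4_ACTIVE = 6.5e9  # MoE: only the 4 routed experts fire

compute_ratio = SMALL3_ACTIVE / SMALL4_ACTIVE
print(f"Small 3 does ~{compute_ratio:.1f}x more compute per token")  # ~3.7x
```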
Mistral also provides an eagle head model (Mistral-Small-4-119B-2603-eagle) for speculative decoding, further reducing generation latency.
How to Deploy Mistral Small 4: Step-by-Step Technical Guide
Option 1: vLLM (Recommended for Production)
Mistral released a dedicated Docker image with fixes for tool calling and reasoning parsing (these fixes will merge into vLLM main within 1-2 weeks):
```shell
# Mistral Docker image
docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest
```
Manual installation from the Mistral fork:
```shell
git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
uv pip install git+https://github.com/huggingface/transformers.git

# Verify mistral_common >= 1.10.0
python -c "import mistral_common; print(mistral_common.__version__)"
```
Launch command:
```shell
vllm serve mistralai/Mistral-Small-4-119B-2603 \
  --max-model-len 262144 \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 128 \
  --gpu_memory_utilization 0.8
```
The vLLM server then exposes an OpenAI-compatible API on localhost:8000.
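Because the server is launched with `--tool-call-parser mistral` and `--enable-auto-tool-choice`, function calling works through the standard OpenAI `tools` parameter. The `get_weather` tool below is a made-up illustration, not part of the model or the server:

```python
# Function-calling request shape for the vLLM OpenAI-compatible endpoint.
# "get_weather" and its schema are illustrative; any JSON-schema tool works.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With a running server (see the launch command above):
# from openai import OpenAI
# client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# response = client.chat.completions.create(
#     model="mistralai/Mistral-Small-4-119B-2603",
#     messages=[{"role": "user", "content": "What's the weather in Lyon?"}],
#     tools=tools,
#     tool_choice="auto",
# )
# response.choices[0].message.tool_calls then carries the parsed call.

print(tools[0]["function"]["name"])  # get_weather
```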
Option 2: NVIDIA NIM (Cloud or On-Prem)
Available from launch day on build.nvidia.com, with free prototyping. For production, NIM containers integrate into your existing infrastructure. The NVFP4 checkpoint is optimized for NVIDIA H100/H200/B200 GPUs and offers the best performance-to-memory ratio.
Option 3: llama.cpp and Ollama (In Progress)
At launch, llama.cpp support is under development via PR #20649 in the official repository. Once merged, community GGUF quantizations (bartowski, unsloth) will follow.
```shell
# Once GGUF support is available:
./llama-cli -hf mistralai/Mistral-Small-4-119B-2603-GGUF:Q4_K_M --jinja
```
Ollama support depends on llama.cpp stability. For now, mistral-small on Ollama points to 3.x versions.
Option 4: SGLang
SGLang announced day-0 support via @lmsysorg. It is a credible alternative to vLLM for high-concurrency workloads.
Option 5: HuggingFace Transformers
For prototyping only. FP8 weights require manual conversion to BF16. The reference code uses Mistral3ForConditionalGeneration.
```python
from transformers import AutoProcessor, Mistral3ForConditionalGeneration
# See the full snippet on the HuggingFace model card
```
Quantization: Available Options and Memory Impact
| Checkpoint | Format | Usage | Estimated VRAM |
|---|---|---|---|
| `mistralai/Mistral-Small-4-119B-2603` | BF16 + FP8 | Production (reference) | ~120-240 GB |
| `mistralai/Mistral-Small-4-119B-2603-eagle` | BF16 | Speculative decoding | ~240 GB |
| NVFP4 checkpoint | NVFP4 (4-bit) | NVIDIA-optimized production | ~60-75 GB |
| GGUF (community, upcoming) | Q4_K_M and others | Local deployment | ~60-75 GB |
The NVFP4 checkpoint is the recommended path for NVIDIA deployments. For Apple Silicon setups (M3 Max with 128 GB unified memory), GGUF Q4_K_M quantizations will be the option of choice once available.
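The VRAM column follows from bytes per parameter for each format. These are raw weight sizes only; KV cache and activation overhead come on top, which is why the table ranges run higher:

```python
# Raw weight footprint per format; runtime overhead (KV cache, activations)
# is extra, which is why deployments budget more than this.
PARAMS = 119e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4 / Q4_K_M": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt:>14}: ~{PARAMS * nbytes / 1e9:.0f} GB")
# BF16 → ~238 GB, FP8 → ~119 GB, 4-bit → ~60 GB
```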
Required Infrastructure for Self-Hosting
| Configuration | Hardware |
|---|---|
| Minimum (H100) | 4x NVIDIA HGX H100 |
| Minimum (H200) | 2x NVIDIA HGX H200 |
| Minimum (B200) | 1x NVIDIA DGX B200 |
| Recommended (H100) | 4x NVIDIA HGX H100 |
| Recommended (H200) | 4x NVIDIA HGX H200 |
| Recommended (B200) | 2x NVIDIA DGX B200 |
Tensor parallelism: the reference vLLM configuration uses --tensor-parallel-size 2 for 2-GPU setups. Larger TP sizes are supported for multi-GPU configurations.
Apple Silicon case: in full BF16, the model requires ~238 GB of RAM. With 4-bit quantization, this drops to ~60-75 GB, making the model usable on an M3 Max with 128 GB of unified memory.
When to Choose Mistral Small 4 Over GPT-4o-mini or Qwen 3.5
Choose Mistral Small 4 if:
You need configurable reasoning: GPT-4o-mini offers no reasoning mode. With Small 4, you can alternate between fast mode and deep mode without switching models.
You want a self-hostable Apache 2.0 model: GPT-4o-mini is proprietary and API-only. Small 4 can run on your infrastructure with no data sent externally.
You are building multi-tool agents: native function calling inherited from Devstral, combined with reasoning and vision, makes it a strong choice for complex agentic workflows.
Long context is critical: 256K tokens, double GPT-4o-mini's 128K. Ideal for codebase analysis, legal documents, or technical reports.
Prefer GPT-4o-mini if:
You do not need self-hosting and the simplicity of the OpenAI API is a priority
Your request volume is low and the infrastructure cost of a multi-GPU deployment is not justified
You need stable, well-documented models with a mature third-party tool ecosystem
Prefer Qwen 3.5-122B if:
You need very long context (262K) comparable to Mistral Small 4 with an alternative MoE ecosystem
Your use cases primarily target Asia (Qwen excels on Chinese and Asian benchmarks)
Note: Qwen generates significantly longer outputs, which increases production costs
Technical Limitations to Anticipate
Heavy infrastructure: 119B total parameters require at minimum 4x H100. This is not deployable on an RTX 4090 (24 GB VRAM), even with quantization. The r/LocalLLaMA community has highlighted this as the main frustration.
Incomplete llama.cpp support at launch: PR #20649 is open but not merged. No official GGUF, no working Ollama support for now.
vLLM: fixes not merged into main: you need to use the Mistral Docker image or dedicated fork for tool calling and reasoning parser. Fixes should be integrated within 1-2 weeks of launch.
Transformers: FP8 workaround required: FP8 weights (F8_E4M3) are not natively supported by HuggingFace Transformers. Manual conversion to BF16 is required.
No lightweight variant: unlike the Small 3 family (Ministral 3B/8B/14B), there is no lightweight companion model for single-GPU or edge deployment.
API pricing not published: prices on the Mistral API are not yet available. Expect a range between Small 3.1 (~$0.10/1M tokens input) and Medium 3.1 (~$0.40/1M tokens input).
Training data not documented: no information about the training dataset has been disclosed.
For Agencies: Integrating Mistral Small 4 Into Client Projects
Mistral Small 4 is particularly well suited for agencies and AI development studios for several reasons:
One model to master and maintain instead of three specialized ones, simplifying team training and project maintenance
Apache 2.0 license: you can deploy it at client sites with no license fees and create derivative products
Per-request configurable reasoning: you can design workflows where AI complexity automatically adapts to task difficulty, optimizing costs for your clients
Fine-tuning possible via Axolotl: you can create specialized versions for client verticals (legal, healthcare, finance)
The model is available now on Hugging Face, via the Mistral API, and on NVIDIA NIM.
