Mistral AI has just released Mistral Small 4, a 119-billion-parameter Mixture-of-Experts (MoE) model that unifies instruction following, reasoning, and vision in a single set of weights. For developers and agencies building AI products for clients, this model raises a very practical question: is it the right technical choice, and how do you put it into production?

This guide covers the architecture in detail, deployment options (vLLM, llama.cpp, NVIDIA NIM), benchmark results against GPT-4o-mini, Qwen 3.5-122B, and Gemma 3, along with the pitfalls to avoid.

Mistral Small 4 MoE Architecture: What Developers Need to Know

128 Experts, 4 Active Per Token: The Inference Profile

Mistral Small 4 uses a sparse MoE architecture with the following characteristics:

  • 128 experts in the FFN (feed-forward network) layers

  • 4 experts activated per token during inference

  • 119B total parameters, but only ~6.5B active per forward pass

  • Approximately 95% compute reduction compared to an equivalent dense model

This design follows the same philosophy as DeepSeek V3 and Mixtral 8x22B: massive capacity at contained inference cost. The key difference from Mixtral 8x22B (which activated ~39B parameters per token) is that Small 4 is significantly more efficient per token while offering comparable total capacity.
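The routing described above is easy to sketch. This toy version is illustrative only, not Mistral's actual router (the gating function, names, and dimensions are assumptions); it shows why only 4 of the 128 expert FFNs run for a given token:

```python
import numpy as np

def moe_ffn(x, gate_w, experts, k=4):
    """Toy sparse-MoE FFN layer: route one token to its top-k experts.

    x: (d,) token hidden state; gate_w: (n_experts, d) router weights;
    experts: list of per-expert FFN callables.
    """
    scores = gate_w @ x                      # one router score per expert
    top = np.argsort(scores)[-k:]            # keep the k highest-scoring experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over the selected experts only
    # Only these k expert FFNs execute; the other n - k are skipped entirely.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 128
gate_w = rng.normal(size=(n_experts, d))
experts = [(lambda x, W=rng.normal(size=(d, d)): W @ x) for _ in range(n_experts)]
out = moe_ffn(rng.normal(size=d), gate_w, experts, k=4)
print(out.shape)  # (16,)
```

Per token, the compute is k expert FFNs plus the router, regardless of how many experts exist in total; that is where the ~95% compute reduction comes from.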

Architectural Comparison With Previous Mistral Models

| Model | Architecture | Total Params | Active Params | Context |
|---|---|---|---|---|
| Mistral Small 3.2 | Dense | 24B | 24B | 128K |
| Magistral Small | Dense | 24B | 24B | 40K |
| Devstral Small | Dense | 24B | 24B | 128K |
| Mixtral 8x22B | MoE | 141B | ~39B | 65K |
| Mistral Small 4 | MoE | 119B | ~6.5B | 256K |

The shift from 24B dense to 119B MoE with only 6.5B active is a radical change. In terms of FLOPs per token, Small 4 is cheaper to run than Small 3.2 while holding roughly 5 times more knowledge capacity.

Configurable Reasoning via reasoning_effort

The model exposes a reasoning_effort parameter for per-request behavior control:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Fast mode (no chain-of-thought)
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Summarize this document"}],
    reasoning_effort="none",
)

# Deep reasoning mode
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Analyze this contract and identify risks"}],
    reasoning_effort="high",
)

For an agency, this means a client product can run on a single model, with simple queries routed to fast mode and complex tasks to deep reasoning mode.
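A minimal routing sketch, assuming a keyword heuristic (the COMPLEX_HINTS list is purely illustrative; a production router would use a classifier, or the model itself in fast mode, to score task difficulty):

```python
# Hypothetical request router: pick reasoning_effort per request.
COMPLEX_HINTS = ("analyze", "prove", "debug", "audit", "contract")

def pick_effort(user_message: str) -> str:
    """Return the reasoning_effort value to send with this request."""
    msg = user_message.lower()
    return "high" if any(hint in msg for hint in COMPLEX_HINTS) else "none"

print(pick_effort("Summarize this document"))               # none
print(pick_effort("Analyze this contract and find risks"))  # high
```

The returned value plugs straight into the `reasoning_effort` field of the chat completion request.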

Mistral Small 4 Technical Specifications

| Specification | Value |
|---|---|
| Model ID | mistralai/Mistral-Small-4-119B-2603 |
| Architecture | Transformer MoE |
| Total parameters | 119B |
| Active parameters per token | ~6.5B (~8B including embeddings + output layer) |
| Experts | 128 total, 4 active per token |
| Context | 256,000 tokens (262,144 exact) |
| Inputs | Text + Image (RGB) |
| Outputs | Text |
| Reasoning | Configurable (reasoning_effort: none / high) |
| Function calling | Native |
| Structured JSON | Native |
| Weight format | BF16 + F8_E4M3 (FP8) |
| License | Apache 2.0 |

Mistral Small 4 Benchmarks: Results Against GPT-4o-mini, Qwen 3.5-122B, and Phi-4

Scores on Standard Benchmarks

| Benchmark | Mistral Small 4 | GPT-4o-mini | Phi-4 (14B) |
|---|---|---|---|
| GPQA Diamond | 71.2% | 40.2% | N/A |
| MMLU-Pro | 78.0% | 64.8% | N/A |

On GPQA Diamond, the model scores 71.2%, a +31 point advantage over GPT-4o-mini. On MMLU-Pro, the gap is +13 points. These benchmarks measure scientific reasoning and deep comprehension, two areas where the MoE architecture with configurable reasoning makes a difference.

Output Efficiency: Fewer Tokens, Same Quality

This point is often overlooked in comparisons but is critical for production costs:

| Test | Mistral Small 4 | Qwen 3.5-122B | Ratio |
|---|---|---|---|
| AA LCR (output length) | ~1,600 characters | ~5,800-6,100 characters | 3.5-4x less |
| LiveCodeBench (length) | 20% shorter than GPT-OSS 120B | N/A | N/A |

Shorter outputs at equal quality mean fewer billed tokens, less latency, and higher throughput. For an agency billing AI projects, this efficiency directly translates to margin.
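Back-of-envelope, with a purely hypothetical output price (Mistral's API pricing is not yet published) and an assumed ~4 characters per token:

```python
PRICE_PER_1M_OUTPUT = 0.30  # hypothetical $/1M output tokens, illustration only

def output_cost(chars: int, chars_per_token: float = 4.0) -> float:
    """Cost of one response, given its length in characters."""
    tokens = chars / chars_per_token
    return tokens / 1_000_000 * PRICE_PER_1M_OUTPUT

small4 = output_cost(1_600)  # ~1,600-char answers (AA LCR above)
qwen = output_cost(5_950)    # midpoint of the ~5,800-6,100-char range
print(f"{qwen / small4:.1f}x more output spend per answer")
```

Note that the price cancels out of the ratio: whatever the final per-token rate, a ~3.7x shorter answer costs ~3.7x less to generate.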

Full Comparison Table

| Feature | Mistral Small 4 | GPT-4o-mini | Phi-4 (14B) | Gemma 3 (27B) | Qwen 3.5-122B |
|---|---|---|---|---|---|
| Total params | 119B (MoE) | Unknown | 14B (dense) | 27B (dense) | 122B (MoE) |
| Active params | ~6.5B | Unknown | 14B | 27B | ~22B |
| Context | 256K | 128K | 16K | 128K | 262K |
| Vision | Yes | Yes | No | Yes | Yes |
| Configurable reasoning | Yes | No | No | No | Yes |
| Function calling | Native | Native | Yes | Yes | Yes |
| License | Apache 2.0 | Proprietary | MIT | Apache 2.0 | Apache 2.0 |
| Local deployment | Multi-GPU | API only | Single GPU | Single GPU | Multi-GPU |
| GPQA Diamond | 71.2% | 40.2% | N/A | N/A | N/A |
| MMLU-Pro | 78.0% | 64.8% | N/A | N/A | N/A |

Performance Gains Over Mistral Small 3

| Metric | Improvement | Configuration |
|---|---|---|
| End-to-end latency | 40% faster | Latency-optimized |
| Requests per second | 3x more throughput | Throughput-optimized |

These gains come directly from the MoE architecture: 6.5B active parameters versus 24B for Small 3, roughly 3.7 times less compute per token. Although the model is 5 times heavier in total weights, each forward pass is cheaper to compute once the weights are loaded in memory.

Mistral also provides an eagle head model (Mistral-Small-4-119B-2603-eagle) for speculative decoding, further reducing generation latency.
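Speculative decoding itself is model-agnostic and easy to sketch. In this toy version both "models" are stand-in functions (the real setup pairs the eagle head as drafter with Small 4 as verifier, and verifies the draft in one batched pass rather than sequentially):

```python
import random

random.seed(0)

def draft_propose(ctx, k=4):
    """Cheap draft model: guesses the next k tokens (random stand-in)."""
    return [random.randint(0, 9) for _ in range(k)]

def target_next(ctx):
    """Big target model's greedy next token (deterministic stand-in)."""
    return (sum(ctx) * 7 + 3) % 10

def speculative_step(ctx, k=4):
    """Check the draft's k guesses; keep the agreeing prefix plus one fix."""
    accepted = []
    for tok in draft_propose(ctx, k):
        if target_next(ctx + accepted) == tok:
            accepted.append(tok)  # draft was right: token accepted for free
        else:
            accepted.append(target_next(ctx + accepted))  # target's correction
            break
    return accepted

print(speculative_step([1, 2, 3]))
```

Each step emits at least one token (so quality never drops below the target model alone) and up to k+ tokens when the drafter guesses well, which is where the latency win comes from.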

How to Deploy Mistral Small 4: Step-by-Step Technical Guide

Option 1: vLLM (Recommended for Production)

Mistral released a dedicated Docker image with fixes for tool calling and reasoning parsing (these fixes will merge into vLLM main within 1-2 weeks):

# Mistral Docker image
docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest

Manual installation from the Mistral fork:

git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
uv pip install git+https://github.com/huggingface/transformers.git

# Verify mistral_common >= 1.10.0
python -c "import mistral_common; print(mistral_common.__version__)"

Launch command:

vllm serve mistralai/Mistral-Small-4-119B-2603 \
  --max-model-len 262144 \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 128 \
  --gpu_memory_utilization 0.8

The vLLM server then exposes an OpenAI-compatible API on localhost:8000.
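Function calling then works through the same endpoint. A sketch of the request shape, where the `get_weather` tool is made up for illustration and `client` is assumed to be an `openai.OpenAI` instance pointed at the local server:

```python
import json

# Hypothetical tool declaration for Small 4's native function calling.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # made-up example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def ask_with_tools(client, user_msg):
    """client: openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")."""
    return client.chat.completions.create(
        model="mistralai/Mistral-Small-4-119B-2603",
        messages=[{"role": "user", "content": user_msg}],
        tools=tools,
        tool_choice="auto",  # requires --enable-auto-tool-choice on the server
    )

print(json.dumps(tools, indent=2)[:40])
```

When the model decides to call the tool, the response carries `tool_calls` in the standard OpenAI format, so existing agent loops work unchanged.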

Option 2: NVIDIA NIM (Cloud or On-Prem)

Available from launch day on build.nvidia.com, with free prototyping. For production, NIM containers integrate into your existing infrastructure. The NVFP4 checkpoint is optimized for NVIDIA H100/H200/B200 GPUs and offers the best performance-to-memory ratio.

Option 3: llama.cpp and Ollama (In Progress)

At launch, llama.cpp support is under development via PR #20649 in the official repository. Once merged, community GGUF quantizations (bartowski, unsloth) will follow.

# Once GGUF support is available:
./llama-cli -hf mistralai/Mistral-Small-4-119B-2603-GGUF:Q4_K_M --jinja

Ollama support depends on llama.cpp stability. For now, mistral-small on Ollama points to 3.x versions.

Option 4: SGLang

SGLang announced day-0 support via @lmsysorg. It is a credible alternative to vLLM for high-concurrency workloads.

Option 5: HuggingFace Transformers

For prototyping only. FP8 weights require manual conversion to BF16. The reference code uses Mistral3ForConditionalGeneration.

from transformers import AutoProcessor, Mistral3ForConditionalGeneration

# See the full snippet on the HuggingFace model card

Quantization: Available Options and Memory Impact

| Checkpoint | Format | Usage | Estimated VRAM |
|---|---|---|---|
| Mistral-Small-4-119B-2603 | BF16 + FP8 | Production (reference) | ~120-240 GB |
| Mistral-Small-4-119B-2603-eagle | BF16 | Speculative decoding | ~240 GB |
| Mistral-Small-4-119B-2603-NVFP4 | NVFP4 (4-bit) | NVIDIA optimized production | ~60-75 GB |
| GGUF (community, upcoming) | Q4_K_M and others | Local deployment | ~60-75 GB |

The NVFP4 checkpoint is the recommended path for NVIDIA deployments. For Apple Silicon setups (M3 Max with 128 GB unified memory), GGUF Q4_K_M quantizations will be the option of choice once available.

Required Infrastructure for Self-Hosting

| Configuration | Hardware |
|---|---|
| Minimum (H100) | 4x NVIDIA HGX H100 |
| Minimum (H200) | 2x NVIDIA HGX H200 |
| Minimum (B200) | 1x NVIDIA DGX B200 |
| Recommended (H100) | 4x NVIDIA HGX H100 |
| Recommended (H200) | 4x NVIDIA HGX H200 |
| Recommended (B200) | 2x NVIDIA DGX B200 |

Tensor parallelism: the reference vLLM configuration uses --tensor-parallel-size 2 for 2-GPU setups. Larger TP sizes are supported for multi-GPU configurations.

Apple Silicon case: in full BF16, the model requires ~238 GB of RAM. With 4-bit quantization, this drops to ~60-75 GB, making the model usable on an M3 Max with 128 GB of unified memory.
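The arithmetic behind those figures is simply parameters times bytes per parameter, ignoring KV cache and runtime overhead:

```python
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Raw weight memory in GB: params x bits / 8 (no KV cache or activations)."""
    return params_billions * bits_per_param / 8

print(weight_gb(119, 16))  # BF16:  238.0 GB, the figure quoted above
print(weight_gb(119, 4))   # 4-bit: 59.5 GB, before quantization overhead
```

Real quantized checkpoints land a bit above the raw number because some layers (embeddings, router, norms) are usually kept at higher precision, hence the ~60-75 GB range cited earlier.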

When to Choose Mistral Small 4 Over GPT-4o-mini or Qwen 3.5

Choose Mistral Small 4 if:

  • You need configurable reasoning: GPT-4o-mini offers no reasoning mode. With Small 4, you can alternate between fast mode and deep mode without switching models.

  • You want a self-hostable Apache 2.0 model: GPT-4o-mini is proprietary and API-only. Small 4 can run on your infrastructure with no data sent externally.

  • You are building multi-tool agents: native function calling inherited from Devstral, combined with reasoning and vision, makes it a strong choice for complex agentic workflows.

  • Long context is critical: 256K tokens, double GPT-4o-mini (128K). Ideal for codebase analysis, legal documents, or technical reports.

Prefer GPT-4o-mini if:

  • You do not need self-hosting and the simplicity of the OpenAI API is a priority

  • Your request volume is low and the infrastructure cost of a multi-GPU deployment is not justified

  • You need stable, well-documented models with a mature third-party tool ecosystem

Prefer Qwen 3.5-122B if:

  • You need very long context (262K) comparable to Mistral Small 4 with an alternative MoE ecosystem

  • Your use cases primarily target Asia (Qwen excels on Chinese and Asian benchmarks)

  • Note: Qwen generates significantly longer outputs, which increases production costs

Technical Limitations to Anticipate

  1. Heavy infrastructure: 119B total parameters require at minimum 4x H100. This is not deployable on an RTX 4090 (24 GB VRAM), even with quantization. The r/LocalLLaMA community has highlighted this as the main frustration.

  2. Incomplete llama.cpp support at launch: PR #20649 is open but not merged. No official GGUF and no working Ollama support for now.

  3. vLLM fixes not yet merged into main: you need the Mistral Docker image or the dedicated fork for tool calling and the reasoning parser. Fixes should land within 1-2 weeks of launch.

  4. Transformers requires an FP8 workaround: FP8 weights (F8_E4M3) are not natively supported by HuggingFace Transformers; manual conversion to BF16 is required.

  5. No lightweight variant: unlike the Small 3 family (Ministral 3B/8B/14B), there is no lightweight companion model for single-GPU or edge deployment.

  6. API pricing not published: prices on the Mistral API are not yet available. Expect a range between Small 3.1 (~$0.10/1M input tokens) and Medium 3.1 (~$0.40/1M input tokens).

  7. Training data not documented: no information about the training dataset has been disclosed.

For Agencies: Integrating Mistral Small 4 Into Client Projects

Mistral Small 4 is particularly well suited for agencies and AI development studios for several reasons:

  • One model to master and maintain instead of three specialized ones, simplifying team training and project maintenance

  • Apache 2.0 license: you can deploy it at client sites with no license fees and create derivative products

  • Per-request configurable reasoning: you can design workflows where AI complexity automatically adapts to task difficulty, optimizing costs for your clients

  • Fine-tuning possible via Axolotl: you can create specialized versions for client verticals (legal, healthcare, finance)

The model is available now on Hugging Face, via the Mistral API, and on NVIDIA NIM.

Want to automate?

Free 30-min audit. We identify your 3 AI quick wins.

Book a free audit →