Mistral AI has just released Mistral Small 4, a 119 billion parameter Mixture-of-Experts (MoE) model that unifies instruction, reasoning, and vision in a single set of weights. For developers and agencies building AI products for clients, this model raises a very practical question: is it the right technical choice, and how do you put it into production?
This guide covers the architecture in detail, deployment options (vLLM, llama.cpp, NVIDIA NIM), benchmark results against GPT-4o-mini, Qwen 3.5-122B, and Gemma 3, along with the pitfalls to avoid.

Mistral Small 4 MoE Architecture: What Developers Need to Know
128 Experts, 4 Active Per Token: The Inference Profile
Mistral Small 4 uses a sparse MoE architecture with the following characteristics:
128 experts in the FFN (feed-forward network) layers
4 experts activated per token during inference
119B total parameters, but only ~6.5B active per forward pass
Approximately 95% compute reduction compared to an equivalent dense model
This design follows the same philosophy as DeepSeek V3 and Mixtral 8x22B: massive capacity with contained inference cost. The key difference with Mixtral 8x22B (which activated ~39B parameters) is that Small 4 is significantly more efficient per token while having comparable total capacity.
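The active-parameter figures can be sanity-checked with quick arithmetic. This sketch treats per-token compute as roughly proportional to active parameters (an approximation that ignores attention and embedding costs):

```python
# Rough per-token compute comparison: sparse MoE vs. an equivalent dense model.
# Assumption: FLOPs per token scale roughly with the number of active parameters.
TOTAL_PARAMS = 119e9   # all 128 experts
ACTIVE_PARAMS = 6.5e9  # only the 4 routed experts fire per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
compute_reduction = 1 - active_fraction

print(f"Active fraction per token: {active_fraction:.1%}")      # ~5.5%
print(f"Compute reduction vs. dense: {compute_reduction:.0%}")  # ~95%
```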
Architectural Comparison With Previous Mistral Models
| Model | Architecture | Total Params | Active Params | Context |
|---|---|---|---|---|
| Mistral Small 3.2 | Dense | 24B | 24B | 128K |
| Magistral Small | Dense | 24B | 24B | 40K |
| Devstral Small | Dense | 24B | 24B | 128K |
| Mixtral 8x22B | MoE | 141B | ~39B | 65K |
| Mistral Small 4 | MoE | 119B | ~6.5B | 256K |
The shift from 24B dense to 119B MoE with only 6.5B active is a radical change. In FLOPs per token, Small 4 is cheaper to run than Small 3.2 while carrying roughly 5 times more total knowledge capacity.
Configurable Reasoning via reasoning_effort
The model exposes a reasoning_effort parameter for per-request behavior control:
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Fast mode (no chain-of-thought)
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Summarize this document"}],
    reasoning_effort="none",
)

# Deep reasoning mode
response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Analyze this contract and identify risks"}],
    reasoning_effort="high",
)
```
For an agency, this means building a client product with a single model behind it, intelligently routing simple queries to fast mode and complex tasks to reasoning mode.
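One way to implement that routing is a thin dispatch layer in front of the API. The sketch below uses a deliberately naive keyword heuristic; the `route_effort` helper and its keyword list are illustrative assumptions, not part of Mistral's API:

```python
# Minimal sketch of per-request reasoning routing behind a single model.
# A production router might use a classifier or task metadata instead of keywords.
COMPLEX_HINTS = ("analyze", "prove", "debug", "contract", "identify risks")

def route_effort(prompt: str) -> str:
    """Map a user prompt to a reasoning_effort value ("none" or "high")."""
    lowered = prompt.lower()
    return "high" if any(hint in lowered for hint in COMPLEX_HINTS) else "none"

# The chosen effort is then passed straight to the chat completion call:
# client.chat.completions.create(model=..., messages=...,
#                                reasoning_effort=route_effort(prompt))

print(route_effort("Summarize this document"))                   # none
print(route_effort("Analyze this contract and identify risks"))  # high
```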
Mistral Small 4 Technical Specifications
| Specification | Value |
|---|---|
| Model ID | `mistralai/Mistral-Small-4-119B-2603` |
| Architecture | Transformer MoE |
| Total parameters | 119B |
| Active parameters per token | ~6.5B (~8B including embeddings + output layer) |
| Experts | 128 total, 4 active per token |
| Context | 256,000 tokens (262,144 exact) |
| Inputs | Text + Image (RGB) |
| Outputs | Text |
| Reasoning | Configurable (`reasoning_effort`) |
| Function calling | Native |
| Structured JSON | Native |
| Weight format | BF16 + F8_E4M3 (FP8) |
| License | Apache 2.0 |
Mistral Small 4 Benchmarks: Results Against GPT-4o-mini, Qwen 3.5-122B, and Phi-4
Scores on Standard Benchmarks
| Benchmark | Mistral Small 4 | GPT-4o-mini | Phi-4 (14B) |
|---|---|---|---|
| GPQA Diamond | 71.2% | 40.2% | N/A |
| MMLU-Pro | 78.0% | 64.8% | N/A |
On GPQA Diamond, the model scores 71.2%, a +31 point advantage over GPT-4o-mini. On MMLU-Pro, the gap is +13 points. These benchmarks measure scientific reasoning and deep comprehension, two areas where the MoE architecture with configurable reasoning makes a difference.
Output Efficiency: Fewer Tokens, Same Quality
This point is often overlooked in comparisons but is critical for production costs:
| Test | Mistral Small 4 | Qwen 3.5-122B | Ratio |
|---|---|---|---|
| AA LCR (output length) | ~1,600 characters | ~5,800-6,100 characters | 3.5-4x less |
| LiveCodeBench (length) | 20% shorter than GPT-OSS 120B | N/A | N/A |
Shorter outputs at equal quality mean fewer billed tokens, less latency, and higher throughput. For an agency billing AI projects, this efficiency directly translates to margin.
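Converted into billed output, the gap in the table compounds quickly. The sketch below assumes billed output tokens scale linearly with output characters (a simplification, since tokenizers differ):

```python
# Back-of-the-envelope cost impact of shorter outputs.
# Assumption: billed output tokens are proportional to output characters.
MS4_CHARS = 1_600
QWEN_CHARS = 5_950  # midpoint of the 5,800-6,100 range above

ratio = QWEN_CHARS / MS4_CHARS
savings = 1 - 1 / ratio

print(f"Qwen outputs are ~{ratio:.1f}x longer")        # ~3.7x
print(f"Output-token cost reduction: ~{savings:.0%}")  # ~73%
```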
Full Comparison Table
| Feature | Mistral Small 4 | GPT-4o-mini | Phi-4 (14B) | Gemma 3 (27B) | Qwen 3.5-122B |
|---|---|---|---|---|---|
| Total params | 119B (MoE) | Unknown | 14B (dense) | 27B (dense) | 122B (MoE) |
| Active params | ~6.5B | Unknown | 14B | 27B | ~22B |
| Context | 256K | 128K | 16K | 128K | 262K |
| Vision | Yes | Yes | No | Yes | Yes |
| Configurable reasoning | Yes | No | No | No | Yes |
| Function calling | Native | Native | Yes | Yes | Yes |
| License | Apache 2.0 | Proprietary | MIT | Apache 2.0 | Apache 2.0 |
| Local deployment | Multi-GPU | API only | Single GPU | Single GPU | Multi-GPU |
| GPQA Diamond | 71.2% | 40.2% | N/A | N/A | N/A |
| MMLU-Pro | 78.0% | 64.8% | N/A | N/A | N/A |
Performance Gains Over Mistral Small 3
| Metric | Improvement | Configuration |
|---|---|---|
| End-to-end latency | 40% faster | Latency-optimized |
| Requests per second | 3x more throughput | Throughput-optimized |
These gains are directly tied to the MoE architecture: 6.5B active parameters versus 24B for Small 3, roughly 3.7 times less compute per token. Despite a model 5 times heavier in total weights, each forward pass is faster to compute once the model is loaded in memory.
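The "roughly 3.7 times" figure follows directly from the active-parameter counts:

```python
# Per-token compute ratio between dense Small 3 and MoE Small 4.
SMALL3_ACTIVE = 24e9   # dense: every parameter fires on every token
SMALL4_ACTIVE = 6.5e9  # MoE: only the 4 routed experts fire

compute_ratio = SMALL3_ACTIVE / SMALL4_ACTIVE
print(f"Small 3 does ~{compute_ratio:.1f}x more compute per token")  # ~3.7x
```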
Mistral also provides an eagle head model (Mistral-Small-4-119B-2603-eagle) for speculative decoding, further reducing generation latency.
How to Deploy Mistral Small 4: Step-by-Step Technical Guide
Option 1: vLLM (Recommended for Production)
Mistral released a dedicated Docker image with fixes for tool calling and reasoning parsing (these fixes will merge into vLLM main within 1-2 weeks):
```shell
# Mistral Docker image
docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest
```
Manual installation from the Mistral fork:
```shell
git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
uv pip install git+https://github.com/huggingface/transformers.git

# Verify mistral_common >= 1.10.0
python -c "import mistral_common; print(mistral_common.__version__)"
```
Launch command:
```shell
vllm serve mistralai/Mistral-Small-4-119B-2603 \
  --max-model-len 262144 \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 128 \
  --gpu_memory_utilization 0.8
```
The vLLM server then exposes an OpenAI-compatible API on localhost:8000.
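Because the server is launched with `--tool-call-parser mistral` and `--enable-auto-tool-choice`, function calling works through the standard OpenAI `tools` parameter. The `get_weather` tool below is a made-up illustration, not part of the model or the server:

```python
# Function-calling request shape for the vLLM OpenAI-compatible endpoint.
# "get_weather" and its schema are illustrative; any JSON-schema tool works.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With a running server (see the launch command above):
# from openai import OpenAI
# client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# response = client.chat.completions.create(
#     model="mistralai/Mistral-Small-4-119B-2603",
#     messages=[{"role": "user", "content": "What's the weather in Lyon?"}],
#     tools=tools,
#     tool_choice="auto",
# )
# response.choices[0].message.tool_calls then carries the parsed call.

print(tools[0]["function"]["name"])  # get_weather
```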
Option 2: NVIDIA NIM (Cloud or On-Prem)
Available from launch day on build.nvidia.com, with free prototyping. For production, NIM containers integrate into your existing infrastructure. The NVFP4 checkpoint is optimized for NVIDIA H100/H200/B200 GPUs and offers the best performance-to-memory ratio.
Option 3: llama.cpp and Ollama (In Progress)
At launch, llama.cpp support is under development via PR #20649 in the official repository. Once merged, community GGUF quantizations (bartowski, unsloth) will follow.
```shell
# Once GGUF support is available:
./llama-cli -hf mistralai/Mistral-Small-4-119B-2603-GGUF:Q4_K_M --jinja
```
Ollama support depends on llama.cpp stability. For now, mistral-small on Ollama points to 3.x versions.
Option 4: SGLang
SGLang announced day-0 support via @lmsysorg. It is a credible alternative to vLLM for high-concurrency workloads.
Option 5: HuggingFace Transformers
For prototyping only. FP8 weights require manual conversion to BF16. The reference code uses Mistral3ForConditionalGeneration.
```python
from transformers import AutoProcessor, Mistral3ForConditionalGeneration
# See the full snippet on the HuggingFace model card
```
Quantization: Available Options and Memory Impact
| Checkpoint | Format | Usage | Estimated VRAM |
|---|---|---|---|
| `mistralai/Mistral-Small-4-119B-2603` | BF16 + FP8 | Production (reference) | ~120-240 GB |
| `mistralai/Mistral-Small-4-119B-2603-eagle` | BF16 | Speculative decoding | ~240 GB |
| NVFP4 checkpoint | NVFP4 (4-bit) | NVIDIA-optimized production | ~60-75 GB |
| GGUF (community, upcoming) | Q4_K_M and others | Local deployment | ~60-75 GB |
The NVFP4 checkpoint is the recommended path for NVIDIA deployments. For Apple Silicon setups (M3 Max with 128 GB unified memory), GGUF Q4_K_M quantizations will be the option of choice once available.
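The VRAM column follows from bytes per parameter for each format. These are raw weight sizes only; KV cache and activation overhead come on top, which is why the table ranges run higher:

```python
# Raw weight footprint per format; runtime overhead (KV cache, activations)
# is extra, which is why deployments budget more than this.
PARAMS = 119e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4 / Q4_K_M": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt:>14}: ~{PARAMS * nbytes / 1e9:.0f} GB")
# BF16 → ~238 GB, FP8 → ~119 GB, 4-bit → ~60 GB
```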
Required Infrastructure for Self-Hosting
| Configuration | Hardware |
|---|---|
| Minimum (H100) | 4x NVIDIA HGX H100 |
| Minimum (H200) | 2x NVIDIA HGX H200 |
| Minimum (B200) | 1x NVIDIA DGX B200 |
| Recommended (H100) | 4x NVIDIA HGX H100 |
| Recommended (H200) | 4x NVIDIA HGX H200 |
| Recommended (B200) | 2x NVIDIA DGX B200 |
Tensor parallelism: the reference vLLM configuration uses --tensor-parallel-size 2 for 2-GPU setups. Larger TP sizes are supported for multi-GPU configurations.
Apple Silicon case: in full BF16, the model requires ~238 GB of RAM. With 4-bit quantization, this drops to ~60-75 GB, making the model usable on an M3 Max with 128 GB of unified memory.
When to Choose Mistral Small 4 Over GPT-4o-mini or Qwen 3.5
Choose Mistral Small 4 if:
You need configurable reasoning: GPT-4o-mini offers no reasoning mode. With Small 4, you can alternate between fast mode and deep mode without switching models.
You want a self-hostable Apache 2.0 model: GPT-4o-mini is proprietary and API-only. Small 4 can run on your infrastructure with no data sent externally.
You are building multi-tool agents: native function calling inherited from Devstral, combined with reasoning and vision, makes it a strong choice for complex agentic workflows.
Long context is critical: 256K tokens, double GPT-4o-mini's 128K. Ideal for codebase analysis, legal documents, or technical reports.
Prefer GPT-4o-mini if:
You do not need self-hosting and the simplicity of the OpenAI API is a priority
Your request volume is low and the infrastructure cost of a multi-GPU deployment is not justified
You need stable, well-documented models with a mature third-party tool ecosystem
Prefer Qwen 3.5-122B if:
You need very long context (262K) comparable to Mistral Small 4 with an alternative MoE ecosystem
Your use cases primarily target Asia (Qwen excels on Chinese and Asian benchmarks)
Note: Qwen generates significantly longer outputs, which increases production costs
Technical Limitations to Anticipate
Heavy infrastructure: 119B total parameters require at minimum 4x H100. This is not deployable on an RTX 4090 (24 GB VRAM), even with quantization. The r/LocalLLaMA community has highlighted this as the main frustration.
Incomplete llama.cpp support at launch: PR #20649 is open but not merged. No official GGUF, no working Ollama support for now.
vLLM: fixes not merged into main: you need to use the Mistral Docker image or dedicated fork for tool calling and reasoning parser. Fixes should be integrated within 1-2 weeks of launch.
Transformers: FP8 workaround required: FP8 weights (F8_E4M3) are not natively supported by HuggingFace Transformers. Manual conversion to BF16 is required.
No lightweight variant: unlike the Small 3 family (Ministral 3B/8B/14B), there is no lightweight companion model for single-GPU or edge deployment.
API pricing not published: prices on the Mistral API are not yet available. Expect a range between Small 3.1 (~$0.10/1M tokens input) and Medium 3.1 (~$0.40/1M tokens input).
Training data not documented: no information about the training dataset has been disclosed.
For Agencies: Integrating Mistral Small 4 Into Client Projects
Mistral Small 4 is particularly well suited for agencies and AI development studios for several reasons:
One model to master and maintain instead of three specialized ones, simplifying team training and project maintenance
Apache 2.0 license: you can deploy it at client sites with no license fees and create derivative products
Per-request configurable reasoning: you can design workflows where AI complexity automatically adapts to task difficulty, optimizing costs for your clients
Fine-tuning possible via Axolotl: you can create specialized versions for client verticals (legal, healthcare, finance)
The model is available now on Hugging Face, via the Mistral API, and on NVIDIA NIM.
