Efficiency, Accuracy, and Openness Reimagined. Nemotron is built to power the next generation of multi-agent workflows.
Open Weights, Open Data, Open Recipes. We empower you to own your AI stack completely.
Designed specifically for tool use, multi-step reasoning, and long-context agentic workflows.
Hybrid architectures delivering small model footprint with large model intelligence.
“The Brain of Your Next Agent”
World's first Hybrid Mamba-Transformer MoE. Combines Mamba's near-linear long-context speed with Transformer attention precision.
31.6B Parameters, yet only 3.6B Active per token. Delivers 4x Faster Throughput than previous generations.
Perfect memory for RAG and long-document analysis, keeping context intact across long horizons.
Toggle "Thinking Mode" ON/OFF. Trade token budget between fast responses and deep chain-of-thought reasoning.
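The 31.6B-total / 3.6B-active figures above are what make the Mixture-of-Experts design pay off: per-token compute scales with the active parameters, not the total. A back-of-envelope sketch (illustrative arithmetic only, not a benchmark):

```python
# Sparse-activation arithmetic using the stated model-card numbers:
# 31.6B total parameters, 3.6B active per token.
TOTAL_PARAMS = 31.6e9
ACTIVE_PARAMS = 3.6e9

# Fraction of the network's weights that actually run per token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Weights used per token: {active_fraction:.1%}")

# Ideal per-token compute reduction versus a dense model of the same
# total size (real-world throughput gains are smaller due to routing
# and memory-bandwidth overheads).
ideal_reduction = TOTAL_PARAMS / ACTIVE_PARAMS
print(f"Ideal compute reduction vs. dense: {ideal_reduction:.1f}x")
```

This is why an MoE can carry large-model knowledge in its full weight set while serving tokens at small-model cost.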
From edge devices to data center scale, there is a model for every task.
The Judge & Teacher
Beats GPT-4o on release. Best for hard instruction following and synthetic data generation.
The Deep Researcher
High accuracy for deep research agents. Fits on a single Data Center GPU.
The Heavyweight
Maximum accuracy for rigorous enterprise workflows like Supply Chain and Security.
The Infrastructure
Industry-leading embedding/reranking models and safety guardrails.
We didn't just open the model; we opened the lab.
What hardware do I need to run it? For full precision (BF16), you need ~60GB+ of VRAM (e.g., an 80GB A100). Quantized (FP8/INT4), it fits in ~20-32GB of VRAM (e.g., RTX 3090/4090).
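Those VRAM figures follow directly from parameter count times bytes per parameter. A quick estimate for the weights alone (KV cache, activations, and framework overhead add more on top):

```python
# Back-of-envelope VRAM estimate for model weights at each precision.
# 31.6e9 parameters; bytes per parameter depend on the data type.
PARAMS = 31.6e9
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{dtype}: ~{gb:.1f} GB for weights")
```

BF16 lands at ~63 GB, which is why an 80GB-class GPU is needed at full precision, while FP8 and INT4 bring the weights down into consumer-GPU range.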
Can I use it commercially? Yes. It is released under the Open Model License, which permits commercial use and derivative works.
Does it support reasoning? Yes. Nemotron features a toggleable Reasoning Mode that improves accuracy on complex tasks at the cost of additional tokens.
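A minimal sketch of toggling that mode, assuming it is exposed as a system-prompt switch as in earlier Nemotron releases ("detailed thinking on/off"); check your model's card for the exact toggle it expects:

```python
# Hypothetical sketch: switching Reasoning Mode via the system prompt.
# The "detailed thinking on/off" phrasing follows earlier Nemotron
# releases; verify the exact mechanism in the model card.
def build_messages(user_prompt: str, thinking: bool) -> list[dict]:
    mode = "on" if thinking else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Plan a three-step data migration.", thinking=True)
print(messages[0]["content"])  # detailed thinking on
```

The same message list can then be passed to any chat-completion API; flipping `thinking` is all that changes between the fast and deep-reasoning paths.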
What's the best way to run inference? We recommend vLLM for the fastest Nemotron inference on high-performance GPUs; it is also fully supported by standard Hugging Face Transformers.
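A minimal launch sketch using vLLM's OpenAI-compatible server; the model ID below is a placeholder, so substitute the Hugging Face ID of the checkpoint you are deploying, and tune the context length to your hardware:

```shell
# Deployment sketch: serve the model with vLLM's OpenAI-compatible server.
# <nemotron-model-id> is a placeholder for the checkpoint you deploy.
pip install vllm
vllm serve nvidia/<nemotron-model-id> --max-model-len 32768
```

Once running, the server accepts standard OpenAI-style chat-completion requests on port 8000 by default.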