March 20, 2025

How to Install and Run NVIDIA’s Llama-3.1-Nemotron-Nano-8B-v1 Large Language Model Locally on Windows

In this machine-learning and large language model (LLM) tutorial, we explain how to install and run NVIDIA’s Llama-3.1-Nemotron-Nano-8B-v1 LLM locally on a Windows computer. This model is derived from the Llama 3.1 model and is intended for RAG and tool-calling applications. The YouTube tutorial is given below.

How to install and run NVIDIA’s Llama-3.1-Nemotron-Nano-8B-v1 Large Language Model on Windows

Before installing this model, it is important to first install

  1. NVIDIA CUDA Toolkit 12.4.
  2. Python 3.12.X (where X can be any number). Note that PyTorch will not work on Windows with Python 3.13.X. Consequently, you need to install Python 3.12 or an older version.

The next step is to open a Command Prompt in Windows and verify that the CUDA Toolkit and Python are properly installed. To do that, type in the Command Prompt:

nvcc --version
python --version

This should return the NVIDIA NVCC compiler version (installed with the NVIDIA CUDA Toolkit) and the Python version. You should see CUDA 12.4 and Python 3.12.X (where X can be any number).

Next, we need to create a workspace folder and a Python virtual environment. First, create the workspace folder:

cd\
mkdir testModel
cd testModel

Then, create and activate the Python virtual environment:

python -m venv env1
env1\Scripts\activate.bat

Install the necessary libraries. First, we need to install PyTorch with CUDA support. Go to the PyTorch “Get Started” page (https://pytorch.org/get-started/locally/) and use the selection table to generate the installation command for your system.

The installation command for PyTorch is:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Next, we need to install the Hugging Face Hub library, as well as the Transformers and Accelerate libraries:

pip install huggingface_hub
pip install transformers 
pip install accelerate
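
Before writing the test code, it is a good idea to verify that the CUDA-enabled PyTorch build was installed correctly. The snippet below is a minimal check (run it inside the activated virtual environment) that uses standard PyTorch calls to report the detected versions and GPU name:

import torch

# Report the installed PyTorch version and whether a CUDA-capable GPU is visible
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # CUDA runtime version PyTorch was built against and the name of the detected GPU
    print("CUDA version:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))

If "CUDA available" prints False, the CPU-only PyTorch build was most likely installed; in that case, repeat the PyTorch installation step with the cu124 index URL.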

After installing these libraries, write and execute the following test code:

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   temperature=0.6,
   top_p=0.95,
   **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "on"

messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
]

print(pipeline(messages))

Make sure that you execute this code inside the Python virtual environment you created. The first time you run it, the model weights will be downloaded from the Hugging Face Hub. On subsequent runs, the model will be loaded automatically from the local disk cache.
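
The test code above prints the raw pipeline output. As a small sketch (the exact output structure may vary between Transformers versions), the code below shows one way to print only the assistant’s reply; switching the model’s reasoning mode off simply means setting thinking = "off" before building the system message:

# Sketch: assumes the pipeline and messages objects from the test code above
output = pipeline(messages)

# With chat-style input, recent Transformers versions return the full conversation
# under "generated_text"; the assistant's reply is then the last message
reply = output[0]["generated_text"][-1]["content"]
print(reply)

# If your Transformers version returns a plain string instead, print it directly:
# print(output[0]["generated_text"])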