In this machine learning and large language model tutorial, we explain how to compile and build the llama.cpp program with GPU support from source on Windows. For readers who are not familiar with it, llama.cpp is a program for running large language models (LLMs) locally, and a model can be started with a single command line. After explaining how to build llama.cpp, we explain how to run Microsoft’s Phi-4 LLM. The YouTube tutorial is given below.
The main reason for building llama.cpp from scratch is that, in our experience, the prebuilt binary versions of llama.cpp found online do not fully exploit the GPU resources. To make sure that llama.cpp fully exploits the GPU card, we need to build it from source by using the CUDA and C++ compilers.
Prerequisites
- You need to install Microsoft Visual Studio with its C++ compilers and build tools. This is necessary for building llama.cpp from source. To install it, go to the Visual Studio website and download and install the Microsoft Visual Studio Community edition together with the Desktop development with C++ workload (a quick CMake check is given after this list).
- You need to install the CUDA Toolkit in order to be able to compile llama.cpp with CUDA support. To do so, go to the NVIDIA CUDA Toolkit download page, download the installer, and install it.
- You need to install Git so that you can download the source files. To do so, go to the Git website and download and install the Git installation file.
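The build steps in the next section also rely on CMake. The Desktop development with C++ workload of Visual Studio normally bundles CMake, so once Visual Studio is installed you can quickly confirm that CMake is reachable from the Developer Command Prompt by typing
cmake --version
If this command is not recognized, install CMake separately or add the CMake component in the Visual Studio Installer.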
Install and Build Llama.cpp
Next, open the Microsoft Visual Studio Developer Command Prompt and first verify that the CUDA compilers are installed and on the path. To verify that the CUDA compilers are installed, type
nvcc --version
The output should be similar to
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_01:18:48_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
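As two optional sanity checks, you can also verify that the NVIDIA driver sees your GPU and that the Visual Studio C++ compiler is on the path of the Developer Command Prompt (nvidia-smi is installed together with the NVIDIA driver, and where is the standard Windows command for locating executables):
nvidia-smi
where cl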
Then verify that Git is installed by simply typing
git
and you should get the generic usage message explaining how to use Git. Then create a workspace folder and clone the remote llama.cpp repository
cd\
mkdir testLlama
cd testLlama
git clone https://github.com/ggerganov/llama.cpp
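If you only need the latest sources and want to save time and bandwidth, a shallow clone also works; --depth 1 is a standard Git option that skips the full commit history:
git clone --depth 1 https://github.com/ggerganov/llama.cpp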
Then, navigate to the llama.cpp folder and build the project
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
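The build can optionally be sped up by letting CMake run several parallel jobs and, if you know the compute capability of your GPU, by restricting the CUDA architectures that are compiled. The two commands below are only a sketch that assumes an 8-core machine and compute capability 8.6 (for example, an RTX 30-series card); adjust both numbers to your hardware:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j 8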
It will take around 20-30 minutes to build everything. Once llama.cpp is compiled, go to the Hugging Face website and download Microsoft’s Phi-4 LLM in GGUF format; the quantized model file used below is called phi-4-q4.gguf.
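Alternatively, you can download the model from the command line with the Hugging Face CLI. The sketch below assumes that Python and pip are available and that the file is hosted in the microsoft/phi-4-gguf repository (adjust the repository name if the model is hosted elsewhere):
pip install huggingface_hub
huggingface-cli download microsoft/phi-4-gguf phi-4-q4.gguf --local-dir .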
Then, copy this model file to
C:\testLlama\llama.cpp\build\bin\Release
where the executable files of llama.cpp are located. To run the model, navigate to this folder
cd C:\testLlama\llama.cpp\build\bin\Release
and run the model by typing
llama-cli --model phi-4-q4.gguf -cnv -c 16384
This will start the Phi-4 model in interactive (conversation) mode, and you can type your questions directly at the prompt.
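Since the whole point of this build is to use the GPU, it is also useful to control how much of the model is offloaded to it. The command below is a sketch that additionally passes the -ngl option of llama-cli, which sets the number of model layers offloaded to the GPU; 99 is simply a value large enough to offload all layers of Phi-4:
llama-cli --model phi-4-q4.gguf -cnv -c 16384 -ngl 99
If the model loads but runs only on the CPU, double-check that the build step was performed with -DGGML_CUDA=ON.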