January 16, 2025

How to Compile and Build the GPU version of llama.cpp from source and run LLM models on GPU

In this machine learning and large language model tutorial, we explain how to compile and build the llama.cpp program with GPU support from source on Windows. For readers who are not familiar with it, llama.cpp is a program for running large language models (LLMs) locally. You can run a model with a single command-line call. After explaining how to build llama.cpp, we show how to run Microsoft’s Phi-4 LLM. The YouTube tutorial is given below.

The main reason for building llama.cpp from source is that, in our experience, the prebuilt binaries of llama.cpp found online do not fully exploit the GPU resources. To make sure that llama.cpp fully exploits the GPU card, we need to build it from source using the CUDA and C++ compilers.

Prerequisites

Before building, make sure the following tools are installed: Microsoft Visual Studio with the C++ build tools, the NVIDIA CUDA Toolkit, CMake, and Git. All of these are used in the steps below.

Install and Build Llama.cpp

Next, open the Microsoft Visual Studio Developer Command Prompt and verify that the CUDA compilers are installed and on the path.

To verify that the CUDA compilers are installed, type

nvcc --version

The output should be similar to

Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_01:18:48_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
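If nvcc is not recognized, the CUDA bin directory is probably missing from the PATH. Assuming the default install location of CUDA 12.6 (adjust the version and path to match your installation), you can add it for the current command-prompt session:

set PATH=%PATH%;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin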

Then verify that Git is installed by typing

git

and you should get the generic message on how to use Git. Then create a workspace folder, navigate into it, and clone the remote llama.cpp repository

cd\
mkdir testLlama
cd testLlama
git clone https://github.com/ggerganov/llama.cpp

Then, navigate to the llama.cpp folder and build the project

cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
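The build can be sped up by letting CMake run several compile jobs in parallel. The job count below is just an example; pick a value that matches the number of CPU cores on your machine:

cmake --build build --config Release -j 8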

It will take around 20 to 30 minutes to build everything. Once llama.cpp is compiled, go to the Hugging Face website and download the Phi-4 GGUF model (the repository is called phi-4-gguf; the quantized file used below is phi-4-q4.gguf).
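If you prefer the command line, one way to download the file is with the huggingface-cli tool. This sketch assumes the file is published as phi-4-q4.gguf in the microsoft/phi-4-gguf repository; check the model page if the repository or file name differs:

pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/phi-4-gguf phi-4-q4.gguf --local-dir .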

Then, copy this model file to the folder where the executable files of llama.cpp are located:

C:\testLlama\llama.cpp\build\bin\Release
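For example, assuming the file was saved to your Downloads folder (adjust the source path to wherever you actually downloaded it), it can be copied from the command prompt:

copy "%USERPROFILE%\Downloads\phi-4-q4.gguf" C:\testLlama\llama.cpp\build\bin\Release\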

To run the model, navigate to this folder

cd C:\testLlama\llama.cpp\build\bin\Release

and run the model by typing

llama-cli --model phi-4-q4.gguf -cnv -c 16384

This will start Phi-4 in interactive (conversation) mode, and you can then type your questions directly at the prompt.
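Depending on the build and the model size, you may also want to control how many model layers are offloaded to the GPU. One option for this in llama.cpp is -ngl (also written --gpu-layers); the value 99 below is simply a large number intended to cover all layers of the model, so adjust it if you want to keep some layers on the CPU:

llama-cli --model phi-4-q4.gguf -cnv -c 16384 -ngl 99

The startup messages printed by llama-cli typically report how many layers were offloaded to the CUDA device, which is a quick way to confirm that the GPU build is actually being used.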