In this AI and machine learning tutorial, we explain how to install and run Microsoft’s Phi-4-multimodal-instruct locally. This multimodal AI model can handle a number of tasks, such as vision-language understanding, speech-to-text recognition and translation, optical character recognition, image understanding, and video understanding. The YouTube tutorial with all the installation instructions is given below.
Local Installation of Microsoft’s Phi-4-multimodal-instruct
The installation instructions given below assume that you are running the AI model on a Linux Ubuntu computer. First of all, you need to have the NVIDIA CUDA Toolkit and NVIDIA CUDA compilers installed on the system. To install the NVIDIA CUDA Toolkit and compilers, see the tutorial given here.
Next, open a terminal and type
sudo apt update && sudo apt upgrade
Next, double-check that the NVIDIA CUDA compiler (nvcc) and the GCC compiler are installed:
nvcc --version
gcc --version
Next, create a workspace folder
cd ~
mkdir Phi4Test
cd Phi4Test
Next, create and activate the Python virtual environment
sudo apt install python3.12-venv
python3 -m venv env1
source env1/bin/activate
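The terminal prompt should now start with (env1). Optionally, you can verify that python and pip resolve to the virtual environment before installing anything into it:
which python
which pip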
Next, create a Python requirements file.
nano requirements.txt
Add the following dependencies to the file
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
To install these dependencies, save the file, go back to the terminal, and type
pip install -r requirements.txt
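If pip fails while building flash_attn with an error about torch not being available, a common workaround (not part of the original instructions) is to install torch first and then re-run the command above. Once the installation finishes, you can optionally run a quick sanity check, a minimal sketch assuming the packages installed correctly, to confirm that PyTorch sees the GPU inside the virtual environment:
import torch
import flash_attn

# Versions should match the ones pinned in requirements.txt
print(torch.__version__)          # expected: 2.6.0
print(flash_attn.__version__)     # expected: 2.7.4.post1

# True means PyTorch can use the CUDA GPU required by the model
print(torch.cuda.is_available())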
Next, install the Hugging Face Hub libraries:
pip install huggingface-hub
pip install "huggingface_hub[cli]"
Next, create the folder into which the model files will be downloaded
mkdir model
Then, download the model by executing this line
huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir ./model
This will download the model files. Next, we need to write a test file. Open your favorite Python editor (in our case VS Code), and type this test code:
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
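# Optional: to load the weights downloaded into the local ./model folder above,
# instead of letting transformers fetch them from the Hugging Face cache, you
# could point model_path at that folder, e.g. model_path = "./model" (this path
# is an assumption based on the download step, not part of the original code).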
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',
).cuda()
# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)
# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
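# These tokens form the Phi-4 chat template: <|user|> ... <|end|><|assistant|>,
# with <|image_1|> / <|audio_1|> placeholders marking where each input goes.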
# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open the audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
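Save this code to a file, for example phi4_test.py (the file name is arbitrary), and run it from the terminal with the virtual environment still activated:
python phi4_test.py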
This code will download an image from the Internet and ask the model to describe it. It will also download a remote audio file, transcribe the speech to text, and translate the transcript to French. For more details on how to use this model, see the YouTube tutorial.