In this AI and machine learning tutorial, we explain how to install and run Microsoft’s Phi-4-multimodal-instruct locally. This multimodal AI model can handle a number of tasks, such as vision-language understanding, speech-to-text recognition and translation, optical character recognition, image understanding, and video understanding. The YouTube tutorial with all the installation instructions is given below.
Local Installation of Microsoft’s Phi-4-multimodal-instruct
The installation instructions given below assume that you are running the AI model on a Linux Ubuntu computer. First of all, you need to have the NVIDIA CUDA Toolkit and NVIDIA CUDA compilers installed on the system. To install the NVIDIA CUDA Toolkit and compilers, see the tutorial given here.
Next, open a terminal and type
sudo apt update && sudo apt upgrade
Next, double-check that the NVIDIA CUDA compiler (nvcc) and the GCC compiler are installed:
nvcc --version
gcc --version
Next, create a workspace folder
cd ~
mkdir Phi4Test
cd Phi4Test
Next, create and activate the Python virtual environment
sudo apt install python3.12-venv
python3 -m venv env1
source env1/bin/activate
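The terminal prompt should now start with (env1). Optionally, you can verify that python and pip resolve to the virtual environment before installing anything into it:
which python
which pip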
Next, create a Python requirements file.
nano requirements.txt
Add the following dependencies to the file
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
To install these dependencies, save the file, go back to the terminal, and type
pip install -r requirements.txt
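If pip fails while building flash_attn with an error about torch not being available, a common workaround (not part of the original instructions) is to install torch first and then re-run the command above. Once the installation finishes, you can optionally run a quick sanity check, a minimal sketch assuming the packages installed correctly, to confirm that PyTorch sees the GPU inside the virtual environment:
import torch
import flash_attn

# Versions should match the ones pinned in requirements.txt
print(torch.__version__)          # expected: 2.6.0
print(flash_attn.__version__)     # expected: 2.7.4.post1

# True means PyTorch can use the CUDA GPU required by the model
print(torch.cuda.is_available())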
Next, install the Hugging Face Hub libraries:
pip install huggingface-hub
pip install "huggingface_hub[cli]"
Next, create the folder into which the model files will be downloaded
mkdir model
Then, download the model by executing this line
huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir ./model
This will download the model files. Next, we need to write a test file. Open your favorite Python editor (in our case VS Code), and type this test code:
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
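# Optional: to load the weights downloaded into the local ./model folder above,
# instead of letting transformers fetch them from the Hugging Face cache, you
# could point model_path at that folder, e.g. model_path = "./model" (this path
# is an assumption based on the download step, not part of the original code).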
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',
).cuda()
# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)
# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
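# These tokens form the Phi-4 chat template: <|user|> ... <|end|><|assistant|>,
# with <|image_1|> / <|audio_1|> placeholders marking where each input goes.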
# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open the audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
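Save this code to a file, for example phi4_test.py (the file name is arbitrary), and run it from the terminal with the virtual environment still activated:
python phi4_test.py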
This code will download an image from the Internet and ask the model to describe it. It will also download a remote audio file, transcribe the speech to text, and translate the transcript to French. For more details on how to use this model, see the YouTube tutorial.