January 30, 2025

Tutorial on How to Install and Run the DeepSeek Janus-Pro-7B Multimodal AI Model Locally

– In this tutorial, we explain how to download, install, and run the DeepSeek Janus-Pro-7B multimodal understanding model locally.

First, let us explain what multimodal understanding models and multimodal deep learning are. Multimodal understanding is the ability to interpret, analyze, describe, and understand several simultaneous sources of information, such as visual, audio, and text data. In other words, multimodal understanding, or multimodal learning, is a branch of deep learning that integrates and simultaneously analyzes different types of data, such as audio, images, videos, and text. These different types of data are usually called modalities.

– Janus-Pro-7B is a multimodal understanding model. It can also be used for image generation.

– In this tutorial, we will explain how to use Janus-Pro-7B for multimodal tasks, and in the next one, we will explain how to use it for image generation.

The YouTube tutorial is given below.

The test image is given here. You can freely download this image and save it as “test1.png”.

The code for downloading the model from the Hugging Face website is given below.

from huggingface_hub import snapshot_download

# download the Janus-Pro-7B weights and configuration files to a local folder
snapshot_download(repo_id="deepseek-ai/Janus-Pro-7B",
                  local_dir="/home/aleksandar/Janus")
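
Before running the model, you can quickly check that the download completed by listing the files in the local folder. This is a minimal sketch, assuming the same local_dir as above.

import os

# list the downloaded files, assuming the same local_dir as in snapshot_download above
model_dir = "/home/aleksandar/Janus"
for name in sorted(os.listdir(model_dir)):
    path = os.path.join(model_dir, name)
    if os.path.isfile(path):
        print(f"{name}: {os.path.getsize(path) / 1e6:.1f} MB")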

The Python code for testing the model is given below.
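
Note that the script imports the janus package, which is distributed with the official DeepSeek Janus GitHub repository (https://github.com/deepseek-ai/Janus) and is typically installed by cloning that repository and running pip install -e . inside it. A minimal check that the package is available, under that assumption, is:

import importlib.util

# verify that the janus package (installed from the cloned DeepSeek Janus repository) is importable
if importlib.util.find_spec("janus") is None:
    raise SystemExit("The 'janus' package was not found. Clone https://github.com/deepseek-ai/Janus "
                     "and install it with 'pip install -e .' before running the script below.")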


import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "/home/aleksandar/Janus"
# load the chat processor and tokenizer
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# load the model and move it to the GPU in bfloat16 precision
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

question="Describe the image and is the entity on the image dangerous?"
image='test1.png'
conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)