January 30, 2025

Tutorial on How to Install and Run the DeepSeek Janus-Pro-7B Multimodal AI Model Locally

– In this tutorial, we explain how to download, install, and run the DeepSeek Janus-Pro-7B multimodal understanding model locally.

First, let us explain what multimodal understanding models and multimodal deep learning are. Multimodal understanding is the ability to interpret, analyze, describe, and understand several simultaneous sources of information, such as visual, audio, and text data. In other words, multimodal understanding, or multimodal learning, is a branch of deep learning that integrates and simultaneously analyzes different types of data, such as audio, images, videos, and text. These different types of data are usually called modalities.

– Janus-Pro-7B is a multimodal understanding model. It can also be used for image generation.

– In this tutorial, we will explain how to use Janus-Pro-7B for multimodal tasks, and in the next one, we will explain how to use it for image generation.

The YouTube tutorial is given below.

The test image is given here. You can freely download this image and save it as “test1.png”.

The code for downloading the model from the Hugging Face website is given below.

from huggingface_hub import snapshot_download

# download the Janus-Pro-7B weights and configuration files to a local folder
snapshot_download(repo_id="deepseek-ai/Janus-Pro-7B",
                  local_dir="/home/aleksandar/Janus")
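
Before running the model, you can quickly check that the download completed by listing the files in the local folder. This is a minimal sketch, assuming the same local_dir as above.

import os

# list the downloaded files, assuming the same local_dir as in snapshot_download above
model_dir = "/home/aleksandar/Janus"
for name in sorted(os.listdir(model_dir)):
    path = os.path.join(model_dir, name)
    if os.path.isfile(path):
        print(f"{name}: {os.path.getsize(path) / 1e6:.1f} MB")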

The Python code for testing the model is given below.
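
Note that the script imports the janus package, which is distributed with the official DeepSeek Janus GitHub repository (https://github.com/deepseek-ai/Janus) and is typically installed by cloning that repository and running pip install -e . inside it. A minimal check that the package is available, under that assumption, is:

import importlib.util

# verify that the janus package (installed from the cloned DeepSeek Janus repository) is importable
if importlib.util.find_spec("janus") is None:
    raise SystemExit("The 'janus' package was not found. Clone https://github.com/deepseek-ai/Janus "
                     "and install it with 'pip install -e .' before running the script below.")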


import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "/home/aleksandar/Janus"
# load the chat processor and tokenizer
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# load the model and move it to the GPU in bfloat16 precision
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

question="Describe the image and is the entity on the image dangerous?"
image='test1.png'
conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)