
February 20, 2025
Janus-Series: Unified Multimodal Understanding and Generation Models
The Janus series from DeepSeek-AI advances multimodal AI by unifying visual understanding and generation in a single model family. The models handle image interpretation, text-to-image generation, and multimodal reasoning, and the series ships in three variants tuned for different performance targets.
Janus-Pro, the most capable of the three, uses a 7-billion-parameter architecture for high-performance multimodal tasks, making it well suited to demanding AI applications.
Janus, a compact 1.3-billion-parameter model, trades some capability for faster inference and a lighter deployment footprint.
JanusFlow, a variant built on rectified flow (a flow-matching relative of diffusion models), handles both image generation and image understanding, fitting creative as well as analytical visual work. This post reviews each model's capabilities and provides quick-start instructions for deployment.
Model Variations
Janus-Pro
The largest model in the series, Janus-Pro, has 7B parameters for tackling challenging multimodal tasks. It excels at image captioning, visual question answering, and text-to-image generation.
Janus
Janus (1.3B) is a smaller sibling of Janus-Pro that balances performance against cost. It is a good fit for jobs that need fast inference on modest hardware.
JanusFlow
Designed for unified understanding and generation, JanusFlow pairs strong multimodal reasoning with high-quality image synthesis. Its generation path is based on rectified flow rather than a conventional diffusion sampler, which improves coherence and visual realism.
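For intuition on rectified flow (the generative approach JanusFlow's report describes): the model learns a velocity field that carries samples along near-straight paths from noise to data, and the regression target at any time t is simply the difference between the data point and the noise sample. The toy NumPy sketch below illustrates that straight-line path and its constant target velocity; it is a conceptual illustration only, not JanusFlow's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)           # noise sample (start of the path)
x1 = np.array([1.0, 2.0, 3.0, 4.0])   # "data" sample (end of the path)

def interpolate(x0, x1, t):
    """Rectified-flow path: a straight line from noise x0 to data x1."""
    return (1 - t) * x0 + t * x1

# The regression target for the velocity field is constant along the path.
velocity_target = x1 - x0

# Check: the finite-difference slope of the path matches the target.
t, dt = 0.3, 1e-6
slope = (interpolate(x0, x1, t + dt) - interpolate(x0, x1, t)) / dt
print(np.allclose(slope, velocity_target, atol=1e-4))  # True
```

Because the paths are (near-)straight, sampling can take far fewer integration steps than a curved diffusion trajectory would need.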
Quick Start
Janus-Pro
Janus-Pro is a unified multimodal model that can both interpret and generate text and images.
Installation
git clone https://github.com/deepseek-ai/Janus
cd Janus
pip install -e .
Multimodal Understanding with Janus-Pro
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
# Load the Janus-Pro model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
).to(torch.bfloat16).cuda().eval()
# Example conversation with an image input
image = "images/example.jpg"  # path to your input image
conversation = [
    {"role": "<|User|>", "content": "<image_placeholder>\nWhat is in this image?", "images": [image]},
    {"role": "<|Assistant|>", "content": ""},
]
# Load image and process inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(conversation, images=pil_images, force_batchify=True).to(vl_gpt.device)
# Generate response
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    max_new_tokens=512,
)
# Decode and print answer
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
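For reference, the conversation above is a list of role-tagged turns that the processor flattens into a single prompt string before tokenization. The sketch below shows roughly what that flattening looks like; render_conversation is a hypothetical illustration, not the actual VLChatProcessor template:

```python
def render_conversation(conversation, sep="\n\n"):
    """Join role-tagged turns into one prompt string (illustrative only)."""
    parts = []
    for turn in conversation:
        # An empty assistant turn leaves the tag open for the model to complete.
        parts.append(f"{turn['role']}: {turn['content']}".rstrip())
    return sep.join(parts)

conversation = [
    {"role": "<|User|>", "content": "<image_placeholder>\nWhat is in this image?"},
    {"role": "<|Assistant|>", "content": ""},
]
prompt = render_conversation(conversation)
print(prompt)
# <|User|>: <image_placeholder>
# What is in this image?
#
# <|Assistant|>:
```

The `<image_placeholder>` token marks where the image embeddings are spliced into the text sequence; the trailing empty assistant turn is where generation begins.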
Text-to-Image Generation with Janus-Pro
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
# Load Janus-Pro model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
).to(torch.bfloat16).cuda().eval()
# Text prompt for image generation
conversation = [
    {"role": "<|User|>", "content": "A stunning princess from Kabul in red and white traditional clothing, blue eyes, brown hair"},
    {"role": "<|Assistant|>", "content": ""},
]
# Generate image
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation, sft_format=vl_chat_processor.sft_format, system_prompt=""
)
prompt = sft_format + vl_chat_processor.image_start_tag
@torch.inference_mode()
def generate_image(model, processor, prompt, parallel_size=16, cfg_weight=5.0, temperature=1.0):
    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt))
    # Duplicate the prompt: even rows are conditional, odd rows unconditional (for CFG).
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = processor.pad_id
    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, 576), dtype=torch.int).cuda()
    outputs = None
    for i in range(576):
        outputs = model.language_model.model(
            inputs_embeds=inputs_embeds, use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
        # Classifier-free guidance: mix conditional and unconditional logits.
        logit_cond, logit_uncond = logits[0::2, :], logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
        next_token = torch.multinomial(torch.softmax(logits / temperature, dim=-1), num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)
        # Feed the sampled image token back in for both CFG branches.
        next_token = torch.cat([next_token, next_token], dim=1).view(-1)
        inputs_embeds = model.prepare_gen_img_embeds(next_token).unsqueeze(dim=1)
    dec = model.gen_vision_model.decode_code(generated_tokens.to(torch.int), shape=[parallel_size, 8, 24, 24])
    img_array = np.clip((dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
    os.makedirs("generated_samples", exist_ok=True)
    for i in range(parallel_size):
        PIL.Image.fromarray(img_array[i]).save(f"generated_samples/img_{i}.jpg")

generate_image(vl_gpt, vl_chat_processor, prompt)
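The final lines of the function map the vision decoder's output from the generator's [-1, 1] value range to 8-bit pixels. That conversion is plain array arithmetic, shown here in isolation on dummy data:

```python
import numpy as np

# Dummy decoder output: values in [-1, 1], shaped (batch, channels, H, W).
dec = np.array([[[[-1.0, 0.0], [0.5, 1.0]]]])

# Move channels last, rescale [-1, 1] -> [0, 255], clamp, and cast to uint8.
img_array = np.clip((dec.transpose(0, 2, 3, 1) + 1) / 2 * 255, 0, 255).astype(np.uint8)

print(img_array.shape)        # (1, 2, 2, 1)
print(img_array.reshape(-1))  # [  0 127 191 255]
```

The clip guards against decoder values that stray slightly outside [-1, 1], which would otherwise wrap around when cast to uint8.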
Janus (1.3B)
A smaller version of Janus-Pro, optimized for faster inference.
Installation
pip install -e .
Multimodal Understanding Example
Janus can also process images and text inputs, like converting equations into LaTeX:
import torch
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
# Load Janus model
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = MultiModalityCausalLM.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda().eval()
conversation = [
    {"role": "User", "content": "<image_placeholder>\nConvert the formula into LaTeX code.", "images": ["images/equation.png"]},
    {"role": "Assistant", "content": ""},
]
# Load and process input
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(conversation, images=pil_images, force_batchify=True).to(vl_gpt.device)
# Generate output
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(inputs_embeds=inputs_embeds, attention_mask=prepare_inputs.attention_mask, max_new_tokens=512)
# Decode response
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
JanusFlow
A specialized model that couples an autoregressive language model with rectified flow for high-quality text-to-image generation.
Installation
pip install -e .
pip install diffusers[torch]
Multimodal Understanding Example
JanusFlow can process and analyze images, similar to Janus:
import torch
from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
model_path = "deepseek-ai/JanusFlow-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = MultiModalityCausalLM.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda().eval()
conversation = [
    {"role": "User", "content": "<image_placeholder>\nDescribe this image.", "images": ["images/example.jpg"]},
    {"role": "Assistant", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(conversation, images=pil_images, force_batchify=True).to(vl_gpt.device)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(inputs_embeds=inputs_embeds, attention_mask=prepare_inputs.attention_mask, max_new_tokens=512)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
Conclusion
From text-image understanding to flow-based image synthesis, the Janus series offers a capable lineup of multimodal AI models. Whether you need a high-performance model (Janus-Pro), a lightweight option (Janus), or a flow-based generator (JanusFlow), the series provides practical starting points for multimodal applications.