Microsoft has officially expanded its Phi-4 series with the introduction of Phi-4-mini-instruct (3.8B) and Phi-4-multimodal (5.6B), complementing the previously released Phi-4 (14B) model known for its advanced reasoning capabilities. These additions significantly enhance multilingual support, reasoning, and mathematical skills, and introduce multimodal capabilities.
Phi-4-multimodal, the lightweight open multimodal model, integrates text, vision, and speech processing, offering seamless interaction across different data formats. With a 128K-token context length and 5.6B parameters, it stands out as a powerful tool optimized for both on-device execution and low-latency inference.
In this article, we will dig deep into Phi-4-multimodal, a state-of-the-art multimodal small language model (SLM) capable of processing text, vision, and audio inputs. We will also explore practical hands-on implementations, helping developers integrate generative AI into real-world applications.
Phi-4-multimodal is a cutting-edge model designed to process text, vision, and audio inputs within a single architecture, making it highly versatile. Here’s a breakdown of language support for each modality:
| Modality | Supported Languages |
| --- | --- |
| Text | Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian |
| Vision | English |
| Audio | English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese |
Phi-4-multimodal’s mixture-of-LoRAs architecture allows simultaneous processing of speech, vision, and text. Unlike earlier approaches that required distinct sub-models for each modality, it handles all inputs within a single framework, significantly improving efficiency and coherence.
Phi-4-multimodal performs exceptionally well in tasks that require chart/table understanding and document reasoning, thanks to its ability to reason jointly over visual and textual inputs. Benchmarks indicate higher accuracy than other state-of-the-art multimodal models, particularly in structured data interpretation.
The core Phi-4 Mini model is responsible for reasoning, generating responses, and fusing multimodal information.
These benchmarks include AI2D, ChartQA, DocVQA, and InfoVQA, standard datasets for evaluating multimodal models on visual question answering (VQA) and document understanding.
Microsoft provides open-source resources that allow developers to explore Phi-4-multimodal’s capabilities. Below, we explore practical applications using Phi-4 multimodal.
!pip install flash_attn==2.7.4.post1 torch==2.6.0 transformers==4.48.2 accelerate==1.3.0 soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 torchvision==0.21.0 backoff==2.2.1 peft==0.13.2
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
model_path = "microsoft/Phi-4-multimodal-instruct"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path)
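# Special tokens used by the Phi-4-multimodal chat template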
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
The image shows a street scene with a red stop sign in the foreground. The
stop sign is mounted on a pole with a decorative top. Behind the stop sign,
there is a traditional Chinese building with red and green colors and
Chinese characters on the signboard. The building has a tiled roof and is
adorned with red lanterns hanging from the eaves. There are several people
walking on the sidewalk in front of the building. A black SUV is parked on
the street, and there are two trash cans on the sidewalk. The street is
lined with various shops and signs, including one for 'Optus' and another
for 'Kuo'. The overall scene appears to be in an urban area with a mix of
modern and traditional elements.
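The chart/table understanding and document reasoning capabilities mentioned earlier can be exercised with exactly the same pipeline. Below is a minimal sketch that reuses the model, processor, generation config, and prompt tokens loaded above; chart.png is a placeholder path you would replace with your own chart or scanned document image.
print("\n--- CHART / DOCUMENT UNDERSTANDING (sketch) ---")
# Reuses `model`, `processor`, `generation_config` and the prompt tokens defined above.
# `chart.png` is a placeholder path; point it at your own chart or document image.
chart_image = Image.open('chart.png')
chart_question = 'Summarize the key trend shown in this chart.'
prompt = f'{user_prompt}<|image_1|>{chart_question}{prompt_suffix}{assistant_prompt}'
inputs = processor(text=prompt, images=chart_image, return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])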
Similarly, you can use the same model and processor for audio processing:
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open the audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
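Thanks to the mixture-of-LoRAs design described earlier, both modalities can also be combined in a single prompt. The sketch below is an illustration rather than official usage: it assumes the processor accepts an <|image_1|><|audio_1|> prompt with images and audios passed together, and it reuses the image and audio already loaded in the examples above.
print("\n--- COMBINED IMAGE + AUDIO (sketch) ---")
# Assumes the processor accepts image and audio placeholders in one prompt;
# reuses `image` and `(audio, samplerate)` from the earlier examples.
combined_prompt = (
    f'{user_prompt}<|image_1|><|audio_1|>'
    'Describe the image, then answer the question asked in the audio.'
    f'{prompt_suffix}{assistant_prompt}'
)
inputs = processor(
    text=combined_prompt,
    images=image,
    audios=[(audio, samplerate)],
    return_tensors='pt',
).to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])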
Use Cases:
1. Image Analysis by Phi-4 Multimodal
2. Mathematics Image Analysis by Phi-4 Multimodal
One of the standout aspects of Phi-4-multimodal is its ability to operate on edge devices, making it an ideal solution for IoT applications and environments with limited computing resources.
Potential edge deployments range from on-device AI assistants and real-time voice translation to document processing in IoT scenarios; a quantized-loading sketch follows below.
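As a rough illustration of squeezing the model onto memory-constrained hardware, the sketch below loads it with 4-bit quantization via bitsandbytes. This is an assumption for illustration only, not an official deployment recipe; whether the model's custom code path works well with quantized weights should be verified on your target hardware, and ONNX-optimized builds of the Phi family are the more production-oriented route for on-device inference.
# Hedged sketch: 4-bit quantized load for memory-constrained devices.
# Assumes bitsandbytes is installed and compatible with this model's
# trust_remote_code path -- verify on your target hardware.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
edge_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
edge_processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)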
Microsoft’s Phi-4 Multimodal is a breakthrough in AI, seamlessly integrating text, vision, and speech processing in a compact, high-performance model. Ideal for AI assistants, document processing, and multilingual applications, it unlocks new possibilities in smart, intuitive AI solutions.
For developers and researchers, hands-on access to Phi-4 enables cutting-edge innovation—from code generation to real-time voice translation and IoT applications—pushing the boundaries of multimodal AI.