
Vision language models (VLMs) are artificial intelligence (AI) models that blend computer vision and natural language processing (NLP) capabilities.
These models learn to map relationships between text data and visual data, such as images or videos. This ability allows VLMs to generate text from visual inputs or interpret natural language prompts within the context of visual information.
Also called visual language models, VLMs combine large language models (LLMs) with vision models or visual machine learning (ML) algorithms.
As multimodal AI systems, VLMs accept text and images or videos as input and produce text output. This output often takes the form of image or video descriptions, answers to questions about an image, or identification of parts of an image or objects in a video.
Elements of a Vision Language Model
Typically, vision language models consist of two key components:
- A language encoder
- A vision encoder
Language encoder
This component captures semantic meaning and contextual associations between words and phrases. It transforms them into text embeddings that AI models can process.
Most VLMs use a neural network architecture called the transformer model for their language encoder. Notable examples include Google’s BERT (Bidirectional Encoder Representations from Transformers), one of the foundational models behind many current LLMs, and OpenAI’s generative pretrained transformer (GPT).
In brief, the transformer architecture works as follows:
- Encoders convert input sequences into numerical embeddings that capture the meaning and position of tokens.
- A self-attention mechanism enables transformers to “focus” on the most important tokens, regardless of their position.
- Decoders use this attention and the embeddings to generate the most probable output sequence.
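To make the language-encoder step concrete, the short sketch below uses the Hugging Face Transformers library to turn a sentence into contextual token embeddings. The bert-base-uncased checkpoint and the example caption are illustrative choices, not requirements of any particular VLM.

```python
# Sketch: producing text embeddings with a pretrained BERT encoder.
# The checkpoint name and input sentence are illustrative choices.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A dog catching a frisbee in the park", return_tensors="pt")
outputs = encoder(**inputs)

# One embedding vector per token, capturing meaning and position in context.
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, 768) for bert-base-uncased
```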
Vision encoder
This part extracts essential visual features like colors, shapes, and textures from an image or video input. It then converts these features into vector embeddings that machine learning models can interpret.
Earlier VLM versions relied on deep learning algorithms such as convolutional neural networks for feature extraction. Today’s vision language models often employ vision transformers (ViT), which borrow elements from transformer-based language models.
A ViT splits an image into patches and treats them as sequences, similar to tokens in a language transformer. It applies self-attention across these patches to form a transformer-based representation of the input image.
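The following PyTorch sketch shows the patch-embedding step described above: an image is cut into fixed-size patches and each patch is projected into an embedding, producing a sequence the transformer can attend over. The image size, patch size, and embedding dimension are illustrative defaults, not the settings of any specific ViT.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each one to an
    embedding, the step a ViT performs before applying self-attention."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the standard trick for patchify + linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        x = self.proj(x)                       # (batch, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (batch, 196, embed_dim): a "sequence" of patches

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```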
Training Vision Language Models
Training strategies align and fuse information from both vision and language encoders. This process enables the VLM to correlate images with text and jointly reason about both modalities.
Common training methods include:
- Contrastive learning
- Masking
- Generative model training
- Use of pretrained models
Contrastive learning
This technique maps image and text embeddings from both encoders into a shared space. The VLM is trained on image-text pairs and learns to minimize the distance between matching pairs while maximizing it for nonmatching ones.
An example is CLIP (Contrastive Language-Image Pretraining). Trained on 400 million image-caption pairs from the internet, CLIP demonstrates strong zero-shot classification accuracy.
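The sketch below illustrates the core idea with a CLIP-style symmetric contrastive loss; it is not CLIP's actual training code. Within a batch, each matching image-text pair sits on the diagonal of the similarity matrix and acts as a positive, while every other combination is treated as a negative.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.
    Row i of image_emb and row i of text_emb are a matching pair; all other
    combinations in the batch serve as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # positives lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy example with random 512-dimensional embeddings for a batch of 8 pairs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```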
Masking
Masking teaches VLMs to predict randomly hidden parts of text or images. In masked language modeling, the model fills in hidden words in a caption given an unmasked image; in masked image modeling, it reconstructs hidden pixels of an image given an unmasked caption.
FLAVA (Foundational Language And Vision Alignment) exemplifies this approach. It uses a vision transformer for images and transformers for language and multimodal encoding. Its multimodal encoder applies cross-attention to fuse textual and visual data. FLAVA’s training combines masking and contrastive learning.
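A simplified sketch of the masked language modeling side is shown below: a fraction of caption tokens is hidden, and only those positions are kept as prediction targets. The 15% masking ratio and the mask token id (103, BERT's [MASK] token) are conventional but illustrative choices.

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Randomly replace a fraction of tokens with a [MASK] id and return
    labels that are ignored (-100) everywhere except the masked positions."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100                 # only masked positions contribute to the loss
    masked_ids = token_ids.clone()
    masked_ids[mask] = mask_token_id
    return masked_ids, labels

ids = torch.randint(5, 1000, (2, 12))    # toy batch of caption token ids
masked_ids, labels = mask_tokens(ids, mask_token_id=103)
```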
Generative model training
This training enables VLMs to generate new data. Text-to-image generation creates images from text prompts. Conversely, image-to-text generation produces captions or summaries from images.
Popular text-to-image models include diffusion models like Google’s Imagen, Midjourney, OpenAI’s DALL-E (starting with DALL-E 2), and Stability AI’s Stable Diffusion.
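As a quick, hedged example of text-to-image generation, the snippet below calls a Stable Diffusion checkpoint through the diffusers library. The checkpoint name and prompt are placeholders, and a CUDA-capable GPU is assumed.

```python
# Sketch: generating an image from a text prompt with a diffusion model.
# The checkpoint id, prompt, and GPU assumption are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```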
Pretrained models
Because training VLMs from scratch is resource-intensive, pretrained models are often used. A pretrained LLM and vision encoder can be combined, with an added mapping layer that aligns visual embeddings with the LLM input space.
LLaVA (Large Language and Vision Assistant) illustrates this approach. It merges the Vicuna LLM and the CLIP ViT vision encoder via a linear projector to create a multimodal model.
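Conceptually, that mapping layer can be as simple as a learned linear projection from the vision encoder's output dimension to the LLM's hidden dimension, as sketched below. The dimensions (1024 and 4096) are illustrative, not LLaVA's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch: projecting vision-encoder outputs into an LLM's input embedding space.
# The dimensions below are illustrative placeholders.
vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)

patch_embeddings = torch.randn(1, 196, vision_dim)   # output of a frozen vision encoder
visual_tokens = projector(patch_embeddings)          # (1, 196, llm_dim)

# These projected "visual tokens" are concatenated with the text token embeddings
# and passed to the pretrained LLM as part of its input sequence.
```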
Gathering high-quality training data for VLMs can be challenging. However, existing datasets facilitate pretraining, optimization, and fine-tuning for specific downstream tasks.