Multimodal Support

The platform now supports multimodal capabilities, enabling Agents to process images and other content types. Vision Language Model (VLM) support is fully implemented and ready to use.

Development Phases

| Phase   | Modality        | Core Capabilities                        | Status         |
|---------|-----------------|------------------------------------------|----------------|
| Phase 1 | Vision (VLM)    | Image analysis, vision-language dialogue | ✅ Implemented |
| Future  | More Modalities | Continuous expansion                     | Planned        |

Phase 1: Vision Language Models (VLM)

Vision language model support is now available, giving Agents the ability to "see" and understand images.

Implemented Features

  • Image Input: Upload images in conversations for Agents to analyze and answer questions (see the request sketch after this list)
  • Vision-Language Analysis: Understand text, objects, scenes, and other information in images
  • VLM Model Configuration: Add vision-capable models in model management
  • Conversation History: Save conversation records containing images
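
If the vision-capable model you configure is served behind an OpenAI-compatible endpoint, a single image-plus-question request looks roughly like the sketch below. This is an illustrative assumption rather than the platform's documented API: the base URL, API key, model name, and image file are all placeholders, and the `openai` Python SDK is used only as a convenient client for such an endpoint.

```python
# Sketch: send an image and a question to a vision-capable model through an
# OpenAI-compatible chat endpoint. Base URL, API key, model name, and the
# image path are placeholders -- substitute the values from Model Management.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

# Encode a local image as a data URL so it can be embedded in the request body.
with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="your-vlm-model",  # the vision-capable model added in Model Management
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What information is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Embedding the image as a base64 data URL keeps the request self-contained; a plain HTTPS image URL in the same `image_url` field works equally well.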

How to Use

  1. Configure a VLM Model: Go to the Model Management page and add a vision-capable model
  2. Upload Images: In any conversation, click the image upload button or drag and drop images directly into the chat
  3. Ask Questions: The Agent will analyze the image and respond to your questions about its content
  4. Multi-turn Dialogue: Continue the conversation with follow-up questions about the image (a programmatic sketch of this flow follows below)
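
The same flow can also be exercised programmatically. The sketch below again assumes a hypothetical OpenAI-compatible endpoint (URL, key, model name, and image URL are placeholders): turn 1 uploads an image and asks about it, and turn 2 asks a follow-up while the image stays in the conversation history.

```python
# Sketch: multi-turn dialogue about an image via an OpenAI-compatible endpoint.
# All endpoint details and the image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")
MODEL = "your-vlm-model"  # vision-capable model configured in Model Management

# Turn 1: provide an image (here referenced by URL) and ask a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What objects are in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]
first = client.chat.completions.create(model=MODEL, messages=messages)
print(first.choices[0].message.content)

# Turn 2: keep the image and the Agent's answer in history, then ask a follow-up.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Which of those objects is closest to the camera?"})
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```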

Future Plans

After vision capabilities stabilize, we will gradually expand to image generation, audio interaction, video processing, and other modalities, bringing richer perception and creation capabilities to Agents.


Vision capabilities are now live! We will continue to track developments in multimodal technology and bring additional modalities and richer interaction capabilities to the platform.