Multimodal Support

The platform now supports multimodal capabilities, enabling Agents to process images and other content types. Vision Language Model (VLM) support is fully implemented and ready to use.

Development Phases

| Phase   | Modality        | Core Capabilities                        | Status         |
|---------|-----------------|------------------------------------------|----------------|
| Phase 1 | Vision (VLM)    | Image analysis, vision-language dialogue | ✅ Implemented |
| Future  | More Modalities | Continuous expansion                     | Planned        |

Phase 1: Vision Language Models (VLM)

Vision language model support is now available, giving Agents the ability to "see" and understand images.

Implemented Features

  • Image Input: Upload images in conversations for Agents to analyze and answer questions (see the request sketch after this list)
  • Vision-Language Analysis: Understand text, objects, scenes, and other information in images
  • VLM Model Configuration: Add vision-capable models in model management
  • Conversation History: Save conversation records containing images
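
If the vision-capable model you configure is served behind an OpenAI-compatible endpoint, a single image-plus-question request looks roughly like the sketch below. This is an illustrative assumption rather than the platform's documented API: the base URL, API key, model name, and image file are all placeholders, and the `openai` Python SDK is used only as a convenient client for such an endpoint.

```python
# Sketch: send an image and a question to a vision-capable model through an
# OpenAI-compatible chat endpoint. Base URL, API key, model name, and the
# image path are placeholders -- substitute the values from Model Management.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

# Encode a local image as a data URL so it can be embedded in the request body.
with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="your-vlm-model",  # the vision-capable model added in Model Management
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What information is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Embedding the image as a base64 data URL keeps the request self-contained; a plain HTTPS image URL in the same `image_url` field works equally well.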

How to Use

  1. Configure a VLM Model: Go to the Model Management page and add a vision-capable model
  2. Upload Images: In any conversation, click the image upload button or drag and drop images directly into the chat
  3. Ask Questions: The Agent will analyze the image and respond to your questions about its content
  4. Multi-turn Dialogue: Continue the conversation with follow-up questions about the image (a programmatic sketch of this flow follows below)
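
The same flow can also be exercised programmatically. The sketch below again assumes a hypothetical OpenAI-compatible endpoint (URL, key, model name, and image URL are placeholders): turn 1 uploads an image and asks about it, and turn 2 asks a follow-up while the image stays in the conversation history.

```python
# Sketch: multi-turn dialogue about an image via an OpenAI-compatible endpoint.
# All endpoint details and the image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")
MODEL = "your-vlm-model"  # vision-capable model configured in Model Management

# Turn 1: provide an image (here referenced by URL) and ask a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What objects are in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]
first = client.chat.completions.create(model=MODEL, messages=messages)
print(first.choices[0].message.content)

# Turn 2: keep the image and the Agent's answer in history, then ask a follow-up.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Which of those objects is closest to the camera?"})
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```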

Future Plans

After vision capabilities stabilize, we will gradually expand to image generation, audio interaction, video processing, and other modalities, bringing richer perception and creation capabilities to Agents.


Vision capabilities are now live! We will continue to track developments in multimodal technology and bring additional modalities and richer interaction capabilities to the platform.