# Multimodal Support
The platform will soon support multimodal capabilities, enabling Agents to process images, audio, video, and other content types.
## Development Phases
| Phase | Modality | Core Capabilities | Status |
|---|---|---|---|
| Phase 1 | Vision (VLM) | Image analysis, vision-language dialogue | Coming Soon |
| Future | More Modalities | Continuous expansion | Planned |
## Phase 1: Vision Language Models (VLM)
Phase 1 brings priority support for vision language models, giving Agents the ability to "see".
### Upcoming Features
- Image Input: Upload images in a conversation for the Agent to analyze and answer questions about (see the sketch after this list)
- Vision-Language Analysis: Understand text, objects, scenes, and other information in images
- VLM Model Configuration: Add vision-capable models in model management
- Conversation History: Save conversation records containing images
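The platform's VLM interface has not been published yet, so the following is only an illustrative sketch of what image input typically looks like, assuming an OpenAI-compatible chat endpoint (the convention most vision-capable model providers follow). The base URL, API key, and model id are hypothetical placeholders, not real platform values.

```python
# Illustrative sketch only: assumes an OpenAI-compatible chat endpoint.
# The base_url, api_key, and model id below are hypothetical placeholders.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://example.com/v1",  # hypothetical platform endpoint
    api_key="YOUR_API_KEY",             # hypothetical credential
)

# Encode a local image as a data URL so the request is self-contained.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="your-vision-model",  # hypothetical vision-capable model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The data URL keeps the example self-contained; endpoints that follow this convention generally also accept a plain HTTPS image URL in the same image_url field.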
### Future Plans
After vision capabilities stabilize, we will gradually expand to image generation, audio interaction, video processing, and other modalities. We will keep tracking developments in multimodal technology, bringing richer perception, creation, and interaction capabilities to Agents and the platform.