Multimodal Support

The platform will soon support multimodal capabilities, enabling Agents to process images, audio, video, and other content types.

Development Phases

Phase   | Modality        | Core Capabilities                        | Status
--------|-----------------|------------------------------------------|------------
Phase 1 | Vision (VLM)    | Image analysis, vision-language dialogue | Coming Soon
Future  | More Modalities | Continuous expansion                     | Planned

Phase 1: Vision Language Models (VLM)

Phase 1 gives priority to vision language models, granting Agents the ability to "see".

Upcoming Features

  • Image Input: Upload images in conversations for Agents to analyze and answer questions (see the sketch after this list)
  • Vision-Language Analysis: Understand text, objects, scenes, and other information in images
  • VLM Model Configuration: Add vision-capable models in model management
  • Conversation History: Save conversation records containing images
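Because these features are not released yet, the platform's actual request format is not defined. The following is only a minimal sketch of what image input to a vision-capable model typically looks like, assuming an OpenAI-compatible chat completions endpoint; the base URL, API key, and model name are placeholders, not the platform's real values.

```python
import base64
from openai import OpenAI

# Placeholder endpoint and credentials -- the platform's real values are not yet published.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

# Encode a local image so it can be embedded directly in the message.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    # Placeholder: any vision-capable model added in model management.
    model="your-vlm-model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

In this style of API, a user message carries mixed text and image parts (by URL or base64 data URI), which is what enables vision-language dialogue over uploaded images.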

Future Plans

After vision capabilities stabilize, we will gradually expand to image generation, audio interaction, video processing, and other modalities, bringing richer perception and creation capabilities to Agents.


We will continue to monitor multimodal technology developments to bring richer interaction capabilities to the platform.