NExT-GPT: Any-to-Any Multimodal LLM

by Chiranjeevi Maddala
September 25, 2023

Recent advances in multimodal large language models (MM-LLMs) have enabled AI systems to understand and reason about inputs across modalities such as text, images, video, and audio. However, most existing models are limited to multimodal comprehension: they can take several modalities as input but cannot generate content in more than one.


NExT-GPT is a new MM-LLM that can perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. Its authors present it as the first end-to-end MM-LLM to achieve this level of flexibility, and it could change the way we interact with computers.

NExT-GPT is built on top of an existing pre-trained LLM, multimodal encoders, and state-of-the-art diffusion models that serve as decoders. It uses a technique called modality-switching instruction tuning (MosIT) to learn how to switch between modalities during generation. This allows NExT-GPT to perform a wide range of tasks, such as:

Text-to-image synthesis: Generate images from text descriptions, such as “a photo of a cat sitting on a couch” or “a painting of a sunset over the ocean.”
Image-to-text synthesis: Describe images in text, such as “a photo of a black cat sitting on a red couch” or “a painting of a colorful sunset over the ocean.”
Video-to-text synthesis: Describe videos in text, such as “a video of a cat playing with a ball of yarn” or “a video of a sunset over the ocean.”
Audio-to-text synthesis: Transcribe audio to text, such as a podcast or a song.
Text-to-video synthesis: Generate videos from text descriptions, such as “a video of a cat playing with a ball of yarn” or “a video of a sunset over the ocean.”
Video-to-audio synthesis: Generate audio from videos, such as a soundtrack for a movie or a song for a music video.
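
To make the encoder, LLM, and decoder arrangement described above more concrete, here is a minimal Python sketch of how an any-to-any pipeline could route a reply to the right generator. It is a toy illustration, not NExT-GPT's actual code: the Segment class, the [IMG]/[VID]/[AUD] tags, the scripted_llm stand-in, and the stub decoders are all hypothetical names invented for this example. It also simplifies heavily by passing plain text between stages, whereas a real system would exchange learned representations with its encoders and diffusion decoders.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    modality: str  # "text", "image", "video", or "audio"
    content: str   # placeholder payload (a caption, a file name, ...)


# Hypothetical markers the language model emits to request a non-text output.
TAG_TO_MODALITY = {"IMG": "image", "VID": "video", "AUD": "audio"}
_TAG_PATTERN = re.compile(r"\[(IMG|VID|AUD)\](.*?)\[/\1\]", re.DOTALL)


def encode_inputs(inputs: List[Segment]) -> str:
    """Stand-in for the encoders: flatten every input into one prompt string."""
    return " ".join(f"<{s.modality}>{s.content}</{s.modality}>" for s in inputs)


def route_output(llm_output: str,
                 decoders: Dict[str, Callable[[str], str]]) -> List[Segment]:
    """Split the model's reply into text and tagged spans, sending each
    tagged span to the decoder registered for its modality."""
    segments: List[Segment] = []
    cursor = 0
    for match in _TAG_PATTERN.finditer(llm_output):
        text = llm_output[cursor:match.start()].strip()
        if text:
            segments.append(Segment("text", text))
        modality = TAG_TO_MODALITY[match.group(1)]
        generated = decoders[modality](match.group(2).strip())
        segments.append(Segment(modality, generated))
        cursor = match.end()
    tail = llm_output[cursor:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


# Stub decoders: a real system would call image, video, and audio generators.
STUB_DECODERS: Dict[str, Callable[[str], str]] = {
    "image": lambda prompt: f"image({prompt})",
    "video": lambda prompt: f"video({prompt})",
    "audio": lambda prompt: f"audio({prompt})",
}


def scripted_llm(prompt: str) -> str:
    """Stand-in for the LLM: returns a fixed reply that switches modality."""
    return ("Here is the scene you asked for. "
            "[IMG]a sunset over the ocean[/IMG] "
            "And its sound: [AUD]waves on a beach[/AUD]")


def respond(inputs: List[Segment],
            llm: Callable[[str], str] = scripted_llm,
            decoders: Dict[str, Callable[[str], str]] = STUB_DECODERS
            ) -> List[Segment]:
    """Encode the inputs, query the model, and route its reply."""
    return route_output(llm(encode_inputs(inputs)), decoders)


if __name__ == "__main__":
    request = [Segment("text", "Show me a sunset and let me hear it.")]
    for seg in respond(request):
        print(seg.modality, "->", seg.content)
```

The point of the sketch is the routing step: a single reply can mix text with requests for other modalities, and the model itself decides where to switch. Learning when to make that switch is the behaviour that MosIT is meant to teach.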

NExT-GPT is still under development, but it has already demonstrated impressive capabilities. For example, it can generate realistic images of different objects and scenes, translate between languages, and write creative text in a variety of formats. It could be used in a wide range of applications, such as:

Creative tools: NExT-GPT can be used to create new forms of art and entertainment, such as interactive stories, video games, and movies.
Educational tools: NExT-GPT can be used to create personalized learning experiences for students. For example, it could generate interactive exercises, provide feedback on student work, and translate educational materials into different languages.
Accessibility tools: NExT-GPT can be used to develop new tools that help people with disabilities communicate and interact with the world around them. For example, it could be used to develop real-time transcription tools for people who are deaf or hard of hearing, or to develop sign language translation tools.

NExT-GPT is a powerful new tool with the potential to change the way we interact with computers. Although it is still under development, its ability to move between text, images, video, and audio already points toward uses in creative work, education, and accessibility.

Chiranjeevi Maddala

AI Researcher & Speaker | Director - Product, Marketing & Innovation at {igebra.ai}