Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer⁺

Adam Polyak⁺

Thomas Hayes⁺

Xi Yin⁺

Jie An

Songyang Zhang

Qiyuan Hu

Harry Yang

Oron Ashual

Oran Gafni

Devi Parikh⁺

Sonal Gupta⁺

Yaniv Taigman⁺

⁺Core Contributors

Abstract

We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.

There is a table by a window with sunlight streaming through illuminating a pile of books.	A blue unicorn flying over a mystical land.	A dog wearing a Superhero outfit with red cape flying through the sky.	Clown fish swimming through the coral reef.	Robot dancing in times square.	A litter of puppies running through the yard.

A knight riding on a horse through the countryside.	A panda playing on a swing set.	A fluffy baby sloth with a knitted hat trying to figure out a laptop, close up, highly detailed, studio lighting, screen reflecting in its eyes.	A teddy bear painting a portrait.	A young couple walking in a heavy rain.	A bear driving a car.

A beautiful scenery which leads to a jumpscare.	Sailboat sailing on a sunny day in a mountain lake.	A musk ox grazing on beautiful wildflowers.	Two kangaroos are busy cooking dinner in a kitchen.	A small domesticated carnivorous mammal with soft fur, a short snout, and retractable claws.	A spaceship being pulled into a blackhole.

An artists brush painting on a canvas close up.	An emoji of a baby panda wearing a red hat, blue gloves, green shirt, and blue pants.	Cat watching tv with a remote in hand.	Horse drinking water.	Humans building a highway on mars.	Hyper-realistic photo of an abandoned industrial site during a storm.

A ufo hovering over aliens in a field.	A golden retriever eating ice cream on a beautiful tropical beach at sunset.	Monkey learning to play the piano.	Photo of a cat singing in a barbershop quartet.	A ballerina performs a beautiful and difficult dance on the roof of a very tall skyscraper, the city is lit up and glowing behind her.	Unicorns running along a beach.

a teddy bear on a skateboard in Times Square.	A ballerina performs a beautiful and difficult dance on the roof of a very tall skyscraper, the city is lit up and glowing behind her.	A panda playing on a swing set.	A fluffy baby sloth with a knitted hat trying to figure out a laptop, close up, highly detailed, studio lighting, screen reflecting in its eyes.	A teddy bear painting a portrait.

	4K Illuminated Christmas Tree at Night During Snowstorm	Construction Site Activity	Digital Information	Dramatic ocean sunset	fire
Video Diffusion Models
CogVideo
Make-A-Video (Ours)
	Irrigation Canal in Western USA Water Sourced from the Colorado River 4K Aerial Video	Mountain river	Shinjuku Time Lapse	snowfall in city	Star Time Lapse in Japan
Video Diffusion Models
CogVideo
Make-A-Video (Ours)

Make-A-Video: Text-to-Video Generation without Text-Video Data

Text-to-Video

Video Variations

Long Video Generation

Image Animation

Image Interpolation

T2V Generation Comparison

Image Interpolation Comparison

Original Image
Animated Video

First Image
Second Image
Interpolated Video