See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

[Code] [Paper] (Coming soon!)


Cognitive planning is the structural decomposition of complex tasks into a sequence of future behaviors. In the computational setting, performing cognitive planning entails grounding plans and concepts in one or more modalities so that they can be leveraged for low-level control. Since real-world tasks are often described in natural language, we devise a cognitive planning algorithm based on language-guided video prediction. Current video prediction models do not support conditioning on natural language instructions. We therefore propose a new video prediction architecture that leverages pre-trained transformers. The network can ground concepts from natural language input and generalizes to unseen objects. We demonstrate the effectiveness of this approach on a new simulation dataset, where each task is defined by a high-level action described in natural language. Our experiments compare our method against a video generation baseline that has no planning or action grounding and show significant improvements. Our ablation studies highlight the generalization that natural language embeddings bring to concept grounding, as well as the importance of planning for visual "imagination" of a task. Supplemental visual and other materials can be found at:

Interactive Demo

Text Prompt: Spell the word


Model Overview


To assess the importance of the two submodules of our system, we performed two ablation studies. Without the planner, results indicate that high-level task descriptions often do not contain enough information to perform the task, which highlights the value of a planner that breaks a task down into lower-level actions. Replacing the pre-trained natural language embeddings with one-hot encodings as the language representation shows the benefit, for concept learning, of language embeddings pre-trained in tandem with visual embeddings.
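One way to see why the one-hot ablation limits concept learning: under one-hot encoding, every pair of distinct symbols has zero similarity, so a letter never seen in training shares nothing with the letters that were seen. The snippet below is a minimal illustration of that property only. It is not the encoding code used in See-PP, and the 26-letter vocabulary and the helper names `one_hot` and `dot` are assumptions made for this sketch.

```python
import string

# Assumed vocabulary for this sketch: the 26 uppercase letters.
LETTERS = string.ascii_uppercase


def one_hot(letter):
    """Return a 26-dimensional one-hot vector for an uppercase letter."""
    vec = [0.0] * len(LETTERS)
    vec[LETTERS.index(letter)] = 1.0
    return vec


def dot(a, b):
    """Plain dot product, used here as a similarity score."""
    return sum(x * y for x, y in zip(a, b))


# A letter is only similar to itself; distinct letters are orthogonal.
sim_same = dot(one_hot("F"), one_hot("F"))
sim_diff = dot(one_hot("F"), one_hot("E"))
```

Here `sim_same` is 1.0 and `sim_diff` is 0.0. A pre-trained embedding has no such constraint, so related symbols can end up close to each other, which is the property the ablation is probing.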

Text Prompt: Spell the word
See-PP without planner
See-PP with one-hot encodings


Our dataset was created based on Ravens and can be found here (link coming soon!).

Generalization Results

Generalization results on the Spelling dataset. Language instructions are passed in along with the initial visual observation. The comparison covers 3 models, including See-PP trained on a random train/test split and See-PP trained on a seen/unseen train/test split in which 4 letters (F, Q, X, Z) were held out during training.
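The two split strategies can be sketched as follows. This is a hypothetical illustration, not the dataset tooling: the word list and the function names are invented, and the rule that any word containing a held-out letter goes to the test set is an assumption. The only detail taken from the text is that F, Q, X and Z never appear in training.

```python
import random

# Held-out letters named in the text above.
UNSEEN_LETTERS = set("FQXZ")


def seen_unseen_split(words):
    """Train on words without held-out letters; test on the rest (assumed rule)."""
    train = [w for w in words if not (set(w) & UNSEEN_LETTERS)]
    test = [w for w in words if set(w) & UNSEEN_LETTERS]
    return train, test


def random_split(words, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and cut off a test fraction."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical word list, for illustration only.
words = ["TIDI", "FOX", "CAT", "QUIZ", "DOG"]
train, test = seen_unseen_split(words)
```

With this word list, `train` holds TIDI, CAT and DOG, and `test` holds FOX and QUIZ. A random split gives no such guarantee, so every letter is likely to appear in training.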

Keyframe and Dense Prediction Results

Results for two modes of prediction. Dense prediction has a longer horizon and aims to generate smoother transitions and more realistic videos. Keyframe prediction has a shorter horizon and, while more fragmented, can yield better results, which can be useful for low-level control in robotic settings through imitation learning approaches.
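The difference in horizon between the two modes can be illustrated by how many target frames each one has to predict. The sketch below assumes keyframes are taken at a fixed stride from the dense sequence, with the final frame always kept. The text does not say how keyframes are actually selected, so both the rule and the name `keyframe_targets` are assumptions.

```python
def keyframe_targets(frames, stride):
    """Keep every `stride`-th frame, plus the final frame if it was skipped."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    keys = list(frames[::stride])
    # Make sure the end state of the task is always a target.
    if frames and (len(frames) - 1) % stride != 0:
        keys.append(frames[-1])
    return keys


# Integers stand in for the 10 frames of a dense video.
dense = list(range(10))
keys = keyframe_targets(dense, stride=4)
```

Here the dense mode has 10 targets while the keyframe mode has 4 (`[0, 4, 8, 9]`), which is one way to read "shorter horizon": fewer prediction steps, at the cost of larger jumps between consecutive frames.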

Text Prompt: Spell the word TIDI

Keyframe Dense