Cognitive planning is the structural decomposition of complex tasks into a sequence of future behaviors.
In the computational setting, performing cognitive planning entails grounding plans and concepts in one or
more modalities in order to leverage them for low-level control. Since real-world tasks are often described in
natural language, we devise a cognitive planning algorithm based on language-guided video prediction.
Current video prediction models do not support conditioning on natural language instructions.
We therefore propose a new video prediction architecture that leverages the power of pre-trained transformers.
The network grounds concepts from natural language input and generalizes to unseen objects.
We demonstrate the effectiveness of this approach on a new simulation dataset, where each task is defined by a high-level action described in natural language.
Our experiments compare our method against a video generation baseline without planning or action grounding and show significant improvements.
Our ablation studies highlight the generalization power that pre-trained natural language embeddings lend to concept grounding, as well as
the importance of planning for the visual "imagination" of a task.
Supplemental visual and other materials can be found at: see-pp.github.io
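The language conditioning described above can be sketched as follows. This is a minimal, hypothetical illustration (numpy only, with made-up names and dimensions, not the actual See-PP architecture): the visual feature of the current frame is fused with a pre-trained language embedding of the instruction, and a learned map predicts the next frame's feature.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM, TEXT_DIM, OUT_DIM = 64, 32, 64  # toy dimensions

# Hypothetical fused predictor: concatenate the current frame's visual
# feature with a language embedding, then apply a learned linear map
# (randomly initialized here) to produce the next frame's feature.
W = rng.normal(size=(FRAME_DIM + TEXT_DIM, OUT_DIM)) * 0.01

def predict_next_frame_feat(frame_feat, text_emb, W=W):
    """Predict the next frame feature, conditioned on a language embedding."""
    fused = np.concatenate([frame_feat, text_emb])  # (FRAME_DIM + TEXT_DIM,)
    return np.tanh(fused @ W)                       # (OUT_DIM,)

frame = rng.normal(size=FRAME_DIM)
instruction = rng.normal(size=TEXT_DIM)  # stands in for a pre-trained text embedding
nxt = predict_next_frame_feat(frame, instruction)
print(nxt.shape)  # (64,)
```

In the full system, the linear map would be replaced by the transformer-based predictor and the random instruction vector by an actual pre-trained language embedding.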
In order to assess the importance of our system's two submodules, we performed two ablation studies. Removing the planner shows that
high-level task descriptions often do not contain enough information to perform the task, highlighting the value of a planner that breaks a task down into
lower-level actions. Substituting one-hot encodings for the pre-trained natural language embeddings demonstrates the power, for concept learning, of
language embeddings pre-trained in tandem with visual embeddings.
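The intuition behind this ablation can be made concrete with a toy numpy example (the vectors below are illustrative, not real embeddings): one-hot codes make every pair of concepts equally dissimilar, so nothing learned for seen letters transfers to an unseen one, whereas distributed embeddings let related concepts share structure.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot codes: every pair of distinct concepts has zero similarity,
# so there is no structure for an unseen concept to inherit.
one_hot = np.eye(3)  # toy vocabulary of three letters
print(cosine(one_hot[0], one_hot[1]))  # 0.0

# Toy distributed embeddings: visually related letters ("E" and "F")
# share feature dimensions, so an unseen letter can inherit structure
# from its neighbours in embedding space.
emb = {
    "E": np.array([1.0, 0.9, 0.1]),
    "F": np.array([1.0, 0.7, 0.0]),
    "Q": np.array([0.1, 0.0, 1.0]),
}
print(cosine(emb["E"], emb["F"]) > cosine(emb["E"], emb["Q"]))  # True
```

This is why, in the ablation, pre-trained embeddings support generalization to the held-out letters while one-hot encodings do not.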
Generalization results on the Spelling dataset.
Language instructions are passed in along with the initial visual observation. Comparison between three models: See-PP trained on a random train/test split,
and See-PP trained on a seen/unseen train/test split where four letters (F, Q, X, Z) were held out during training.
Keyframe and Dense Prediction Results
Two modes of prediction. Dense prediction uses a longer horizon and aims to generate smoother transitions and more realistic videos.
Keyframe prediction uses a shorter horizon and, while more fragmented, can yield better results that are useful for low-level control
in robotic settings through imitation learning approaches.
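The relationship between the two modes can be sketched as a simple subsampling of the predicted sequence (the stride below is an assumed value for illustration, not a parameter from the paper):

```python
# Dense prediction emits every intermediate frame index of the rollout;
# keyframe prediction retains only the sparser subgoal frames.
dense_steps = list(range(12))        # e.g. a 12-frame dense rollout
keyframe_stride = 4                  # assumed stride, purely illustrative
keyframes = dense_steps[::keyframe_stride]
print(keyframes)  # [0, 4, 8]
```

A downstream imitation-learning controller would then be asked to reach each keyframe in turn, rather than track every dense frame.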