Muyao Li*, Zihao Wang*, Kaichen He, Xiaojian Ma, Yitao Liang (* equal contribution)
arXiv 2025
Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has focused primarily on action post-training, often neglecting enhancements to the foundation model itself. In response, we introduce Act from Visual Language Post-Training, a novel approach that refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' world knowledge, visual recognition, and spatial grounding in open-world environments. Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions on over 1,000 atomic tasks, including crafting, smelting, cooking, mining, and killing.
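Below is a minimal sketch of the two-stage post-training idea described above, with a toy module standing in for a real VLM/VLA; the stage structure, losses, data shapes, and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
# Toy two-stage post-training: (1) visual-language post-training of the backbone,
# (2) action post-training on instruction-conditioned trajectories.
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Stand-in for a VLM backbone with a language head and an action head."""
    def __init__(self, d=64, vocab=100, n_actions=8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.lm_head = nn.Linear(d, vocab)           # used in visual-language post-training
        self.action_head = nn.Linear(d, n_actions)   # used in action post-training

    def forward(self, x):
        return self.backbone(x)

def run_stage(model, head, batches, lr=1e-3):
    """One generic post-training stage: supervise the given head on its targets."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for feats, targets in batches:
        loss = loss_fn(head(model(feats)), targets)
        opt.zero_grad(); loss.backward(); opt.step()

model = ToyVLA()
# Stage 1: visual-language post-training (knowledge / recognition / grounding QA pairs).
vl_batches = [(torch.randn(4, 64), torch.randint(0, 100, (4,))) for _ in range(3)]
run_stage(model, model.lm_head, vl_batches)
# Stage 2: action post-training (imitation on instruction-conditioned trajectories).
act_batches = [(torch.randn(4, 64), torch.randint(0, 8, (4,))) for _ in range(3)]
run_stage(model, model.action_head, act_batches)
```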
Jingwen Deng*, Zihao Wang*, Anji Liu, Yitao Liang (* equal contribution)
arXiv 2025
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long and unsegmented, making them difficult to annotate with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we develop a self-supervised approach that segments these long videos into semantically aware, skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging the prediction errors of a pretrained unconditional action-prediction model.
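A toy sketch of that boundary-detection idea: within a skill the next action is easy to predict, so a spike in the unconditional action-prediction error suggests a skill boundary. The smoothing window and threshold rule below are illustrative assumptions, not the paper's exact algorithm.

```python
# Detect skill boundaries from per-frame prediction errors of an unconditional
# action-prediction model (here simulated with a synthetic error trace).
import numpy as np

def detect_boundaries(pred_errors, window=16, k=3.0):
    """Mark frame t as a boundary when its error spikes well above the
    running statistics of the preceding window."""
    errors = np.asarray(pred_errors, dtype=float)
    boundaries = []
    for t in range(window, len(errors)):
        mu = errors[t - window:t].mean()
        sigma = errors[t - window:t].std() + 1e-8
        if errors[t] > mu + k * sigma:
            boundaries.append(t)
    return boundaries

# Synthetic example: low error within a skill, a spike when the skill switches.
errs = np.concatenate([np.random.rand(50) * 0.1,
                       [1.0],                      # skill switch
                       np.random.rand(50) * 0.1])
print(detect_boundaries(errs))   # expect a boundary near frame 50
```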
Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
CVPR 2025
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning.
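A minimal sketch of a policy that consumes a frame concatenated with an object segmentation mask (in the system above, such masks would be supplied by a tracker like SAM-2); the network sizes and discrete action space are illustrative assumptions.

```python
# Policy conditioned on an RGB observation stacked with a binary object mask.
import torch
import torch.nn as nn

class MaskConditionedPolicy(nn.Module):
    def __init__(self, n_actions=16):
        super().__init__()
        # 3 RGB channels + 1 mask channel stacked along the channel dimension.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.action_head = nn.Linear(64, n_actions)

    def forward(self, rgb, mask):
        x = torch.cat([rgb, mask], dim=1)            # (B, 4, H, W)
        return self.action_head(self.encoder(x))

policy = MaskConditionedPolicy()
rgb = torch.rand(2, 3, 128, 128)                     # visual observations
mask = (torch.rand(2, 1, 128, 128) > 0.5).float()    # segmentation of the target object
print(policy(rgb, mask).shape)                       # torch.Size([2, 16])
```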
Shaofei Cai*, Bowei Zhang*, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang (* equal contribution)
ICLR 2025
We frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of labeled demonstrations to ensure the latent space reflects human intentions.
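A toy sketch of combining the two training signals described above: a self-imitation term on unlabeled demonstrations, where the latent intention is inferred from the trajectory itself, and an alignment term on labeled demonstrations, where the latent is pulled toward an instruction embedding. The encoders, losses, and weighting below are illustrative assumptions, not the paper's objective.

```python
# Combine an unlabeled self-imitation loss with a labeled intention-alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

traj_enc = nn.Linear(64, 32)      # trajectory -> latent intention
instr_enc = nn.Linear(16, 32)     # instruction -> latent intention
policy = nn.Linear(64 + 32, 8)    # (observation, latent) -> action logits

def loss_unlabeled(obs, actions):
    z = traj_enc(obs.mean(dim=1))                          # infer latent from the demo itself
    logits = policy(torch.cat([obs[:, -1], z], dim=-1))    # imitate the demonstrated action
    return F.cross_entropy(logits, actions)

def loss_labeled(obs, actions, instr):
    z_traj, z_instr = traj_enc(obs.mean(dim=1)), instr_enc(instr)
    align = F.mse_loss(z_traj, z_instr)                    # human intention alignment
    logits = policy(torch.cat([obs[:, -1], z_instr], dim=-1))
    return F.cross_entropy(logits, actions) + align

obs = torch.randn(4, 10, 64); actions = torch.randint(0, 8, (4,)); instr = torch.randn(4, 16)
total = loss_unlabeled(obs, actions) + 0.5 * loss_labeled(obs, actions, instr)
total.backward()
```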
Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
NeurIPS 2024
An end-to-end open-ended agent built on Vision-Language-Action (VLA) models with a self-supervised behavior tokenizer that can answer questions and follow instructions in open-world Minecraft.
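One plausible realization of a self-supervised behavior tokenizer, sketched below: trajectory chunks are encoded and snapped to the nearest entry of a learned codebook, yielding discrete behavior tokens that a VLA model could interleave with text tokens. The chunking, encoder, and codebook size are assumptions for illustration.

```python
# Quantize short behavior chunks into discrete token ids via a learned codebook.
import torch
import torch.nn as nn

class BehaviorTokenizer(nn.Module):
    def __init__(self, obs_act_dim=32, d=64, codebook_size=512):
        super().__init__()
        self.encoder = nn.GRU(obs_act_dim, d, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, d)

    def forward(self, chunk):                          # chunk: (B, chunk_len, obs_act_dim)
        _, h = self.encoder(chunk)                     # h: (1, B, d)
        z = h.squeeze(0)                               # (B, d)
        # Nearest-neighbor lookup in the codebook -> discrete behavior token ids.
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        return dists.argmin(dim=-1)                    # (B,)

tok = BehaviorTokenizer()
chunk = torch.randn(4, 8, 32)                          # a batch of short behavior chunks
print(tok(chunk))                                      # e.g. tensor([ 17, 403,  88, 256])
```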
Zihao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang
NeurIPS Workshop 2024
An agent with retrieval-augmented thought that can perform code generation, math reasoning, embodied planning, and open-ended question answering.
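A sketch of a retrieval-augmented thought loop in the spirit of the summary above: draft a chain of thought, revise each step with retrieved context, then produce the final answer. The `llm` and `retrieve` callables are hypothetical stand-ins, not APIs from the paper's code.

```python
from typing import Callable, List

def retrieval_augmented_thought(task: str,
                                llm: Callable[[str], str],
                                retrieve: Callable[[str], List[str]],
                                n_steps: int = 3) -> str:
    # 1) Draft an initial chain of thought.
    draft = llm(f"Think step by step about how to solve:\n{task}")
    steps = [s for s in draft.split("\n") if s.strip()][:n_steps]

    # 2) Revise each step using documents retrieved with the step itself as the query.
    revised = []
    for step in steps:
        context = "\n".join(retrieve(step))
        revised.append(llm(f"Revise this reasoning step using the context.\n"
                           f"Context:\n{context}\nStep:\n{step}"))

    # 3) Produce the final output conditioned on the revised thoughts.
    return llm("Task:\n" + task + "\nRevised reasoning:\n" + "\n".join(revised))

# Usage with trivial stubs (replace with real model and retriever calls):
print(retrieval_augmented_thought(
    "Write a function that reverses a string.",
    llm=lambda prompt: "step: return s[::-1]",
    retrieve=lambda query: ["Python slicing reverses sequences with s[::-1]."]))
```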
Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang
ICML 2024
This work formulates the resource-constrained model-selection task as predicting fine-tuning performance and introduces the concept of "pre-learned data size" into the Rectified Scaling Law, which overcomes theoretical limitations and fits experimental results much better.
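For illustration, the sketch below fits a scaling law that includes a pre-learned data size term D_l, in the spirit of the summary above; the functional form used here is an assumption for illustration, not a quotation of the paper's equation.

```python
# Fit loss(D) ~ B / (D_l + D)**beta + E to synthetic fine-tuning measurements.
import numpy as np
from scipy.optimize import curve_fit

def rectified_law(D, B, D_l, beta, E):
    return B / (D_l + D) ** beta + E

# Synthetic "fine-tuning loss vs. data size" observations.
D = np.logspace(2, 6, 20)
obs = rectified_law(D, B=50.0, D_l=1e4, beta=0.5, E=1.2) \
      + np.random.normal(0, 0.01, size=D.shape)

params, _ = curve_fit(rectified_law, D, obs,
                      p0=[10.0, 1e3, 0.5, 1.0], bounds=(0, np.inf))
B, D_l, beta, E = params
print(f"fitted pre-learned data size D_l ~ {D_l:.3g}, beta ~ {beta:.3f}")
```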
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
T-PAMI 2024
A multi-task agent that can self-improve in open-ended Minecraft and accomplish more than 200 tasks.
Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
ICLR 2024 Spotlight
This work proposes to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations, and implements the agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers.
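A minimal sketch of the reference-video-as-instruction idea: a video encoder summarizes a gameplay clip into a goal latent, and a causal transformer policy predicts actions from observations conditioned on that latent. Layer sizes and the fusion scheme are illustrative assumptions.

```python
# Encoder-decoder sketch: reference video -> goal latent -> causal policy over observations.
import torch
import torch.nn as nn

class VideoInstructedPolicy(nn.Module):
    def __init__(self, obs_dim=128, d=128, n_actions=16):
        super().__init__()
        self.video_encoder = nn.GRU(obs_dim, d, batch_first=True)     # encodes the reference clip
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.policy = nn.TransformerEncoder(layer, num_layers=2)       # causal transformer policy
        self.obs_proj = nn.Linear(obs_dim, d)
        self.action_head = nn.Linear(d, n_actions)

    def forward(self, ref_video, obs_seq):
        _, goal = self.video_encoder(ref_video)            # (1, B, d): goal latent from the clip
        x = self.obs_proj(obs_seq) + goal.transpose(0, 1)  # broadcast goal over the trajectory
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.policy(x, mask=causal)
        return self.action_head(h)                         # (B, T, n_actions)

model = VideoInstructedPolicy()
ref = torch.randn(2, 32, 128)     # reference gameplay video features
obs = torch.randn(2, 10, 128)     # current trajectory observations
print(model(ref, obs).shape)      # torch.Size([2, 10, 16])
```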
Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, Feng Yin, Yitao Liang, Yaodong Yang
AAAI 2024 Oral
ProAgent is a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with teammates. It exhibits a high degree of modularity and interpretability, making it easy to integrate into various coordination scenarios.
Haowei Lin, Zihao Wang, Jianzhu Ma, Yitao Liang
NeurIPS Workshop 2023
We introduce Minecraft Universe, a comprehensive evaluation framework set within the open-world video game Minecraft. It combines a task-composition mechanism capable of generating infinitely many diverse tasks of varying difficulty with a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment.
Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang
NeurIPS 2023, Best Paper Award at ICML 2023 TEACH Workshop
We investigate the challenge of task planning for multi-task embodied agents in open-world environments. We propose "Describe, Explain, Plan and Select" (DEPS), an interactive planning approach based on Large Language Models. DEPS facilitates better error correction of the initial LLM-generated plan by integrating descriptions of the plan-execution process and providing self-explanations of feedback when failures are encountered during the extended planning phases. Furthermore, it includes a goal selector, a trainable module that ranks parallel candidate sub-goals by their estimated steps to completion, thereby refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performance.
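A simplified sketch of the describe-explain-plan-select loop summarized above; the `llm`, `selector`, and `execute` callables are hypothetical stand-ins, and the control flow compresses the full pipeline.

```python
# Interactive planning loop: plan, select a sub-goal, and on failure describe,
# explain, and re-plan with the LLM.
def deps_loop(task, llm, selector, execute, max_rounds=5):
    plan = [g.strip() for g in llm(f"Plan sub-goals for: {task}").split(";")]
    for _ in range(max_rounds):
        if not plan:
            return True                          # all sub-goals accomplished
        goal = selector(plan)                    # rank parallel candidates, pick one
        ok, trace = execute(goal)
        if ok:
            plan.remove(goal)
            continue
        # Describe the failure, let the LLM explain it, and re-plan.
        description = f"Failed sub-goal '{goal}'. Execution trace: {trace}"
        explanation = llm(f"Explain why this failed: {description}")
        plan = [g.strip() for g in
                llm(f"Revise the plan for '{task}' given: {explanation}").split(";")]
    return not plan

# Usage with trivial stubs (replace with a real LLM, goal selector, and controller):
done = deps_loop(
    "craft a wooden pickaxe",
    llm=lambda p: "collect logs; craft planks; craft sticks; craft wooden pickaxe",
    selector=lambda plan: plan[0],
    execute=lambda goal: (True, "ok"))
print(done)
```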
Shaofei Cai, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
CVPR 2023
This work proposes a Goal-Sensitive Backbone (GSB) for the policy to encourage the emergence of goal-relevant visual state representations in Minecraft, along with an adaptive horizon prediction module that helps alleviate the learning uncertainty brought by the non-stationary dynamics.
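A rough sketch of the two ingredients named above: goal information fused into the visual backbone (here via FiLM-style channel modulation, one common choice and not necessarily the paper's design) and an auxiliary head predicting the remaining horizon alongside the action head.

```python
# Goal-conditioned backbone with an auxiliary horizon-prediction head.
import torch
import torch.nn as nn

class GoalSensitivePolicy(nn.Module):
    def __init__(self, goal_dim=32, n_actions=16):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=5, stride=2)
        self.film = nn.Linear(goal_dim, 2 * 64)          # per-channel scale and shift from the goal
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.action_head = nn.Linear(64, n_actions)
        self.horizon_head = nn.Linear(64, 1)             # predicts remaining steps to the goal

    def forward(self, frame, goal):
        feat = torch.relu(self.conv(frame))              # (B, 64, H, W)
        scale, shift = self.film(goal).chunk(2, dim=-1)
        feat = feat * (1 + scale[..., None, None]) + shift[..., None, None]
        pooled = self.pool(feat)
        return self.action_head(pooled), self.horizon_head(pooled)

policy = GoalSensitivePolicy()
actions, horizon = policy(torch.rand(2, 3, 128, 128), torch.randn(2, 32))
print(actions.shape, horizon.shape)   # torch.Size([2, 16]) torch.Size([2, 1])
```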
Zihao Wang, Chunxu Wu, Yifei Yang, Zhen Li
CVPR 2023
Key-point detection and description, i.e., estimating stable locations and discriminative representations of local features, is a fundamental task in visual applications. This work proposes to learn transformation-predictive representations with self-supervised contrastive learning.
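A generic sketch of contrastive learning over local descriptors: descriptors of corresponding points under a known transformation are treated as positives, all other points in the batch as negatives. This standard InfoNCE formulation is for illustration and is not necessarily the paper's exact loss.

```python
# InfoNCE over matched local descriptors from two views of the same image.
import torch
import torch.nn.functional as F

def descriptor_info_nce(desc_a, desc_b, temperature=0.07):
    """desc_a[i] and desc_b[i] describe the same physical point in two views
    related by a known transformation (e.g., a homography)."""
    a = F.normalize(desc_a, dim=-1)
    b = F.normalize(desc_b, dim=-1)
    logits = a @ b.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(a.size(0))             # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

desc_view1 = torch.randn(256, 128)                # descriptors from the original image
desc_view2 = torch.randn(256, 128)                # descriptors from the warped image
print(descriptor_info_nce(desc_view1, desc_view2))
```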
Zihao Wang, Zhen Li, Xueyi Li, Wenjie Chen, Xiangdong Liu
TNNLS 2022
GCLFeat, a self-supervised graph-based contrastive learning framework for training local-feature models, which outperforms state-of-the-art supervised baselines on diverse downstream benchmarks, including image matching, 3-D reconstruction, and visual localization.
Zihao Wang, Xueyi Li, Zhen Li
IJCAI 2021
A novel Soft Point-Wise Transformer for Descriptor and Detector that simultaneously mines long-range intrinsic and cross-scale dependencies of local features, outperforming existing state-of-the-art methods on image matching and visual localization benchmarks.