I am a Ph.D. student in the Institute for Artificial Intelligence at Peking University (PKU), advised by Prof. Yitao Liang. Before joining PKU, I received my MSc and BA degrees in Control Science and Technology from Beijing Institute of Technology. I work on building open-ended embodied agents with multi-task skills in 3D games, including visual localization, task planning, and decision-making. In particular, I am interested in building and leveraging large pre-trained foundation models to improve the generalization of agent capabilities.
Recently, we have developed a series of open-world multi-task agents, including OmniJARVIS (pretrained hierarchical Vision-Language-Action models with a self-supervised quantized tokenizer and latent space), JARVIS-1 (self-improving agent with multimodal memory), DEPS (interactive long-horizon planning agent), RAT (open deep research agent), GROOT (self-supervised latent-space multi-task policy), and ProAgent (collaborative LLM-based game agents).
") does not match the recommended repository name for your site ("
").
", so that your site can be accessed directly at "http://
".
However, if the current repository name is intended, you can ignore this message by removing "{% include widgets/debug_repo_name.html %}
" in index.html
.
",
which does not match the baseurl
("
") configured in _config.yml
.
baseurl
in _config.yml
to "
".
Muyao Li*, Zihao Wang*, Kaichen He, Xiaojian Ma, Yitao Liang (* equal contribution)
arXiv 2025
Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing.
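To make the two-stage idea concrete, here is a minimal sketch (not the released implementation) of "visual-language post-training first, action post-training second": a toy VLM backbone is first tuned with a vision-grounded language objective, then its action head is trained by imitation. `TinyVLM`, both loss functions, and all dimensions are illustrative assumptions.

```python
# Hedged sketch of a two-stage "visual-language then action" post-training loop.
# All module names, datasets, and shapes are placeholders, not the paper's code.
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    """Stand-in for a pretrained vision-language backbone."""
    def __init__(self, dim=32, vocab=100, num_actions=10):
        super().__init__()
        self.vision = nn.Linear(3 * 16 * 16, dim)        # toy image encoder
        self.text = nn.Embedding(vocab, dim)             # toy text encoder
        self.lm_head = nn.Linear(dim, vocab)             # language head (stage 1)
        self.action_head = nn.Linear(dim, num_actions)   # action head (stage 2)

    def fuse(self, image, tokens):
        return self.vision(image.flatten(1)) + self.text(tokens).mean(1)


def stage1_visual_language(model, batch):
    """Self-supervised VL post-training: predict text tokens grounded in the
    image (world knowledge, visual recognition, spatial grounding)."""
    logits = model.lm_head(model.fuse(batch["image"], batch["tokens"]))
    return nn.functional.cross_entropy(logits, batch["target_token"])


def stage2_action(model, batch):
    """Action post-training: imitate demonstrated actions given instruction + observation."""
    logits = model.action_head(model.fuse(batch["image"], batch["tokens"]))
    return nn.functional.cross_entropy(logits, batch["action"])


model = TinyVLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
fake_batch = {"image": torch.randn(4, 3, 16, 16),
              "tokens": torch.randint(0, 100, (4, 8)),
              "target_token": torch.randint(0, 100, (4,)),
              "action": torch.randint(0, 10, (4,))}

for loss_fn in (stage1_visual_language, stage2_action):  # stage 1, then stage 2
    loss = loss_fn(model, fake_batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```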
Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
CVPR 2025
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning.
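Below is a hedged sketch (not the released ROCKET-1 code) of how a mask-conditioned low-level policy might look: the RGB observation is concatenated with a segmentation-mask channel highlighting the object a VLM wants the agent to interact with, and a stub stands in for a real-time tracker such as SAM-2. Shapes and names are assumptions.

```python
# Illustrative mask-conditioned policy; architecture and dimensions are placeholders.
import torch
import torch.nn as nn


class MaskConditionedPolicy(nn.Module):
    def __init__(self, num_actions=20):
        super().__init__()
        # 3 RGB channels + 1 mask channel marking the object of interest
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.action_head = nn.Linear(32, num_actions)

    def forward(self, rgb, mask):
        x = torch.cat([rgb, mask], dim=1)          # (B, 4, H, W)
        return self.action_head(self.encoder(x))   # action logits


def track_object(prev_mask):
    """Placeholder for a real-time tracker (e.g. SAM-2) that propagates the
    segmentation prompt from past observations to the current frame."""
    return prev_mask  # identity stand-in


policy = MaskConditionedPolicy()
rgb = torch.rand(1, 3, 128, 128)
mask = torch.zeros(1, 1, 128, 128)
mask[..., 40:80, 40:80] = 1.0                      # region a VLM wants the agent to act on
action = policy(rgb, track_object(mask)).argmax(dim=-1)
```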
Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
NeurIPS 2024
An end-to-end, open-ended agent built on Vision-Language-Action (VLA) models with a self-supervised behavior tokenizer, which can answer questions and follow instructions in open-world Minecraft.
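As an illustration of the behavior-tokenizer idea (a sketch under assumed dimensions, not the OmniJARVIS code), the snippet below encodes a short observation-action segment and snaps it to the nearest entry of a learned codebook, yielding a discrete behavior token that an autoregressive VLA could interleave with text and vision tokens.

```python
# Hedged sketch of vector-quantized behavior tokenization; all sizes are illustrative.
import torch
import torch.nn as nn


class BehaviorTokenizer(nn.Module):
    def __init__(self, obs_dim=64, act_dim=8, latent_dim=32, codebook_size=512):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + act_dim, latent_dim, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, obs, act):
        seq = torch.cat([obs, act], dim=-1)           # (B, T, obs_dim + act_dim)
        _, h = self.encoder(seq)                      # final hidden state: (1, B, latent_dim)
        z = h.squeeze(0)                              # (B, latent_dim)
        # Nearest-neighbour lookup in the codebook -> discrete behavior token ids
        dists = torch.cdist(z, self.codebook.weight)  # (B, codebook_size)
        return dists.argmin(dim=-1)                   # (B,) token ids


tok = BehaviorTokenizer()
obs = torch.randn(2, 16, 64)     # 16-step observation features
act = torch.randn(2, 16, 8)      # 16-step actions
behavior_tokens = tok(obs, act)  # two discrete code indices in [0, 512)
```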
Zihao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang
NeurIPS Workshop 2024
An agent with retrieval-augmented thought that can perform code generation, math reasoning, embodied planning, and open-ended question answering.
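A minimal sketch of a retrieval-augmented thought loop, assuming placeholder `generate` and `retrieve` functions rather than any specific LLM or search backend: draft a chain of thought, then revise each step with retrieved context before producing the final answer.

```python
# Hedged control-flow sketch; the LLM and retriever calls are stubs, not real APIs.
def generate(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return f"[generated continuation of: {prompt[:40]}...]"


def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder for a retriever (vector store, web search, codebase index)."""
    return [f"[doc {i} relevant to: {query[:40]}]" for i in range(k)]


def retrieval_augmented_thought(task: str, num_steps: int = 3) -> str:
    # 1) Draft an initial chain of thought without retrieval.
    thoughts = [generate(f"Task: {task}\nStep {i + 1}:") for i in range(num_steps)]
    # 2) Revise each step, conditioning on documents retrieved for that step.
    for i, step in enumerate(thoughts):
        context = "\n".join(retrieve(step) + thoughts[:i])
        thoughts[i] = generate(f"Revise this step using the context:\n{context}\n{step}")
    # 3) Produce the final answer from the revised thoughts.
    return generate("Answer the task given these steps:\n" + "\n".join(thoughts))


print(retrieval_augmented_thought("Write a function that parses ISO dates"))
```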
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
T-PAMI 2024
A multi-task agent that can self-improve in open-ended Minecraft and accomplish 200+ tasks.
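The toy loop below illustrates, with placeholder helpers rather than the JARVIS-1 code, the kind of self-improvement the description implies: plan with experiences retrieved from a growing multimodal memory, act, and write successful episodes back so later plans improve.

```python
# Hedged sketch of a memory-driven self-improvement loop; every helper is a stub.
import random

memory: list[dict] = []  # entries like {"task": ..., "obs": ..., "plan": ...}


def retrieve_similar(task: str, k: int = 2) -> list[dict]:
    # Stand-in for multimodal retrieval over (text, visual) keys.
    return [m for m in memory if m["task"] == task][:k]


def plan(task: str, obs: str, experiences: list[dict]) -> list[str]:
    # Stand-in for an LLM planner conditioned on retrieved experiences.
    if experiences:
        return list(experiences[0]["plan"])          # reuse a past successful plan
    return [f"explore for {task}", f"gather materials for {task}", f"finish {task}"]


def execute(subgoal: str) -> bool:
    return random.random() > 0.3                     # toy environment outcome


def run_episode(task: str, obs: str = "plains biome") -> bool:
    steps = plan(task, obs, retrieve_similar(task))
    success = all(execute(s) for s in steps)
    if success:                                      # self-improvement: keep what worked
        memory.append({"task": task, "obs": obs, "plan": steps})
    return success


for _ in range(5):
    run_episode("craft a stone pickaxe")
print(f"stored experiences: {len(memory)}")
```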
Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang
NeurIPS 2023; Best Paper Award at the ICML 2023 TEACH Workshop
We investigate the challenge of task planning for multi-task embodied agents in open-world environments. We propose "Describe, Explain, Plan and Select" (DEPS), an interactive planning approach based on Large Language Models. DEPS facilitates better error correction of the initial LLM-generated plan by integrating descriptions of the plan execution process and providing self-explanations of feedback when failures are encountered during the extended planning phases. Furthermore, it includes a goal selector, a trainable module that ranks parallel candidate sub-goals by their estimated steps to completion, thereby refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks, nearly doubling the overall performance.
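To make the control flow concrete, here is a hedged sketch of a describe-explain-plan-select loop with placeholder LLM, selector, and controller calls; it is not the DEPS implementation, only an illustration of how description and self-explanation feed replanning, and how a selector ranks parallel sub-goals.

```python
# Illustrative DEPS-style loop; all functions below are stubs, not the paper's code.
def llm(prompt: str) -> str:
    return f"[LLM output for: {prompt[:40]}...]"      # placeholder LLM call


def selector_score(subgoal: str, obs: str) -> float:
    return float(len(subgoal))  # stand-in for a learned estimate of steps-to-complete


def execute(subgoal: str) -> str:
    return "missing prerequisite item"                # toy controller feedback


def deps(task: str, obs: str, max_rounds: int = 3) -> list[str]:
    plan_text = llm(f"Plan sub-goals for: {task}")
    candidates: list[str] = []
    for _ in range(max_rounds):
        candidates = [g.strip() for g in plan_text.split(",") if g.strip()]
        # Select: rank parallel candidate sub-goals by estimated completion cost.
        candidates.sort(key=lambda g: selector_score(g, obs))
        feedback = execute(candidates[0])             # attempt the cheapest sub-goal
        if feedback == "success":
            return candidates
        # Describe the failed execution, ask for a self-explanation, then replan.
        description = llm(f"Describe what happened: {feedback}")
        explanation = llm(f"Explain why the plan failed: {description}")
        plan_text = llm(f"Revise the plan for {task} given: {explanation}")
    return candidates


print(deps("obtain a diamond", "forest biome"))
```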