Digital Agents and VLMs

Tianbao Xie and Danyang Zhang

University of Hong Kong and Shanghai Jiao Tong University

The NLP Reading Group is happy to have Tianbao Xie and Danyang Zhang give a talk about “Digital Agents and VLMs”.

Talk Description

In this talk, we will explore the transformative potential of autonomous digital agents powered by large vision-language models (VLMs) in enhancing human-computer interaction. These agents can follow high-level natural language instructions to perform complex tasks across various applications and interfaces, significantly boosting productivity. However, the development of such multimodal agents has been hindered by the lack of a comprehensive benchmark that reflects the diversity and complexity of real-world computer use. To address this, we introduce OSWORLD, a scalable, real computer environment designed for developing and evaluating multimodal agents capable of executing a wide range of tasks across different operating systems and applications. OSWORLD supports free-form keyboard and mouse control, initial task state configuration, execution-based evaluation, and interactive learning, making it a unified platform for defining and testing agent tasks without the need for application-specific simulated environments. Join us to learn about the concept of XLANG, VLM-based agents, and the OSWORLD benchmark.

Speaker Bios

Tianbao Xie is a Ph.D. candidate at the XLANG NLP Lab of the HKU NLP Group at the University of Hong Kong, under the guidance of Prof. Tao Yu and Prof. Lingpeng Kong. His primary research interest is in Natural Language Processing and Artificial Intelligence. He is passionate about key problems in building applications to liberate human employees from labor. Currently, he focuses mainly on manipulating large language models for NLP and Robotics tasks (UnifiedSKG, Binder, Text2Reward, OpenAgents, and OSWorld). He aims to approach them in an agent-centric way for modeling in 2023 and 2024.

Danyang Zhang is a Ph.D. candidate at X-Lance Lab, Shanghai Jiao Tong University, under the supervision of Prof. Kai Yu, majoring in computer science and technology. Before that, he received his B.S. degree from the Department of Automation, Tsinghua University in 2020. Over the past four months, he worked as a research assistant with Prof. Tao Yu at XLANG, HKU, where he collaborated with Tianbao on OSWorld. His research focuses on text-rich visual UI interaction. Currently, he is working on the construction of realistic, complex benchmarks for GUI interaction. He is also studying how to design smarter GUI agents with reinforcement learning (RL), large language models (LLMs), or a combination of both.

Logistics

Date: May 27th
Time: 11:00 AM
Location: F01 or via Zoom (see email)