Google Cloud and DeepMind have introduced a framework, Watch & Learn (W&L), to address the challenge of training computer-use agents (CUAs) efficiently. The framework aims to ease the data bottleneck in CUA development by automatically generating high-quality training examples from raw videos of human demonstrations, with no human annotation required.
Using the data produced by Watch & Learn, companies can enhance their existing computer-use models and build custom CUAs for internal tasks without the expense of specialized model-training pipelines. The framework's approach not only improves performance on computer-use tasks but also yields in-context learning examples for CUAs, making real-world application possible without extensive manual intervention.
Watch & Learn reframes the creation of CUA demonstrations around an 'inverse dynamics objective': rather than generating trajectories directly, a model predicts the intermediate action that occurred between two consecutive observations. This formulation leads to more robust results that generalize better across applications.
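The inverse dynamics idea can be illustrated with a toy sketch. Here observations are positions on a toy grid "screen" and actions are discrete moves; the real framework operates on UI screenshots and interface actions, so every name and data structure below is an illustrative assumption, not the paper's API.

```python
# Toy illustration of the inverse-dynamics objective: given two consecutive
# observations, predict the action that connects them (hypothetical example).
from typing import List, Tuple

State = Tuple[int, int]  # (x, y) position on a toy grid "screen"

# Mapping from observation delta to the action that caused it.
ACTIONS = {
    (0, 1): "up",
    (0, -1): "down",
    (-1, 0): "left",
    (1, 0): "right",
}

def inverse_dynamics(o_t: State, o_next: State) -> str:
    """Predict the intermediate action between observations o_t and o_next."""
    delta = (o_next[0] - o_t[0], o_next[1] - o_t[1])
    return ACTIONS.get(delta, "noop")

def label_video(frames: List[State]) -> List[Tuple[State, str, State]]:
    """Turn an unlabeled sequence of observations into annotated
    (state, action, next_state) transitions for training a CUA."""
    return [
        (o_t, inverse_dynamics(o_t, o_next), o_next)
        for o_t, o_next in zip(frames, frames[1:])
    ]

# An unlabeled "video": a sequence of observations with no action labels.
video = [(0, 0), (1, 0), (1, 1), (1, 0)]
trajectory = label_video(video)
# trajectory == [((0, 0), "right", (1, 0)),
#                ((1, 0), "up",    (1, 1)),
#                ((1, 1), "down",  (1, 0))]
```

The key point is that the model only ever has to solve the local problem of explaining one observation pair, which is far easier than generating a whole correct trajectory from scratch.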
The framework operates in three key stages: training an inverse dynamics model, retrieving raw videos, and training CUA agents on the resulting data. Through this process, the researchers generated a substantial corpus of state transitions and produced annotated trajectories with high-accuracy action labels.
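The three stages above can be sketched end to end on the same toy grid data. All function names and data structures here are illustrative assumptions standing in for the real components (a learned inverse dynamics model, web-scale video retrieval, and screenshot-level annotation):

```python
# Hypothetical end-to-end sketch of the three W&L stages on toy data.

def train_inverse_dynamics_model(labeled_transitions):
    """Stage 1: 'train' an inverse dynamics model. Here we simply memorize
    which observation delta corresponds to which action, as a stand-in for
    a learned model."""
    model = {}
    for o_t, action, o_next in labeled_transitions:
        delta = (o_next[0] - o_t[0], o_next[1] - o_t[1])
        model[delta] = action
    return model

def retrieve_raw_videos():
    """Stage 2: fetch unlabeled observation sequences, a stand-in for
    retrieving real screen-recording videos from the web."""
    return [[(0, 0), (1, 0), (1, 1)], [(2, 2), (2, 1)]]

def generate_trajectories(idm, videos):
    """Stage 3: apply the inverse dynamics model to each video, producing
    annotated (state, action, next_state) trajectories for CUA training."""
    return [
        [
            (o_t, idm[(o_next[0] - o_t[0], o_next[1] - o_t[1])], o_next)
            for o_t, o_next in zip(frames, frames[1:])
        ]
        for frames in videos
    ]

# Small labeled seed set for stage 1 (in practice, transitions with known
# actions can be logged directly from a controlled environment).
seed = [((0, 0), "right", (1, 0)), ((0, 0), "up", (0, 1)),
        ((0, 0), "left", (-1, 0)), ((0, 0), "down", (0, -1))]
idm = train_inverse_dynamics_model(seed)
corpus = generate_trajectories(idm, retrieve_raw_videos())
```

The output `corpus` is the kind of annotated trajectory dataset that the final stage uses to fine-tune a CUA or to supply as in-context examples.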
In testing, Watch & Learn improved both the fine-tuning of open-source models and the in-context learning performance of general-purpose multimodal models. These results demonstrate the scalability and practicality of mining web-scale human workflows to move CUAs toward real-world deployment.
This framework streamlines CUA development and enables enterprises to leverage existing video resources for training data, paving the way for more efficient and cost-effective CUA implementations.
Source: VentureBeat