DeepMind, the minds behind AlphaFold, has unveiled its latest project: Gato -- a “general purpose” system that’s designed to take on several different tasks.
Most AI systems are trained for a single task, or a narrow set of them. The Google-owned company has come up with a method by which one AI system can undertake hundreds of different tasks.
Gato – described in DeepMind's paper as “a generalist agent” – can play Atari, caption images, chat, stack blocks with a real robot arm, and more. The system can decide whether to output text, joint torques, button presses, or other tokens based on context.
The likes of Nando de Freitas, Yutian Chen, and Ali Razavi were part of the team behind the system.
How it works
In a paper detailing Gato, the researchers sought to apply a similar approach found in large-scale language modeling. Gato was trained on data covering different tasks and modalities. This data was serialized into a flat sequence of tokens which was then batched and processed by a transformer neural network similar to a large language model.
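The serialization step above can be sketched roughly as follows. This is an illustrative toy, not DeepMind's code: the tokenizer, the bin count, and the ID offsets are all assumptions made for the example.

```python
# Hypothetical sketch: flattening mixed-modality data (text + continuous
# action values) into a single sequence of integer tokens, the way a
# large-language-model pipeline would. All ranges here are illustrative.

def tokenize_text(s):
    # stand-in for a real subword tokenizer: one ID per character
    return [ord(c) % 1000 for c in s]

def tokenize_action(vec, bins=1024, offset=1000):
    # discretize each continuous value in [-1, 1] into fixed bins,
    # shifted into a separate ID range so they don't collide with text
    return [offset + min(int((v + 1) / 2 * bins), bins - 1) for v in vec]

def serialize_episode(observation_text, action_vector):
    # flatten one (observation, action) pair into one flat token list
    return tokenize_text(observation_text) + tokenize_action(action_vector)

tokens = serialize_episode("block is left of gripper", [0.3, -0.7, 0.1])
```

Once every modality lives in the same flat token space, the same transformer training loop used for language models can consume the batched sequences unchanged.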
“The loss is masked so that Gato only predicts action and text targets,” the paper reads.
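Loss masking of this kind can be illustrated with a few lines of Python. This is a generic sketch of the idea, not the paper's implementation: per-position log-probabilities and the 0/1 mask are assumed inputs.

```python
# Sketch of a masked loss: only positions flagged as action/text targets
# contribute; observation positions are zeroed out of the average.

def masked_nll(log_probs, target_mask):
    # log_probs: log-probability of the correct token at each position
    # target_mask: 1 where the position is an action/text target, else 0
    total = sum(lp * m for lp, m in zip(log_probs, target_mask))
    count = sum(target_mask)
    return -total / count if count else 0.0

# first position is an observation token, so it is ignored
loss = masked_nll([-0.1, -2.3, -0.5], [0, 1, 1])  # → 1.4
```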
Upon deployment, a prompt is tokenized, which forms an initial sequence. The environment yields the first observation – which again, is tokenized and appended to the sequence. Gato then samples the action vector autoregressively, one token at a time.
Once all tokens comprising the action vector have been sampled, the action is decoded and sent to the environment which steps and yields a new observation. Then the procedure repeats. The DeepMind researchers suggest the model “always sees all previous observations and actions within its context window of 1024 tokens.”
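The deployment loop the researchers describe can be sketched as below. The model and environment interfaces are stand-ins invented for the example; only the 1024-token context window and the tokenize-observe-sample-step structure come from the paper.

```python
# Illustrative Gato-style deployment loop: append tokenized observations,
# sample the action vector one token at a time, act, repeat. The model is
# faked with a deterministic sampler so the sketch is self-contained.
import random

CONTEXT_WINDOW = 1024  # token limit stated in the paper
ACTION_LEN = 3         # illustrative number of tokens per action vector

def sample_next_token(sequence):
    # stand-in for the transformer's next-token distribution
    random.seed(len(sequence))
    return random.randrange(1000, 2024)

def run_episode(prompt_tokens, env, steps=2):
    sequence = list(prompt_tokens)          # tokenized prompt
    for _ in range(steps):
        sequence += env.observe()           # tokenized observation
        for _ in range(ACTION_LEN):         # sample action autoregressively
            sequence.append(sample_next_token(sequence))
        env.step(sequence[-ACTION_LEN:])    # decode action and act
        sequence = sequence[-CONTEXT_WINDOW:]  # keep last 1024 tokens
    return sequence

class DummyEnv:
    def observe(self): return [1, 2, 3]
    def step(self, action): pass

seq = run_episode([7, 8], DummyEnv())
```

Truncating to the last 1024 tokens mirrors the paper's note that the model only ever conditions on what fits inside its context window.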
The system itself was trained on a sizable dataset spanning both simulated and real-world environments, alongside several natural language and image datasets.