It’s been nearly one year since the launch of ChatGPT, a launch that sparked tremendous excitement about the capabilities of Large Language Models (LLMs). In that short span, these models have demonstrated an ability to generate human-like text, images, code, and more.
LLMs such as GPT-3 are opening doors to new applications in customer service, content creation, and beyond. This technology also serves an important role in robotics, making interaction between humans and machines smoother and more intuitive.
It was over a decade ago that Apple first introduced Siri, showcasing how natural language could be integrated with electronics such as phones. Since then, natural language processing capabilities have advanced tremendously, leading to applications such as ChatGPT and Claude.
These LLMs trained on massive datasets have incredible natural language understanding, reasoning, and generation abilities. This makes them a promising tool for powering more intuitive communication in robotics applications.
In November 2022, ChatGPT launched to great fanfare. It became one of the fastest-growing applications of all time, reaching one million users in just five days. Since then it’s become a household name, popular even among those without a technical background. There are now many generative AI chat agents, including Google’s Bard and Anthropic’s Claude.
While LLMs possess vast amounts of knowledge acquired from training, they lack a presence in the physical realm. Conversely, robots operate in the physical world, but interfacing with them can be challenging and limited. LLMs can help bridge the gap between humans and robots by streamlining communications and creating a more seamless experience.
Prior to LLMs, human-robot communication primarily relied on more rudimentary methods such as direct programming, remote control, touchscreen interfaces, and basic voice commands. Interaction was often limited and mostly one-way. These limitations meant humans had to accommodate the machine, e.g. using an exact phrase to specify a command.
LLMs have the ability to understand nuanced natural language in a wide array of settings, allowing robots to comprehend instructions, requests and contexts. Instead of writing code, language instructions can be used to configure or direct robot actions. The ability to have natural conversation with robots could expand the environments and workflows robots can operate in.
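As a concrete illustration of directing a robot with language instead of code, the sketch below translates a free-form instruction into a structured command. The prompt template, the `interpret_command` helper, and the `fake_llm` stub are all hypothetical names invented for this example; in a real system the completion would come from an actual LLM API rather than the stub.

```python
import json

# Prompt asking the model to translate a natural-language instruction
# into a structured robot command (hypothetical template, not from any
# specific product).
PROMPT = (
    "Translate the user's instruction into JSON with keys "
    "'action' and 'target'.\nInstruction: {instruction}\nJSON:"
)

def fake_llm(prompt: str) -> str:
    # Stub standing in for a real LLM call, so the sketch is self-contained.
    if "red block" in prompt:
        return '{"action": "pick_up", "target": "red block"}'
    return '{"action": "noop", "target": null}'

def interpret_command(instruction: str) -> dict:
    """Turn free-form language into a command the robot's controller can execute."""
    completion = fake_llm(PROMPT.format(instruction=instruction))
    return json.loads(completion)

command = interpret_command("Please grab the red block on the table")
```

The key design choice is asking the model for machine-readable JSON rather than prose, so the robot’s control code can dispatch on `command["action"]` without further parsing.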
Multimodal models are designed to take inputs from multiple types of data, such as text, images, audio, sensor data, or video, giving robots contextual understanding. By combining multiple data streams, a robot can recognize an object visually and then use text to understand what it is or how to interact with it. Multimodality can also enhance object recognition: if a robot is unable to identify an object or misses a visual cue, you could help it understand by speaking to it. This can improve navigation, allowing robots to function in a broader range of environments.
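The idea of a second modality resolving a visual ambiguity can be sketched as a toy fusion step. The `fuse` function and the scores below are purely illustrative assumptions, not the behavior of any real multimodal model.

```python
def fuse(vision_scores: dict, spoken_hint: str) -> str:
    """Pick the best label by boosting candidates the user mentions aloud."""
    scores = dict(vision_scores)
    for label in scores:
        # Boost any candidate label that appears in the spoken utterance.
        if label in spoken_hint.lower():
            scores[label] += 0.5
    return max(scores, key=scores.get)

# The camera alone is unsure whether the object is a mug or a bowl...
vision_scores = {"mug": 0.45, "bowl": 0.40}
# ...but the user's speech resolves the ambiguity.
best = fuse(vision_scores, "That's my coffee mug")
```

Real multimodal models fuse modalities inside the network rather than with a hand-written rule, but the sketch captures the payoff: a cue from one channel compensates for uncertainty in another.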
Multimodal models such as PaLM-E are specifically designed to help robots with visual tasks and object detection. Google has been using PaLM-E for basic tasks such as sorting blocks and fetching items. The results show how models can be trained to simultaneously address vision and language in robotics.
Prior to LLMs, troubleshooting robots meant sifting through logs and analyzing sensor data to pinpoint the origins of a failure. Once identified, the issue would be traced to the specific sections of code responsible for the undesired behavior. This foundational approach is effective, but it can be time-consuming and laborious.
LLMs have the ability to reason and solve problems. They’re designed to recognize patterns in data and make logical inferences. When integrated with robots, they can enhance the robot’s diagnostic abilities. This means that robots may, in some instances, find and diagnose issues on their own. As with all LLMs, the diagnostic capabilities are limited to what the robot has been trained on.
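A minimal sketch of what LLM-assisted diagnosis might look like: recent log lines are folded into a prompt and a model proposes a likely cause. The log lines, the `diagnose` helper, and the `fake_diagnose` stub are hypothetical; a real deployment would call an actual LLM and, per the caveats below, surface the hypothesis to a technician rather than act on it.

```python
# Illustrative robot logs showing a current spike before a fault.
LOGS = [
    "12:01:05 motor_left current=1.2A",
    "12:01:06 motor_left current=4.8A",
    "12:01:06 ERROR motor_left overcurrent, stopping",
]

def fake_diagnose(prompt: str) -> str:
    # Stub standing in for a real LLM call; a real model would reason
    # over the full log text in the prompt.
    if "overcurrent" in prompt:
        return "Likely cause: left drive motor stall or obstruction."
    return "No fault pattern recognized."

def diagnose(logs: list) -> str:
    """Summarize logs into a prompt and ask the model for a fault hypothesis."""
    prompt = "Suggest the most likely fault given these logs:\n" + "\n".join(logs)
    return fake_diagnose(prompt)

hypothesis = diagnose(LOGS)
```

The output is a hypothesis to review, not a verdict, which matches the role self-diagnosis should play alongside a technician.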
While these models are good at spotting patterns, they lack intuition. If the problem is complex or context is missing, the model may struggle to troubleshoot it. The models are also known to hallucinate, so a robot may misdiagnose an issue while sounding very confident. Self-diagnosis should augment a technician’s ongoing work rather than be the single source of truth.
While LLMs present an exciting step forward for robotics, the integration comes with challenges. Depending on how the model was trained, it may be biased. This is a common problem with models that has yet to be adequately addressed. Explainability is another issue to consider: the reasoning behind a model’s output can be opaque, making it hard to predict and catch mistakes. The model may even exhibit erratic behavior.
The AI landscape is changing quickly with new models coming out frequently. The fusion of LLMs with robotics signifies a major advancement which has the promise of bringing robots one step closer to everyday use.
While the benefits are vast, it’s important to understand the limitations and challenges these models face. With proper training and constraints, LLMs can change the way we interact with robots. Their potential to enable smoother, more human-like communication in real-world settings makes the effort worthwhile.
“All those moments will be lost in time, like tears in rain.” -Roy Batty, Blade Runner (1982)