[Update, November 27th, 2023: All Embodied AI scenarios mentioned in this article can now be accessed for free via the FalconCloud web app at https://falcon.duality.ai ]
The success of generative AI tools like ChatGPT, Perplexity and MidJourney has made these models widely accessible, even turning them into household names. However, despite their theorized potential, their current impact remains limited to manipulating bits on the internet. On the other hand, robots and smart systems operate in, and impact, our physical world, but their evolution is hitting significant roadblocks. Can advanced AI models be harnessed to make robots smarter, safer, and more natural collaborators?
At Duality, we believe this is not only possible, but can actually reduce the cost and speed up the development and deployment of autonomous robots. Since 2019, Duality has been collaborating with the industry’s leading robotics and machine learning teams, including those at DARPA, NASA-JPL, and Honeywell, to solve complex challenges facing smart systems. We have been developing digital twin simulation that enables comprehensive training and testing of autonomous systems and has been validated against real-world results, minimizing the Sim2Real gap. It is this work that has convinced us that digital twin simulation offers the most viable way forward for safely and responsibly bringing AI into our physical world and moving autonomous robotics past its current hurdles.
Building on the growing body of research in Embodied AI (EAI), which provides a foundational framework for thinking about this fascinating question, this paper presents a feasible, application-focused approach. In the following sections we discuss the current challenges in autonomous robotics, review recent AI developments, offer potential next steps in adopting advanced AI models for real-world robotics applications, and make a case for the critical role of simulation in the embodied AI workbench.
After $75 billion of investment and over 18 years of focused development, self-driving cars still remain a work in progress. Coding autonomy software is a perpetual game of “whack-a-mole”. As systems tentatively make their way out of labs and expand testing into the real world, the number of edge cases increases exponentially. Explicitly coding around each one leads to mounting system complexity, ballooning costs, and endless delays in deployment, even for the most capable robotics teams. And this is just for on-road autonomy within a handful of cities! Without a specific road to follow, off-road autonomy presents a whole new set of challenges. How do we build systems that are more resilient to the dynamic conditions of the real world and respond emergently to situations they haven’t been specifically programmed for?
Another common problem is that system intelligence is often brittle. For example, an autonomous drone developed to inspect utility poles in one geographic location has to be re-programmed and re-validated to detect new defects or to operate in a different location, even with zero changes to the drone’s hardware or sensors. Even small changes to a mission can cost upwards of $250k, and operationalizing them can take 6 months or more. What if, just like ChatGPT, a drone could instead be prompted to perform an inspection mission? The operator could simply instruct the drone to “Inspect the oil pipeline for 4 km starting at GPS location (70.25 -148.62) and look for corrosion defects similar to the attached images. Log GPS location and images for every defect found.” How do we get autonomous systems, like a prompt-driven inspection drone, to respond naturally and behave predictably over more generalized domains of operation?
Lastly, as expensive as it is to develop and test smart systems, deploying them, which hinges on validation, certification, and integration, is an order of magnitude more difficult. This stems from a multitude of factors, ranging from establishing predictability and safety in open domains to the inflexibility of robots in dealing with existing infrastructure or collaborating fluidly with humans and other systems. In most current situations, these burdens tip the ROI scale away from fully autonomous systems, resulting in a hard-to-manage hodgepodge of autonomous, semi-autonomous, and manual systems operating in a less than scalable fashion.
Thus the ultimate promise of robots and smart systems remains unfulfilled. Which brings us back to our main question: can we harness the recent advances in AI models, specifically generative AI, to make our machines intelligent and capable enough to overcome these problems?
AI research has been steadily advancing since the major breakthroughs in deep learning and computer vision at the turn of the century. Specifically, transformer-based large language models (LLMs), such as ChatGPT, PaLM, LLaMA and StarCoder, have emerged to take on general reasoning tasks such as copilots for code development, smart search, and much more. Diffusion models, such as Stable Diffusion, MidJourney, and DALL-E, bring multimodality and visual capabilities to the output of neural networks. Perhaps of equal importance is the ability of these models to process natural language and interact with users in an intuitive way.
There are some key commonalities among these generative AI models. Their astounding intelligence and feeling of emergence stem from extremely large neural networks and equally massive training datasets. For example, StarCoder’s base model has 15 billion parameters and was trained on a trillion tokens; it was further tuned for the Python language on an additional 35 billion tokens. Training, and even querying, these models requires high-performance computing and large amounts of memory. They are typically served by cloud providers and run in power-hungry, carefully cooled data centers. These requirements are perfectly fine for internet AI, but are less suited for distributed machine intelligence in inconsistent real-world conditions. Similarly, the massive datasets these models consume are generally only readily available on the internet.
This has led to the rapid adoption of these models for internet tasks, what we call bit manipulation: generating digital media, enhancing communication capabilities, creating code, and so on. Their scale and impact in the physical world, however, remain limited, given their inability to translate their operations from bits to atoms. In order to adopt internet AI models into the real world, and enable bit manipulators to become atom manipulators, we first need to build an understanding of embodied intelligence.
In their insightful cross-disciplinary collaboration spanning developmental psychology and computer science, Linda Smith and Michael Gasser arrive at the embodiment hypothesis based on the observation that “traditional theories of intelligence concentrated on symbolic reasoning, paying little attention to the body and to the ways intelligence is affected by and affects the physical world.” They further observe that “intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity.”
They lay out six key lessons for developing embodied intelligent agents, noting that their learning should be multimodal, incremental, physical, exploratory, social, and grounded in acquiring a language.
The Embodiment Hypothesis holds key evolutionary insights for enabling the kind of generally adaptable robots we have been pursuing. Generative AI advances some of the learning capabilities described above, but without a physical body, or embodiment, it lacks grounding and the ability to take actions. By embodying AI in a robotic system we can give it the levers for improving resilience in novel situations, generalizing learning and operating domains, and creating a more natural and contextual way to interact with smart systems – addressing the very problems that are holding smart systems back from realizing their full potential and impact.
This idea is by no means radical: promising recent research from Google Research, Microsoft Research, and Meta (Habitat), among others, provides supporting evidence. However, in order to bring Embodied AI from the lab to the real world, we must address some key challenges.
Adopting advanced AI models in robotics poses several challenges on both the model and the data side. Compared to merely manipulating bits, manipulating atoms carries much higher stakes and consequences: physical-world actions cannot easily be undone. The results can include flipped-over vehicles, or drones with valuable payloads crashing in populated areas. Models have to be predictable and systems have to be certified for safe operation. There is simply no room for hallucination!
On the model front some current key challenges include:
Equally important are the data challenges:
With this in mind, we must now ask whether it is indeed possible to take internet AI models, ground them in the principles of embodied learning, and overcome these significant challenges to adopt them as atom manipulators – or whether this will largely remain an area of research for the foreseeable future.
Duality’s work is based on a deep-rooted understanding of digital twin simulation and on customer engagements with some of the best robotics teams across diverse system types and applications. Through our experience of working with these teams, we have come to believe that adopting generative AI models in autonomy is both possible and immediately beneficial. It will, however, require a selective, tactical, application-centered focus and the development of new workflows and supporting tools. In the following two sections we expand on this tactical focus and on the new workflows that can support EAI development.
Research on generative AI and artificial general intelligence (AGI) continues at a frantic pace, with over a trillion dollars expected to be invested over the next several years in training processors alone. Even as this work continues, we believe that current LLMs and multimodal models are already sufficiently capable to go after the autonomy problems that have plagued smart systems. Duality’s approach is to take a narrower application focus that goes after well-defined tactical challenges.
By necessity, our approach is hybrid and builds on aspects of autonomy development that are well understood and widely used. For example, current methods for feedback control and sensor signal processing are efficient and predictable. Leveraging that part of the autonomy stack while augmenting it with generative AI models makes more sense to us than pursuing comprehensive but elusive end-to-end models or reinforcement learning for low-level policies.
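To make the distinction concrete, here is a generic sketch of the kind of well-understood, predictable building block the conventional autonomy stack already provides: a textbook PID feedback controller. The gains and the usage example are illustrative only and are not taken from any particular system.

class PID:
    """Textbook proportional-integral-derivative feedback controller."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# For example, in a 50 Hz altitude-hold loop (illustrative gains):
# altitude_pid = PID(kp=1.2, ki=0.05, kd=0.3)
# thrust_cmd = altitude_pid.update(target_alt, measured_alt, dt=0.02)

Components like this remain in charge of low-level behavior; the generative AI layers we discuss next sit above them.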
In particular, we have identified two areas for active development that we find promising:
Task Coding: Takes advantage of the code-generation capabilities of LLMs to perform general reasoning, taking higher-level mission goals framed as prompts and turning them into well-defined, coded tasks that can be run directly on the system. It is important to clarify that the LLM is not so much writing code as using language reasoning to turn a prompted mission into a sequence of pre-defined executable tasks. Task coding can be an effective way to make systems less brittle and to generalize their domains of operation while still remaining predictable, as in the infrastructure inspection drone example mentioned earlier. [Readers can experiment with Task Coding here. First-time users will need to create a free FalconCloud account]
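To illustrate the idea, below is a minimal sketch of how task coding might be structured. Everything in it is hypothetical: the task library, the system prompt, and the llm_complete callable are stand-ins, not Falcon’s actual interfaces. The key point is that the LLM never emits arbitrary code; it only sequences pre-defined, validated tasks, and the resulting plan can be reviewed and simulated before it ever reaches the drone.

# Hypothetical sketch of task coding; names and interfaces are illustrative only.
TASK_LIBRARY = """
fly_to(lat: float, lon: float, alt_m: float)       # navigate to a GPS waypoint
follow_asset(asset_id: str, distance_km: float)    # track a known linear asset, e.g. a pipeline
detect(defect: str, reference_images: list)        # zero-shot visual check against examples
log_finding(label: str)                            # record GPS fix and images for a finding
return_home()
"""

SYSTEM_PROMPT = f"""You are a mission planner for an inspection drone.
Respond ONLY with a sequence of calls to these pre-defined, validated tasks:
{TASK_LIBRARY}"""

def plan_mission(operator_prompt: str, llm_complete) -> str:
    """Turn an operator's natural-language prompt into a task sequence.

    `llm_complete` is any callable that sends (system, user) text to an LLM
    and returns its reply, for example a thin wrapper around a chat API.
    """
    return llm_complete(SYSTEM_PROMPT, operator_prompt)

# Example, using the pipeline-inspection prompt from earlier in this article:
# plan = plan_mission(
#     "Inspect the oil pipeline for 4 km starting at GPS location (70.25 -148.62) "
#     "and look for corrosion defects similar to the attached images. "
#     "Log GPS location and images for every defect found.",
#     llm_complete=my_llm,   # hypothetical LLM wrapper
# )
# The returned plan is reviewed, and ideally run in simulation, before it is
# dispatched to the real drone, so behavior stays predictable.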
Visual Reasoning: Takes advantage of vision-language models such as GroundingDINO, ViperGPT and VisProg. These allow the AI-enabled sensing capabilities of robots to go beyond narrowly trained neural networks for classification, object detection, and segmentation, and into foundation models and zero-shot detection in unseen environments. Furthermore, they enable language-based reasoning over individual sensor samples as well as over synchronized sequences of multi-sensor data over time. We believe that this enhanced sensing capability can make systems more resilient, especially to novel scenarios, and help them “think on their feet”. [Readers can experiment with Visual Reasoning here. First-time users will need to create a free FalconCloud account]
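As a taste of what zero-shot, language-driven detection looks like in practice, here is a minimal sketch using Hugging Face’s zero-shot object detection pipeline with an OWL-ViT checkpoint as a stand-in; GroundingDINO and similar open-vocabulary detectors can be used in the same spirit. The image path, labels, and score threshold are illustrative only.

from PIL import Image
from transformers import pipeline

# Open-vocabulary detector; the checkpoint here is OWL-ViT, used as a stand-in.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

frame = Image.open("inspection_frame.jpg")  # hypothetical frame from the drone's camera

# The text queries come straight from the mission prompt, so no retraining is
# needed when defect classes or the operating environment change.
detections = detector(
    frame,
    candidate_labels=["corrosion patch", "cracked insulator", "exposed wiring"],
)

for det in detections:
    if det["score"] > 0.3:  # illustrative confidence threshold
        print(det["label"], det["box"])

Because the queries are just text, the same detector can be pointed at new defect classes or new environments without collecting data and retraining a narrow model.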
This kind of tactical focus presents an approach that is carefully balanced for predictable system behavior given current AI capabilities. Rather than mirroring human psychological development, where babies are born with a minimal set of onboard sensing and motor functions, smart systems can be pre-loaded with a base set of well-understood and predictable capabilities, as well as language and communication skills, that they can scaffold from while still remaining adaptable and resilient. Rather than a baby learning to walk, this is akin to a college graduate onboarding into their first job. We also acknowledge that the best balance for predictable systems will be a moving target as machine intelligence continues its march towards AGI.
It is important to note that weaving advanced AI models into system capabilities in this way does require new workflows that take a more holistic view of the autonomy stack. Current approaches to autonomy development often break the autonomy problem into a hierarchy of functional layers: perception, planning, navigation, and controls. To help imbue predictability and desirable behavior into the overall system, conventional simulation and testing are commonly applied separately to each individual layer. This approach, however, does not lend itself to EAI.
By its very nature, EAI is closed-loop and requires a full closed-loop simulator to exercise all layers of the autonomy stack in close coordination. It’s easy to see why when we consider even basic motor skills, such as the ones exercised when we reach for an apple that someone hands to us. This requires continuous, closed-loop sensing and adjustment of our arm and hand position – a surprisingly hard and dynamic problem that we perform effortlessly. Since embodied intelligence continuously adjusts tasks, and even the sequencing of tasks, in order to achieve the overall mission, it is similarly predicated on a tight sense-plan-act feedback loop.
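The sketch below captures what this means for simulation: a single loop in which sensing, planning, and control run together against the simulator, with perception executing every tick and higher-level (possibly LLM-driven) replanning happening far less frequently. All of the interfaces here (sim, policy, and their methods) are hypothetical placeholders rather than Falcon APIs; the structure of the loop is the point.

def run_episode(sim, policy, mission, rate_hz=20.0, replan_every=40):
    """Drive one simulated mission with the full autonomy stack in the loop."""
    dt = 1.0 / rate_hz
    plan = policy.plan(mission, sim.observe())      # infrequent; may call an LLM
    for tick in range(100_000):
        obs = sim.observe()                         # sensor streams: camera, IMU, GPS, ...
        state = policy.perceive(obs)                # high-rate visual reasoning / state estimation
        if tick > 0 and tick % replan_every == 0:
            plan = policy.plan(mission, state)      # adjust the task sequence as the world changes
        cmd = policy.act(plan, state)               # conventional low-level control stays in charge
        sim.step(cmd, dt)
        if sim.mission_complete() or sim.fault_detected():
            break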
To work with EAI, development teams will need the ability to quickly adopt, interpret, and tune the AI models that best serve their specific needs. Current structures for accessing internet AI models may not be well suited for this. As an example, LLMs may only be called infrequently, for task coding and refining mission behaviors in a simulated and adequately networked environment, while visual reasoning may have to be performed at a much higher frequency, directly on raw sensor streams, out in the field where connectivity could be poor or nonexistent. Understanding how intelligence is spread across the cloud and the edge will be critical, and will be tied intrinsically to network connectivity and to the embedded compute capabilities of edge devices.
Additionally, as robots move away from highly programmed, pre-defined behaviors towards more generalized intelligence, the role of a human supervisor or operator becomes more important. This can take the form of training and onboarding smart systems for specific jobs and operating environments, as well as double-checking their intended behaviors when they are given new or stretched responsibilities – not so different from the manager and direct-report structures of our human workplaces.
As we just discussed, holistic, closed-loop simulation is critical to integrating and operating EAI systems. Luis Bermudez, Intel’s Machine Learning Product Manager, captures this very succinctly: “Datasets have been a key driver of progress in Internet AI. With Embodied AI, simulators will assume the role played previously by datasets.” He further points out that “Datasets consist of 3D scans of an environment. These datasets represent 3D scenes of a house, or a lab, a room, or the outside world. These 3D scans however, do not let an agent ‘walk’ through it or interact with it. Simulators allow the embodied agent to physically interact with the environment and walk through it”.
With the above in mind, it is no stretch to say that digital twin simulation can provide a workbench perfectly tailored to EAI development. Let’s look at how digital twin simulation supports EAI work.
First and foremost, a digital twin simulator is only useful if the data it generates is predictive of the real world; the gap between simulated and real-world performance is often referred to as the “Sim2Real” gap. In our experience, digital twin simulation has a demonstrable track record of successfully closing the Sim2Real gap for autonomous robotics, across diverse environments and system types.
Additionally, a digital twin simulation workflow can be designed to suit the dynamic learning conditions at the heart of the embodiment hypothesis.
In addition to aligning our digital twin workflow with the embodiment hypothesis, we can also structure it to directly address the key challenges in adopting EAI enumerated above. In the tables below we identify how high-fidelity digital twin simulation capabilities match up against these challenges.
With digital twin simulation, we can quickly start interacting with generative AI models and exploring techniques for embodied intelligence. While the video presented below is very early work, it gives a sense of what this workflow can look like.
This exploratory example scenario, made in Falcon, our digital twin simulator, combines task coding and visual reasoning to carry out a prompted infrastructure inspection mission. The system developer or operator can immediately see the code generated from the prompt, the resulting maneuvers, and even the navigation path of the drone. This kind of integrated development environment will be crucial for improving the interpretability of complex AI models and ensuring predictability in the operation of smart systems. [Readers can experiment with this simulation here. First-time users will need to create a free FalconCloud account]
There are, of course, complexities and challenges that still need to be addressed. As a start, we are already looking into AI model reduction and cloud-edge factoring of models to optimize for latency and data bandwidth. These are necessary steps for running complex, memory-intensive, and compute-heavy AI models on embedded systems, out in the world, under real field conditions. We are committed to advancing these solutions with our customers and partners to help realize the promising opportunities that EAI research has opened up.
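As one example of what model reduction can look like, the sketch below applies PyTorch’s dynamic post-training quantization to a small stand-in network, storing linear-layer weights as 8-bit integers to shrink the memory footprint for an edge device. The toy model and size helper are illustrative; real perception or language models require more careful, accuracy-aware reduction.

import io
import torch
import torch.nn as nn

# A small stand-in for a much larger onboard model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 8))

# Dynamic post-training quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m: nn.Module) -> float:
    """Rough serialized size in megabytes, for comparing footprints."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_mb(model):.2f} MB, int8: {serialized_mb(quantized):.2f} MB")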
We believe that digital twin simulation can accelerate the adoption of EAI and ensure that systems are developed and deployed in a cost-effective and timely way without compromising safety, predictability, or human oversight. Of course, as with any emerging technology, theoretical arguments can only take us so far. This is why we have been working on EAI workflows within Falcon, our digital twin platform and simulator, and creating scenarios that will allow anyone to explore EAI for themselves.
Our vision is to make this technology accessible to everyone: research teams, established market leaders, SMBs, startups, and the academic community. To that end, we’re opening these scenarios to anyone, completely free of charge, and making them accessible by pixel streaming from any web browser. We invite you to see for yourself how Falcon and digital twin simulation are opening the door to EAI exploration and development, and we encourage everyone to build their own understanding and become an active force in determining how AI makes its way into the real world, and what it will do here.
To explore Falcon’s Embodied AI scenarios in your browser, simply create a free FalconCloud account: https://falcon.duality.ai