[Update, November 27th, 2023: All Embodied AI scenarios mentioned in this article can now be accessed for free via the FalconCloud web app at https://falcon.duality.ai ]
The success of generative AI tools like ChatGPT, Perplexity and MidJourney has made these models widely accessible, even turning them into household names. However, despite their theorized potential, their current impact remains limited to manipulating bits on the internet. On the other hand, robots and smart systems operate in, and impact, our physical world, but their evolution is hitting significant roadblocks. Can advanced AI models be harnessed to make robots smarter, safer, and more natural collaborators?
At Duality, we believe this is not only possible, but can actually reduce the cost and speed up the development and deployment of autonomous robots. Since 2019, Duality has been collaborating with the industry’s leading robotics and machine learning teams, including those at DARPA, NASA-JPL, and Honeywell, to solve complex challenges facing smart systems. We have been developing digital twin simulation that enables comprehensive training and testing of autonomous systems and has been validated against real-world results, minimizing the Sim2Real gap. It is this work that has convinced us that digital twin simulation offers the most viable way forward for safely and responsibly bringing AI into our physical world and moving autonomous robotics past its current hurdles.
Building on the growing body of research in Embodied AI (EAI), which provides a foundational framework for thinking about this fascinating question, this paper presents a feasible, application-focused approach. In the following sections we discuss the current challenges in autonomous robotics, review recent AI developments, offer potential next steps in adopting advanced AI models for real-world robotics applications, and make a case for the critical role of simulation in the embodied AI workbench.
After $75 billion of investment and over 18 years of focused development, self-driving cars still remain a work in progress. Coding autonomy software is a perpetual game of “whack-a-mole”. As systems tentatively make their way out of labs and expand testing into the real world, the number of edge cases increases exponentially. Explicitly coding around each one leads to mounting system complexity, ballooning costs, and endless delays in deployment, even for the most capable robotics teams. And this is just for on-road autonomy within a handful of cities! Without a specific road to follow, off-road autonomy presents a whole new set of challenges. How do we build systems that are more resilient to the dynamic conditions of the real world and respond emergently to situations they haven’t been specifically programmed for?
Another common problem is that system intelligence is often brittle. For example, an autonomous drone developed to inspect utility poles in one geographic location has to be re-programmed and re-validated to detect new defects or to operate in a different location, even with zero changes to the drone’s hardware or sensors. Even small changes to a mission can cost upwards of $250k, and operationalizing them can take 6 months or more. What if, just like ChatGPT, a drone could instead be prompted to perform an inspection mission? The operator could simply instruct the drone to “Inspect the oil pipeline for 4 km starting at GPS location (70.25 -148.62) and look for corrosion defects similar to the attached images. Log GPS location and images for every defect found.” How do we get autonomous systems, like a prompt-driven inspection drone, to respond naturally and behave predictably over more generalized domains of operation?
Lastly, as expensive as it is to develop and test smart systems, deploying them, which hinges on validation, certification, and integration, is an order of magnitude more difficult. This stems from a multitude of factors, ranging from establishing predictability and safety in open domains to the inflexibility of robots in dealing with existing infrastructure or collaborating fluidly with humans and other systems. In most current situations, these burdens tip the ROI scale away from fully autonomous systems, resulting in a hard-to-manage hodgepodge of autonomous, semi-autonomous, and manual systems operating in a less than scalable fashion.
Thus the ultimate promise of robots and smart systems remains unfulfilled. Which brings us back to our main question: can we harness the recent advances in AI models, specifically generative AI, to make our machines intelligent and capable enough to overcome these problems?
AI research has been steadily advancing since the major breakthroughs in deep learning and computer vision at the turn of the century. Specifically, transformer-based large language models (LLMs), such as ChatGPT, PaLM, LLaMA and StarCoder, have emerged to take on general reasoning tasks such as copilots for code development, smart search, and much more. Diffusion models, such as Stable Diffusion, MidJourney, and DALL-E, bring multimodality and visual capabilities to the output of neural networks. Perhaps of equal importance is the ability of these models to process natural language and interact with users in an intuitive way.
There are some key commonalities among these generative AI models. Their astounding intelligence and feeling of emergence stem from extremely large neural networks and equally massive training datasets. For example, StarCoder’s base model has 15 billion parameters and was trained on a trillion tokens; it was further tuned for the Python language on an additional 35 billion tokens. Training, and even querying, these models requires high-performance computing and large amounts of memory. They are typically served by cloud providers and run in power-hungry, carefully cooled data centers. These requirements are perfectly fine for internet AI, but are less suited for distributed machine intelligence in inconsistent real-world conditions. Similarly, the massive datasets these models consume are generally only readily available on the internet.
This has led to the rapid adoption of these models for internet tasks, what we call bit manipulation: generating digital media, enhancing communication capabilities, creating code, and so on. Their scale and impact in the physical world, however, remain limited, given their inability to translate their operations from bits to atoms. In order to adopt internet AI models into the real world, and enable bit manipulators to become atom manipulators, we first need to build an understanding of embodied intelligence.
In their insightful cross-disciplinary collaboration spanning developmental psychology and computer science, Linda Smith and Michael Gasser arrive at the embodiment hypothesis based on the observation that “traditional theories of intelligence concentrated on symbolic reasoning, paying little attention to the body and to the ways intelligence is affected by and affects the physical world.” They further observe that “intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity.”
They lay out six key lessons for developing embodied intelligent agents, noting that their learning should be multimodal, incremental, physical, exploratory, social, and grounded in acquiring a language.
The Embodiment Hypothesis holds key evolutionary insights for enabling the kind of generally adaptable robots we have been pursuing. Generative AI advances some of the learning capabilities described above, but without a physical body, or embodiment, it lacks grounding and the ability to take actions. By embodying AI in a robotic system we can give it the levers for improving resilience in novel situations, generalizing learning and operating domains, and creating a more natural and contextual way to interact with smart systems – addressing the very problems that are holding smart systems back from realizing their full potential and impact.
This idea is by no means radical: promising recent research from Google Research, Microsoft Research, and Meta (Habitat), among others, provides supporting evidence. However, in order to bring Embodied AI from the lab to the real world, we must address some key challenges.
Adopting advanced AI models in robotics poses several challenges on both the model and the data side. Compared to merely manipulating bits, manipulating atoms carries much higher stakes and consequences: physical-world actions cannot easily be undone. The results can include flipped-over vehicles, or drones with valuable payloads crashing in populated areas. Models have to be predictable and systems have to be certified for safe operation. There is simply no room for hallucination!
On the model front some current key challenges include:
Equally important are the data challenges:
With this in mind, we must now ask whether it is indeed possible to take internet AI models, ground them in the principles of embodied learning, and overcome these significant challenges to adopt them as atom manipulators – or whether this will largely remain an area of research for the foreseeable future.
Duality’s work is based on a deep-rooted understanding of digital twin simulation and on customer engagements with some of the best robotics teams across diverse system types and applications. Through our experience of working with these teams, we have come to believe that adopting generative AI models in autonomy is both possible and immediately beneficial. It will, however, require a selective, tactical, application-centered focus and the development of new workflows and supporting tools. In the following two sections we expand on this tactical focus and on the new workflows that can support EAI development.
Research on generative AI and artificial general intelligence (AGI) continues at a frantic pace, with over a trillion dollars expected to be invested over the next several years in training processors alone. Even as this work continues, we believe that current LLMs and multimodal models are already sufficiently capable to go after the autonomy problems that have plagued smart systems. Duality’s approach is to take a narrower application focus that goes after well-defined tactical challenges.
By necessity, our approach is hybrid and builds on aspects of autonomy development that are well understood and widely used. For example, current methods for feedback control and sensor signal processing are efficient and predictable. Leveraging that part of the autonomy stack while augmenting it with generative AI models makes more sense to us than pursuing comprehensive but elusive end-to-end models or reinforcement learning for low-level policies.
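To make the distinction concrete, here is a generic sketch of the kind of well-understood, predictable building block the conventional autonomy stack already provides: a textbook PID feedback controller. The gains and the usage example are illustrative only and are not taken from any particular system.

class PID:
    """Textbook proportional-integral-derivative feedback controller."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# For example, in a 50 Hz altitude-hold loop (illustrative gains):
# altitude_pid = PID(kp=1.2, ki=0.05, kd=0.3)
# thrust_cmd = altitude_pid.update(target_alt, measured_alt, dt=0.02)

Components like this remain in charge of low-level behavior; the generative AI layers we discuss next sit above them.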
In particular, we have identified two areas for active development that we find promising:
Task Coding: Takes advantage of the code-generation capabilities of LLMs to perform general reasoning, taking higher-level mission goals framed as prompts and turning them into well-defined, coded tasks that can be run directly on the system. It is important to clarify that the LLM is not so much writing code as using language reasoning to turn a prompted mission into a sequence of pre-defined executable tasks. Task coding can be an effective way to make systems less brittle and to generalize their domains of operation while still remaining predictable, as in the infrastructure inspection drone example mentioned earlier. [Readers can experiment with Task Coding here. First-time users will need to create a free FalconCloud account]
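To illustrate the idea, below is a minimal sketch of how task coding might be structured. Everything in it is hypothetical: the task library, the system prompt, and the llm_complete callable are stand-ins, not Falcon’s actual interfaces. The key point is that the LLM never emits arbitrary code; it only sequences pre-defined, validated tasks, and the resulting plan can be reviewed and simulated before it ever reaches the drone.

# Hypothetical sketch of task coding; names and interfaces are illustrative only.
TASK_LIBRARY = """
fly_to(lat: float, lon: float, alt_m: float)       # navigate to a GPS waypoint
follow_asset(asset_id: str, distance_km: float)    # track a known linear asset, e.g. a pipeline
detect(defect: str, reference_images: list)        # zero-shot visual check against examples
log_finding(label: str)                            # record GPS fix and images for a finding
return_home()
"""

SYSTEM_PROMPT = f"""You are a mission planner for an inspection drone.
Respond ONLY with a sequence of calls to these pre-defined, validated tasks:
{TASK_LIBRARY}"""

def plan_mission(operator_prompt: str, llm_complete) -> str:
    """Turn an operator's natural-language prompt into a task sequence.

    `llm_complete` is any callable that sends (system, user) text to an LLM
    and returns its reply, for example a thin wrapper around a chat API.
    """
    return llm_complete(SYSTEM_PROMPT, operator_prompt)

# Example, using the pipeline-inspection prompt from earlier in this article:
# plan = plan_mission(
#     "Inspect the oil pipeline for 4 km starting at GPS location (70.25 -148.62) "
#     "and look for corrosion defects similar to the attached images. "
#     "Log GPS location and images for every defect found.",
#     llm_complete=my_llm,   # hypothetical LLM wrapper
# )
# The returned plan is reviewed, and ideally run in simulation, before it is
# dispatched to the real drone, so behavior stays predictable.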
Visual Reasoning: Takes advantage of vision-language models such as GroundingDINO, ViperGPT and VisProg. These allow the AI-enabled sensing capabilities of robots to go beyond narrowly trained neural networks for classification, object detection, and segmentation, and into foundation models and zero-shot detection in unseen environments. Furthermore, they enable language-based reasoning over individual sensor samples as well as over synchronized sequences of multi-sensor data over time. We believe that this enhanced sensing capability can make systems more resilient, especially to novel scenarios, and help them “think on their feet”. [Readers can experiment with Visual Reasoning here. First-time users will need to create a free FalconCloud account]
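As a taste of what zero-shot, language-driven detection looks like in practice, here is a minimal sketch using Hugging Face’s zero-shot object detection pipeline with an OWL-ViT checkpoint as a stand-in; GroundingDINO and similar open-vocabulary detectors can be used in the same spirit. The image path, labels, and score threshold are illustrative only.

from PIL import Image
from transformers import pipeline

# Open-vocabulary detector; the checkpoint here is OWL-ViT, used as a stand-in.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

frame = Image.open("inspection_frame.jpg")  # hypothetical frame from the drone's camera

# The text queries come straight from the mission prompt, so no retraining is
# needed when defect classes or the operating environment change.
detections = detector(
    frame,
    candidate_labels=["corrosion patch", "cracked insulator", "exposed wiring"],
)

for det in detections:
    if det["score"] > 0.3:  # illustrative confidence threshold
        print(det["label"], det["box"])

Because the queries are just text, the same detector can be pointed at new defect classes or new environments without collecting data and retraining a narrow model.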
This kind of tactical focus presents an approach that is carefully balanced for predictable system behavior given current AI capabilities. Rather than mirroring human psychological development, where babies are born with a minimal set of onboard sensing and motor functions, smart systems can be pre-loaded with a base set of well-understood and predictable capabilities, as well as language and communication skills, that they can scaffold from while still remaining adaptable and resilient. Rather than a baby learning to walk, this is akin to a college graduate onboarding into their first job. We also acknowledge that the best balance for predictable systems will be a moving target as machine intelligence continues its march towards AGI.
It is important to note that weaving advanced AI models into system capabilities in this way does require new workflows that take a more holistic view of the autonomy stack. Current approaches to autonomy development often break the autonomy problem into a hierarchy of functional layers: perception, planning, navigation, and controls. To help imbue predictability and desirable behavior into the overall system, conventional simulation and testing are commonly applied separately to each individual layer. This approach, however, does not lend itself to EAI.
By its very nature, EAI is closed-loop and requires a full closed-loop simulator to exercise all layers of the autonomy stack in close coordination. It’s easy to see why when we consider even basic motor skills, such as the ones exercised when we reach for an apple that someone hands to us. This requires continuous, closed-loop sensing and adjustment of our arm and hand position – a surprisingly hard and dynamic problem that we perform effortlessly. Since embodied intelligence continuously adjusts tasks, and even the sequencing of tasks, in order to achieve the overall mission, it is similarly predicated on a tight sense-plan-act feedback loop.
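The sketch below captures what this means for simulation: a single loop in which sensing, planning, and control run together against the simulator, with perception executing every tick and higher-level (possibly LLM-driven) replanning happening far less frequently. All of the interfaces here (sim, policy, and their methods) are hypothetical placeholders rather than Falcon APIs; the structure of the loop is the point.

def run_episode(sim, policy, mission, rate_hz=20.0, replan_every=40):
    """Drive one simulated mission with the full autonomy stack in the loop."""
    dt = 1.0 / rate_hz
    plan = policy.plan(mission, sim.observe())      # infrequent; may call an LLM
    for tick in range(100_000):
        obs = sim.observe()                         # sensor streams: camera, IMU, GPS, ...
        state = policy.perceive(obs)                # high-rate visual reasoning / state estimation
        if tick > 0 and tick % replan_every == 0:
            plan = policy.plan(mission, state)      # adjust the task sequence as the world changes
        cmd = policy.act(plan, state)               # conventional low-level control stays in charge
        sim.step(cmd, dt)
        if sim.mission_complete() or sim.fault_detected():
            break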
To work with EAI, development teams will need the ability to quickly adopt, interpret, and tune the AI models that best serve their specific needs. Current structures for accessing internet AI models may not be well suited for this. As an example, LLMs may only be called infrequently, for task coding and refining mission behaviors in a simulated and adequately networked environment, while visual reasoning may have to be performed at a much higher frequency, directly on raw sensor streams, out in the field where connectivity could be poor or nonexistent. Understanding how intelligence is spread across the cloud and the edge will be critical, and will be tied intrinsically to network connectivity and to the embedded compute capabilities of edge devices.
Additionally, as robots move away from highly programmed, pre-defined behaviors towards more generalized intelligence, the role of a human supervisor or operator becomes more important. This can take the form of training and onboarding smart systems for specific jobs and operating environments, as well as double-checking their intended behaviors when they are given new or stretched responsibilities – not so different from the manager and direct-report structures of our human workplaces.
As we just discussed, holistic, closed-loop simulation is critical to integrating and operating EAI systems. Luis Bermudez, Intel’s Machine Learning Product Manager, captures this very succinctly: “Datasets have been a key driver of progress in Internet AI. With Embodied AI, simulators will assume the role played previously by datasets.” He further points out that “Datasets consist of 3D scans of an environment. These datasets represent 3D scenes of a house, or a lab, a room, or the outside world. These 3D scans however, do not let an agent ‘walk’ through it or interact with it. Simulators allow the embodied agent to physically interact with the environment and walk through it”.
With the above in mind, it is no stretch to say that digital twin simulation can provide a workbench perfectly tailored to EAI development. Let’s look at how digital twin simulation supports EAI work.
First and foremost, a digital twin simulator is only useful if the data it generates is predictive of the real world; the gap between simulated and real-world performance is often referred to as the “Sim2Real” gap. In our experience, digital twin simulation has a demonstrable track record of successfully closing the Sim2Real gap for autonomous robotics, across diverse environments and system types.
Additionally, a digital twin simulation workflow can be designed to suit the dynamic learning conditions at the heart of the embodiment hypothesis.
In addition to aligning our digital twin workflow with the embodiment hypothesis, we can also structure it to directly address the key challenges in adopting EAI enumerated above. In the tables below we identify how high-fidelity digital twin simulation capabilities match up against these challenges.
With digital twin simulation, we can quickly start interacting with generative AI models and exploring techniques for embodied intelligence. While the video presented below is very early work, it gives a sense of what this workflow can look like.
This exploratory example scenario, made in Falcon, our digital twin simulator, combines task coding and visual reasoning to carry out a prompted infrastructure inspection mission. The system developer or operator can immediately see the code generated from the prompt, the resulting maneuvers, and even the navigation path of the drone. This kind of integrated development environment will be crucial for improving the interpretability of complex AI models and ensuring predictability in the operation of smart systems. [Readers can experiment with this simulation here. First-time users will need to create a free FalconCloud account]
There are, of course, complexities and challenges that still need to be addressed. As a start, we are already looking into AI model reduction and cloud-edge factoring of models to optimize for latency and data bandwidth. These are necessary steps for running complex, memory-intensive, and compute-heavy AI models on embedded systems, out in the world, under real field conditions. We are committed to advancing these solutions with our customers and partners to help realize the promising opportunities that EAI research has opened up.
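As one example of what model reduction can look like, the sketch below applies PyTorch’s dynamic post-training quantization to a small stand-in network, storing linear-layer weights as 8-bit integers to shrink the memory footprint for an edge device. The toy model and size helper are illustrative; real perception or language models require more careful, accuracy-aware reduction.

import io
import torch
import torch.nn as nn

# A small stand-in for a much larger onboard model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 8))

# Dynamic post-training quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m: nn.Module) -> float:
    """Rough serialized size in megabytes, for comparing footprints."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_mb(model):.2f} MB, int8: {serialized_mb(quantized):.2f} MB")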
We believe that digital twin simulation can accelerate the adoption of EAI and ensure that systems are developed and deployed in a cost-effective and timely way without compromising safety, predictability, or human oversight. Of course, as with any emerging technology, theoretical arguments can only take us so far. This is why we have been working on EAI workflows within Falcon, our digital twin platform and simulator, and creating scenarios that will allow anyone to explore EAI for themselves.
Our vision is to make this technology accessible to everyone: research teams, established market leaders, SMBs, startups, and the academic community. To that end, we’re opening these scenarios to anyone, completely free of charge, and making them accessible by pixel streaming from any web browser. We invite you to see for yourself how Falcon and digital twin simulation are opening the door to EAI exploration and development, and we encourage everyone to build their own understanding and become an active force in determining how AI makes its way into the real world, and what it will do here.
To explore Falcon’s Embodied AI scenarios in your browser, simply create a free FalconCloud account: https://falcon.duality.ai