The video examples generated by SORA, recently showcased by OpenAI, are varied, impressive in their visual fidelity, and temporally coherent [1]. An astounding step forward in the quality of generative AI video output! Particularly interesting is the physical grounding evident in some of the clips, such as the jeep traveling along a hillside road kicking up dirt, or the dogs with realistic body articulation and fur simulation. The applications in entertainment and content creation are obvious and direct. However, OpenAI’s technical report on SORA is titled “Video generation models as world simulators” [2][3]. This leads us to ask: can SORA’s output be used to train other AI perception models?
Video Example 1. Generated by SORA, this video of an SUV navigating a dirt road exhibits photorealism and plausible physics
To answer this question, we have to clearly define the characteristics of synthetic data needed for AI model training, and contrast them with the kind of data that models like SORA can provide. Let’s start with what we know about SORA.
SORA is a diffusion transformer model. In very simple terms, it is a hybrid of the diffusion models that generate images from text prompts, such as Stable Diffusion and DALL-E, and the attention-enabled transformers at the heart of modern large language models (LLMs), such as ChatGPT and Llama [4][5].
Vital to our question is OpenAI’s belief that the massive amount of training data and the scale of SORA’s model lead to emergent simulation capabilities that, per the report [2], include 3D consistency, long-range coherence and object permanence, interaction with the world, and simulation of digital worlds.
These characteristics are typically associated with physically based simulators — so it is nothing short of astounding that SORA’s generative model appears to have “learned” complex physical laws and their situational application in a data-driven manner!
It’s important to note the caveats that OpenAI acknowledges on the SORA product page:
“The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.
The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”
The research report further underscores:
“Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model — such as incoherencies that develop in long duration samples or spontaneous appearances of objects — in our landing page.”
Video Example 2. "Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering." [Source: OpenAI]
Video Example 3. "Sora sometimes creates physically implausible motion." [Source: OpenAI]
While the question of using SORA videos for training other AI models is not domain specific, for the purposes of our speculation, we will focus on synthetic sensor data used in embodied AI and robotics applications. This is the focus of Duality’s work with our customers across defense and commercial teams. Our current approach, which often includes generating synthetic data for AI model training, leverages physically based digital twin simulation with Falcon [6].
Video Example 4. Generating synthetic data via physically based, deterministic digital twin simulation in Falcon. Digital twins of operational systems, equipped with virtual sensors (e.g. RGB camera, Lidar, GPS, and more) navigate customizable, photorealistic environments to generate any needed data.
From a simulation perspective, there are three key areas that need consideration:
While SORA’s cause-effect weakness is certainly relevant to dynamics, it is important to reiterate that photoreal rendering is also a physical simulation of electromagnetic waves and photons, with a similar dependence on cause and effect. For example, moving an object in front of a light source should cast an appropriate shadow. In a more complex instance, dimming a particular light within a room should correspondingly reduce its contribution to surfaces, with a recalculation of each surface's bidirectional reflectance distribution function (BRDF) response.
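To make the rendering side concrete, here is a minimal sketch (in Python, purely illustrative) of the causal guarantee a physically based renderer provides by construction: a diffuse (Lambertian) surface's response scales linearly with the intensity of the light striking it, so dimming the light necessarily dims its contribution.

```python
import numpy as np

def lambertian_contribution(albedo, normal, light_dir, light_intensity):
    """Diffuse (Lambertian) BRDF response: reflected radiance scales with
    the light's intensity and the cosine of the incidence angle."""
    cos_theta = max(float(np.dot(normal, light_dir)), 0.0)
    return (albedo / np.pi) * light_intensity * cos_theta

normal = np.array([0.0, 0.0, 1.0])     # surface facing up
light_dir = np.array([0.0, 0.0, 1.0])  # light directly overhead

full = lambertian_contribution(0.8, normal, light_dir, 10.0)
dimmed = lambertian_contribution(0.8, normal, light_dir, 5.0)
print(full, dimmed)  # halving the intensity exactly halves the contribution
```

A generative video model must instead infer this kind of causal consistency from data, which is exactly where the failure modes quoted above appear.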
Language can get pretty squishy when talking about modeling and simulation, especially since cognitive science, mathematics, physics, computer graphics, and AI each have their own subtle nuances and assumptions. To clearly frame our fundamental questions, let’s zoom out first and establish: what precisely is a model for our purposes, and what ground truth can we use to objectively compare disparate models?
Based on observations of a real-world phenomenon, we can create a model that can predict (based on input specifications) what will happen in reality. These specifications can take the form of a set of initial conditions for a physically based simulator, or be presented as a natural language prompt, as done with generative AI.
Hypothetically, in the limit case, one could make a model that is an exact replica of the real world, down to the sub-atomic level — but such a model would provide no benefit over reality. The real world generally has a lot of extraneous information and dimensions that must be abstracted so that the model can predict efficiently. This introduces the necessity of a domain over which the model is considered valid.
Finally, the predictive value of a model can be ascertained by looking at the delta between the actual data of interest produced from the real world, i.e. ground truth, and the predicted or synthetic data produced by our model given input conditions that are matched to reality over the domain of interest.
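As a trivial, hypothetical example of this delta, consider comparing a model's predicted positions of a falling object against measured ground truth and summarizing the gap with a root-mean-square error (all numbers below are made up for illustration):

```python
import numpy as np

# Measured positions of a falling object at fixed timesteps (ground truth)
ground_truth = np.array([0.0, 4.9, 19.6, 44.1])   # meters
# The model's predictions for the same inputs over the same domain
predicted    = np.array([0.0, 5.1, 19.2, 45.0])   # meters

rmse = np.sqrt(np.mean((predicted - ground_truth) ** 2))
print(f"Predictive error (RMSE): {rmse:.3f} m")
```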
With that grounding, we can frame our fundamental questions: what is the quality of a model’s predictions, and how efficiently can those predictions be produced?
“What gets measured, gets improved.”
- Peter Drucker
Keeping the above quote in mind, the quality of model prediction can be measured in three ways over the domain of interest: consistency, precision, and explanative value.
Physical phenomena are objective, deterministic, and generally continuous. Precise inputs yield corresponding outputs rather than a probabilistic distribution, which makes measurement straightforward. But how do we quantitatively measure output from generative AI models, such as SORA, against physical ground truth to establish predictive quality?
For AI models, accuracy of model output in relation to ground truth test data (held out from training) has generally been used as a primary measure of predictive value. However, evaluating generative models, especially diffusers, is very tricky since the input is specified as text, with its inherent ambiguity and symbolic abstraction [9]. Further, the output is generated as an image or video without any introspectable and explicit model structure that can be instrumented to measure physical properties such as mass, volume, velocity, acceleration, pressure, temperature, etc. Generative AI models are also subjective and do not have a fixed coordinate frame. For example, precise measurement of the distance between two objects in a diffusion image is not possible when the virtual camera position (the model’s locus of observation) itself is uncertain.
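To illustrate how limited today's proxy metrics are, here is a minimal sketch (assuming the torchmetrics library and the openai/clip-vit-base-patch16 checkpoint are available) that scores prompt-to-frame alignment with CLIP, one of the evaluation approaches discussed in [9]. Note what it measures: semantic agreement with the text, not physical correctness.

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Stand-in for sampled video frames: a batch of (N, 3, H, W) uint8 images.
frames = torch.randint(0, 255, (8, 3, 224, 224), dtype=torch.uint8)
prompts = ["an SUV driving along a dirt hillside road"] * 8

score = metric(frames, prompts)
print(f"CLIP score (semantic alignment only): {score.item():.2f}")
```

A clip could score highly here while still violating conservation of momentum in every frame, which is precisely the measurement gap described above.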
Recent research attempts to solve this measurement problem by recreating the output of a generative AI model in explicit and measurable form [10]. It is an intriguing direction and could ultimately intersect with physically based simulation approaches, but a true ground truth baseline may remain elusive.
Putting aside the inherent difficulty in quantitative measurement of generative AI model output, let us turn our attention to the fundamental capabilities of generative AI as a world simulator.
Given that AI models are based on neural networks, at first blush it would seem that AI broadly, and generative AI in particular, maps closely to what cognitive science refers to as mental models [11]. And certainly there are some obvious similarities, such as the ability to generalize and extrapolate; bias stemming from learning data or experience; and the ability to generate out-of-domain answers even if they have low predictive value.
However, it is important to keep in mind that even the neural network representation is ultimately a mathematical model and the parallels to cognitive science shouldn’t be taken too literally.
When AI is viewed as mathematical modeling, a number of avenues open up for consistent and precise physical modeling. Areas of active research include run-of-the-mill domain-specific tuning; reward functions that capture physical constraints; hybrid models that combine AI with physically based models; and augmenting model learning data and state with additional dimensions [12][13]. We must therefore conclude that AI models, as a subset of mathematical models, are absolutely capable of high quality physical modeling, at least in terms of consistency and precision.
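As one hedged sketch of the hybrid direction mentioned above (in the spirit of the physics-based deep learning surveyed in [12], though not any specific method from it), the snippet below adds a physics penalty to a standard data loss, pushing a small network toward consistency with free-fall kinematics. Everything here is illustrative.

```python
import torch
import torch.nn as nn

# Tiny network mapping time t -> height y(t)
model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
g = 9.81  # gravitational acceleration, m/s^2

def physics_residual(t):
    """Penalize deviation from the free-fall law y''(t) = -g,
    using autograd to take derivatives of the network's output."""
    t = t.requires_grad_(True)
    y = model(t)
    dy = torch.autograd.grad(y.sum(), t, create_graph=True)[0]
    d2y = torch.autograd.grad(dy.sum(), t, create_graph=True)[0]
    return ((d2y + g) ** 2).mean()

# Combined objective: fit the observed data AND respect the physics.
t_obs = torch.rand(32, 1)
y_obs = -0.5 * g * t_obs ** 2          # noiseless observations for the sketch
data_loss = ((model(t_obs) - y_obs) ** 2).mean()
loss = data_loss + 0.1 * physics_residual(torch.rand(64, 1))
loss.backward()
```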
With the above in mind, we can put forth some speculation on the predictive value of generative AI models in terms of consistency, precision, and explanative value:
The efficiency of prediction can be decomposed into three components: model setup, context setup, and the prediction itself.
The laws of physics are well established and represent human intellectual investment going back many millennia. These are classical models whose data requirements range from none to a very minimal set for parameter tuning. AI model setup, however, requires extensive data gathering and retraining whenever either the training data or the model architecture changes in meaningful ways. Sourcing high quality, comprehensive, annotated, and error-free training data is expensive in both time and resources. In fact, this is the primary motivation to look for synthetic data in the first place!
All AI models learn implicitly from data, with the general mantra being: “the more data the better.” This is why foundation models seem to exhibit unbounded improvement when trained on larger and larger corpora of data. However, not all data is equal – data that is varied and plugs blind spots within the domain of operation is far more valuable than data points that are clustered together or out of domain. There are only two ways to get this data: from the real world, or from other modeling approaches, which may cascade inconsistencies or imprecisions (more on this in the next section).
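For instance, one simple (and purely illustrative) way to operationalize "varied data that plugs blind spots" is greedy farthest-point sampling over feature embeddings, which favors mutually distant samples over clustered, redundant ones:

```python
import numpy as np

def farthest_point_subset(embeddings, k):
    """Greedy farthest-point sampling: pick the k most mutually distant
    samples, favoring variety over clustered, redundant data points."""
    chosen = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return chosen

# 1000 hypothetical feature embeddings of candidate training samples
emb = np.random.default_rng(0).normal(size=(1000, 16))
print(farthest_point_subset(emb, 10))
```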
When it comes to robotics applications, we also have to consider sensors that operate outside what is visible to the human eye (which makes up less than 1% of the electromagnetic spectrum): ultraviolet; short-, mid-, and long-range infrared; sonar; radar. The real world training data corpus for these sensor types is significantly smaller, especially in the public domain. Can generative AI turn to other forms of physically based simulation, such as Falcon, to close these data gaps?
The cost of context setup for traditional approaches, such as digital twin simulation, often boils down to creating a specific context that is particular to the application and data requirements. Since physically based simulation requires precise inputs, there is no direct mapping from natural language, and generalization is limited. This is an area in which generative AI models truly shine. They are highly generalized and, furthermore, can take natural language prompts and example images or videos to establish context in an efficient and intuitive way for model users.
In terms of model prediction, there are precision/quantization tradeoffs that can be made in any model, but all things being equal, it ultimately comes down to compute. Both physically based approaches and neural network processing are highly parallelized and can run on GPUs. Overall, this is likely to be fairly comparable between different modeling approaches.
There is also complex crosstalk between the quality and efficiency characteristics. For example, while an AI model can be tuned or otherwise constrained, this generally comes at the expense of generalization and introduces context setup cost.
Will generative AI subsume all other forms of modeling? While theoretically possible, it is more likely that it becomes an all-encompassing interface to specify the simulation context and then marshals necessary models to create and process the requested data. But here we may be getting out over our skis — the first crucial step is to baseline where generative AI models are today when it comes to physical modeling, and compare their quality and performance against real world ground truth data as well as other physically based simulation approaches.
While it is possible to test generative AI output in a component-wise manner, such as looking at just the shadows or reflections, at Duality we use a more comprehensive approach: the 3i framework [17]. We evaluate synthetic data across three criteria: Indistinguishability, Information Richness, and Intentionality. In our customer engagements, we have found this evaluation framework, which is both model and context agnostic, to be extremely helpful at predicting positive end-to-end outcomes. As soon as our team can access SORA, we are eager to begin this work!
To revisit an earlier word of caution: when AI generated data is used to train other AI models, we also need to consider the carbon copy effect. Programmers are familiar with the cascading impact of imprecision that can snowball into wild numeric instability, as the sketch below illustrates. Similarly, as each generation of AI model is trained on a subset of another AI model’s output, how does the quality deteriorate? In the Middle Ages, before the laws of thermodynamics were firmly established, innovators were obsessed with finding a perpetual motion machine, until the uncompromising laws of physics effectively debunked the effort [18]. Will history look back on the current time as a quest for the perpetual data machine?
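The numeric analogy is easy to demonstrate. In the floating-point sketch below, repeatedly accumulating a value that has no exact binary representation drifts measurably from the true answer; each generation of AI models trained on AI output risks an analogous, though far less predictable, accumulation of error.

```python
# 0.1 has no exact binary floating-point representation, so repeated
# accumulation drifts away from the mathematically true sum.
total = 0.0
for _ in range(1_000_000):
    total += 0.1
print(total)  # roughly 100000.00000133, not exactly 100000.0
```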
Lastly, it is heartening to see the increased focus on (and daunting challenges of) the safety of generative AI: deepfakes, copyright violations, and ethical use. However, through the lens of embodied AI and robotics, safety is a much more literal concept – the lack of consistent and precise AI output can result in bodily harm or loss of critical services. It is fine for the braking distance of a car to look generally plausible in a video; however, a few decimeters can be the difference between a safe stop and a collision. There is no undo button in the real world!
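A back-of-the-envelope calculation (with hypothetical numbers) shows how little slack there is. Using the standard braking-distance formula d = v^2 / (2µg), a small error in the modeled tire friction coefficient translates into a gap far larger than a few decimeters:

```python
# Braking distance d = v^2 / (2 * mu * g); all numbers are illustrative.
v = 13.9   # speed in m/s (about 50 km/h)
g = 9.81   # gravitational acceleration, m/s^2

for mu in (0.70, 0.65):  # true vs. slightly mis-modeled friction coefficient
    d = v ** 2 / (2 * mu * g)
    print(f"mu = {mu:.2f}: braking distance = {d:.2f} m")
# The roughly 1 m spread between these two outputs already exceeds the
# "few decimeters" margin between a safe stop and a collision.
```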
At Duality, we see a dynamic give-and-take between generative AI and digital twin simulation across model training and context creation, combining to produce the highest quality synthetic data for our customers and, in turn, safe and predictable AI models and robots. Ultimately, that is the only true measure of success.
[1] OpenAI, “Sora: Creating video from text.” https://openai.com/sora, 2024.
[2] OpenAI, “Video generation models as world simulators.” https://openai.com/research/video-generation-models-as-world-simulators, 2024.
[3] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun, “Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models.” https://arxiv.org/abs/2402.17177, 2024.
[4] Jonathan Ho, Ajay Jain, Pieter Abbeel, “Denoising Diffusion Probabilistic Models.” https://arxiv.org/abs/2006.11239, 2020.
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need.” https://arxiv.org/abs/1706.03762, 2017.
[6] Duality AI, “Falcon: Digital Twin Simulation Platform.” https://www.duality.ai/product, 2024.
[7] Apurva Shah, “How Will Internet AI Crossover to the Physical World?” https://www.duality.ai/blog/embodied-ai, 2023.
[8] Felipe Mejia, “ViperGPT Takes a Walk in the Park: Evaluating Vision Systems for Embodied AI in FalconCloud.” https://www.duality.ai/blog/vipergpt-for-embodied-ai, 2024.
[9] Hugging Face, “Evaluating Diffusion Models.” https://huggingface.co/docs/diffusers/main/en/conceptual/evaluation, 2024.
[10] Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng, “Sora Generates Videos with Stunning Geometrical Consistency.” https://arxiv.org/abs/2402.17403, 2024.
[11] Ileana María Greca, Marco Antonio Moreira, “Mental, physical, and mathematical models in the teaching and learning of physics.” Science Education, Volume 86, Issue 1, Wiley, 2002.
[12] Nils Thuerey, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, Kiwon Um, “Physics-based Deep Learning.” https://www.physicsbaseddeeplearning.org, 2021.
[13] Rachel Gordon, “From physics to generative AI: An AI model for advanced pattern generation.” MIT News, September 27, 2023.
[14] Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, “Consistency Models.” https://arxiv.org/abs/2303.01469, 2023.
[15] L. Kryeziu and V. Shehu, "A Survey of Using Unsupervised Learning Techniques in Building Masked Language Models for Low Resource Languages," 11th Mediterranean Conference on Embedded Computing, 2022.
[16] Francesco Leacche, Roberto De Ioris, Amey Godse, Apurva Shah, “The Digital Twin Encapsulation Standard: An Open Standard Proposal for Simulation-Ready Digital Twins.” Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), 2023.
[17] Duality AI, “Is Your Machine Learning Model Bingeing on Junk Data?” https://www.duality.ai/blog/ml-synthetic-data-model, 2022.
[18] Paul Scheerbart, “The Perpetual Motion Machine: A Story of Invention.” Wakefield Press, 2011.