In What is the Metaverse?, we provided an overview of the diverse contexts in which the metaverse comes up. Its applications and disruptive influence span myriad areas including social interaction; online games & entertainment; virtual, augmented & mixed realities; the future of the internet; and tokenized & gig economies.
In this blog, we will dive into the interaction and technical details. We will identify the commonality between these seemingly disparate ideas through the lens of interaction design and put forward a unified, practice-oriented definition of the metaverse.
The core interaction loop consists of a user interacting with the metaverse's simulated three-dimensional (3D) environment by receiving a perception stream and in turn sending a stream of actions. On the user side, a sensory interface converts metaverse signals into a form suitable for human perception; conversely, an actuation interface turns human directives into simulation commands and actions.
We discuss each component of the core interaction loop using the example of a single-player role-playing game (RPG) that the user interacts with on their personal computer (PC) or game console.
The metaverse environment can be a castle, an enchanted forest or even an entire planetary system. The environment does not have to be just static models -- a castle can have flags fluttering in the wind and a forest can have a brook rippling through it. The world of the metaverse can have properties that change continuously, such as the time of day, a thunderstorm or a forest fire. The evolution of the environment, which can happen many times a second, is managed by the metaverse simulation engine.
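To make the simulation engine's role concrete, here is a minimal sketch in Python. The `Environment` class, the `step` function and the specific rates are illustrative assumptions, not the API of any particular engine; the point is simply that the engine advances continuously changing properties on every tick.

```python
from dataclasses import dataclass

@dataclass
class Environment:
    """Hypothetical world state with continuously changing properties."""
    time_of_day_hours: float = 12.0   # simulated clock, 0-24
    storm_intensity: float = 0.0      # 0 = clear skies, 1 = full thunderstorm

def step(env: Environment, dt_seconds: float) -> None:
    """Advance the environment by one simulation tick of dt_seconds."""
    # One simulated day per real hour -- an arbitrary rate chosen for the sketch.
    env.time_of_day_hours = (env.time_of_day_hours + dt_seconds * 24 / 3600) % 24
    # Slowly build up a storm to illustrate gradual, continuous change.
    env.storm_intensity = min(1.0, env.storm_intensity + 0.001 * dt_seconds)

# The engine runs this tick many times a second, e.g. 60 times here:
env = Environment()
for _ in range(60):
    step(env, dt_seconds=1 / 60)
```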
A user joins the metaverse as a character playing a specific role, referred to as an avatar in gaming. The character embodies the user's physical presence inside the 3D metaverse environment and is bound to that user for the duration of their metaverse session. In the castle metaverse, the user could play the character of a soldier guarding the ramparts or a knight competing in a joust. Similarly, in the enchanted forest, the user could embody an elf archer or a hobbit builder.
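In code, "bound to that user for the duration of their session" might look something like the sketch below; the `Avatar` and `Session` names are made up purely for illustration.

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Avatar:
    """The user's embodied character inside the 3D environment."""
    role: str                          # e.g. "knight" or "elf archer"
    position: tuple = (0.0, 0.0, 0.0)  # location in the 3D world

@dataclass
class Session:
    """Binds one user to one avatar for the lifetime of their visit."""
    user_id: str
    avatar: Avatar
    session_id: str = field(default_factory=lambda: uuid4().hex)

session = Session(user_id="alice", avatar=Avatar(role="knight"))
```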
To establish a perception stream, a first-person game attaches a virtual camera and microphone to the avatar's head to capture an image and audio samples from their point of view (POV). The simulated processes by which a virtual camera or microphone works, i.e. rendering and audio mixing, are analogous to their physical counterparts. These perception streams are played back on the user's monitor and speakers. The user processes these sensory signals and decides how to navigate within the environment or take some other action. They use the keyboard & mouse or joystick to actuate actions that are sent back to the metaverse simulation engine to articulate the user's character or effect other changes in the environment under the character's direct control.
The interaction loop then starts all over again. This process can repeat 60 or even 120 times a second, creating an illusion of real-time continuity and direct control, similar to how film frames in cinema create the illusion of continuous motion.
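Putting the loop together in code, a minimal sketch might look like the following. The `render_pov`, `play_to_user` and `read_input` functions are stand-ins for the rendering, playback and input-device layers -- assumptions made for illustration, not a real engine's API.

```python
import time

TARGET_HZ = 60  # iterations of the interaction loop per second

def render_pov(position):
    """Stand-in for the virtual camera: produce the user's point-of-view frame."""
    return {"position": position}

def play_to_user(frame):
    """Stand-in for the monitor and speakers (the sensory interface)."""
    pass

def read_input():
    """Stand-in for keyboard/mouse/joystick polling (the actuation interface)."""
    return (1.0, 0.0, 0.0)  # pretend the user keeps pressing "move forward"

def interaction_loop(iterations=120):
    frame_budget = 1.0 / TARGET_HZ
    position = (0.0, 0.0, 0.0)
    for _ in range(iterations):
        start = time.perf_counter()
        play_to_user(render_pov(position))   # perception arc
        dx, dy, dz = read_input()            # action arc
        position = (position[0] + dx, position[1] + dy, position[2] + dz)
        # Sleep off whatever remains of the ~16.7 ms frame budget.
        time.sleep(max(0.0, frame_budget - (time.perf_counter() - start)))

interaction_loop()
```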
The core interaction loop is the foundation of all real-time human-computer interaction (HCI), whether it's a video game or the World Wide Web. For our purposes, it provides a common lens through which we can look at all types of metaverse experiences.
The degree to which a user feels like they are actually inside the 3D environment determines how immersive the metaverse experience is. While superficially this is tied to realism, the ability to make the user willingly suspend disbelief and lose themselves in a book, movie or game is a subtle creative art. For video games, and by extension the metaverse, game mechanics are a critical design tool for enhancing immersion.
Another important aspect of immersion is latency. This is the end-to-end delay of the interaction loop, for example, how quickly the virtual camera image updates as the user commands their hobbit's head to turn. Unlike film or video, where 24 or 30 frames per second are sufficient to create an illusion of continuous motion, the latency with which the metaverse simulator responds to user commands, especially with VR headsets, has to be under 10 ms -- in other words, 100 frames per second or more.
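A quick back-of-the-envelope check makes the difference in budgets obvious (plain arithmetic, no engine specifics):

```python
def frame_budget_ms(frames_per_second: float) -> float:
    """Time available for one full interaction-loop iteration, in milliseconds."""
    return 1000.0 / frames_per_second

print(frame_budget_ms(24))   # ~41.7 ms -- film
print(frame_budget_ms(60))   # ~16.7 ms -- a typical game or monitor refresh
print(frame_budget_ms(100))  # 10.0 ms  -- the kind of budget VR headsets demand
```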
While immersion focuses on the perception arc of the core interaction loop, similar considerations also apply to the action arc. HCI is dominated by screen-based affordances and the Web's two-dimensional canvas. However, as humans we interact with the world around us using a rich set of natural affordances that take advantage of our entire body, rather than just scrolling the mouse with our wrist or pecking with our fingers. In the metaverse, the user can control a full-body, 3D presence through their character. This opens up opportunities to interact with what we call direct presence. In our RPG example, instead of using a 2D map to set a target location inside the castle, the user commands their knight to walk or run to it. This extends to the actuation interface as well: in the hobbit head-turn example, a user's VR headset tracks their head motion with sensors and directly drives the hobbit's head position inside the metaverse simulation, as sketched below.
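Here is a minimal sketch of that last point, assuming the headset reports its orientation as yaw, pitch and roll angles; `read_headset_orientation` and `AvatarHead` are hypothetical names used only to illustrate the direct mapping.

```python
from dataclasses import dataclass

@dataclass
class AvatarHead:
    """Orientation of the character's head inside the simulation, in degrees."""
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0

def read_headset_orientation():
    """Stand-in for the VR headset's tracking sensors."""
    return (15.0, -5.0, 0.0)  # the user has turned slightly right and looked down

def drive_avatar_head(head: AvatarHead) -> None:
    """Map the user's physical head motion directly onto their character."""
    head.yaw, head.pitch, head.roll = read_headset_orientation()

head = AvatarHead()
drive_avatar_head(head)  # called once per loop iteration, 100+ times a second in VR
```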
Until now, we have only discussed a single-user metaverse. What if we would like multiple users in a metaverse at the same time, so that they all share the same 3D environment context and are aware of each other's presence? This would take our RPG example and turn it into a massively multiplayer online role-playing game (MMORPG). In an MMORPG, our castle would be populated by many soldiers and knights led by a king, with a requisite princess locked away in the highest tower.
In order for all the users to share a metaverse, their 3D environments must be synchronized with each other. Client-server architectures are common in cloud computing, including the Web, where users connect to a centralized server by specifying its URL and exchanging data with it over the internet through their (client) browser. For a shared metaverse, it is the responsibility of the metaverse server to exchange information about the environment with individual users' metaverse clients to enforce synchronization between them. This synchronization, in turn, creates the illusion that all the participants are in the same space at the same time.
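The sketch below shows the idea in its simplest possible form -- an in-memory server that holds the authoritative state and fans every update out to connected clients. Real systems deal with networking, conflict resolution and much more; the `MetaverseServer` and `Client` classes here are purely illustrative.

```python
class MetaverseServer:
    """Holds the authoritative environment state and pushes updates to clients."""

    def __init__(self):
        self.state = {"drawbridge": "closed"}
        self.clients = []

    def connect(self, client):
        self.clients.append(client)
        client.state = dict(self.state)   # bring a newly joined client up to date

    def apply_update(self, key, value):
        """Accept a change from one user and synchronize it to everyone."""
        self.state[key] = value
        for client in self.clients:
            client.state[key] = value

class Client:
    """One user's local copy of the shared environment."""
    def __init__(self):
        self.state = {}

server = MetaverseServer()
knight, soldier = Client(), Client()
server.connect(knight)
server.connect(soldier)
server.apply_update("drawbridge", "open")
assert knight.state == soldier.state  # both users now perceive the same castle
```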
As with immersion, reducing the latency with which metaverse clients synchronize their current state is key to establishing a shared experience. Synchronization latency is a function of the network speed of the internet and the efficiency with which the metaverse simulation engine is able to communicate and incorporate environmental updates.
Besides the ability to synchronously share an environment, a metaverse also needs persistence. Persistence is so basic and ubiquitous in our physical world that we don't even think about it; however, violating it swiftly breaks the user's sense of immersion and presence as their mental model of the environment diverges from their metaverse experience. We express persistence as the continuity of time, space and state. For the user, persistence operates at two levels -- continuity while they are within the experience and continuity of the world between sessions. An example of persistence inside the experience is how the environment's lighting changes to show the passage of time. An example of persistence between sessions and over a longer time horizon is the wearing down of the environment with repeated use.
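One simple way to picture persistence between sessions is serializing the world state when the user leaves and restoring it when they return. The sketch below uses a plain JSON file and made-up field names purely for illustration.

```python
import json
from pathlib import Path

SAVE_FILE = Path("castle_world.json")  # illustrative location for persisted state

def save_world(state: dict) -> None:
    """Persist the environment so it survives between sessions."""
    SAVE_FILE.write_text(json.dumps(state))

def load_world() -> dict:
    """Restore the environment; fall back to a fresh world on the first visit."""
    if SAVE_FILE.exists():
        return json.loads(SAVE_FILE.read_text())
    return {"rampart_wear": 0.0, "time_of_day": 9.0}

world = load_world()
world["rampart_wear"] += 0.01   # the ramparts wear down a little with every visit
save_world(world)
```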
Closely tied to the idea of persistence of state is personalization, especially in socially motivated metaverses. Game mechanics can support building in the environment or customizing the character's persona. Users expect that the investment they make in personalizing their avatars or environments is preserved and can be shared with others through persistence of state.
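Personalization then becomes just another piece of persisted state. A rough sketch, with made-up fields, of what a persisted avatar persona might contain:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class AvatarPersona:
    """Customizations the user has invested in and expects to keep."""
    display_name: str
    armor_color: str = "steel grey"
    banner_emblem: str = "none"
    built_structures: list = field(default_factory=list)

persona = AvatarPersona(display_name="Sir Alice", armor_color="crimson")
persona.built_structures.append("watchtower by the east gate")

# Stored alongside the rest of the world state, so other users see the same persona.
persisted = asdict(persona)
```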
Building on the foundation of the core interaction loop and the concepts of immersion, presence and shared context, we can finally define the metaverse:
The metaverse is a synchronously shared and persistent three-dimensional context where users, embodied as characters, navigate immersively and interact through direct presence.
This unified definition applies to the diversity of metaverses we discussed earlier and is stripped of implementation and technical concerns, making it suitable for collaboration between practitioners coming from different backgrounds and perspectives.
In this blog, we started with an understanding of the core interaction loop and arrived at a common, practice-oriented definition of the metaverse.
In the interest of clarity, I have simplified several topics. For example, synchronization between metaverse clients is not limited to client-server architectures and is an open area of research. From an interaction perspective, the relationship between skeuomorphism and interaction through direct presence needs elaboration. We also glossed over the difference between an autonomous avatar and an AI agent in the metaverse environment. In future blogs, I hope to revisit some of these topics and discuss them in greater detail.
Concepts such as immersion, persistence and synchronization, along with others we have not yet discussed, are the key characteristics of a metaverse. Developing a metaverse requires carefully designing each of these dimensions while keeping implementation constraints in mind. Alternatively, how a metaverse operates and its intended application can be understood by studying these key characteristics. In the next blog, we present a metaverse design framework for exactly this kind of development and analysis.
A very special thanks to Karan Parikh, Elena Macomber and Amey Godse for their review and feedback.