Sunday, February 18, 2024

I am become life, the creator of worlds?

Can words now create worlds? Has AI suddenly acquired God-like creation qualities?

A short video of two black pirate ships sailing in a stormy sea of coffee, within a coffee cup, has caused a splash by showing what can be done with video, but it has also created a stir by suggesting something astonishing. My son is a games player and AI expert. His immediate reaction on seeing OpenAI's Sora videos was to note the slightly too-perfect, gamesy feel of the images.

Could OpenAI have developed something truly astonishing here – a 3D world simulator? Is this the converse of Oppenheimer’s famous statement, 

I am become life, the creator of worlds?

To date, generative AI has been limited to text and images and has lacked a model of the world: of 3D space, time, causality and action.

This clip shows a video generated from a prompt, astounding in itself, but what it may reveal is the following:

The possibility, in future, of a physics engine that understands how objects behave, in this case the two pirate ships that never collide. That they do not collide is relevant, as the model must in some sense track the positions of the two boats at all times in a 3D space. It is physics that grounds models in reality. Note that this does not work by having a physics and collision engine; rather, the training data from computer games will have been created using such tools.

The behaviour of the ships on the water suggests possible future detailed knowledge of fluid dynamics, as the coffee whirls around and even creates waves and froth. Again, it is not being created from the mathematics of fluid dynamics but by a clever diffusion model.

Cup size and the limits of the cup space, showing a knowledge of small objects and the ability to scale two very large objects down into a small space.

Sharp realism with correct lighting and shadows is also astonishing. This is not a rendering engine but, again, a trained diffusion model.
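
To make the contrast with a games engine concrete, here is a minimal sketch of the kind of explicit collision check such an engine might run on every frame. It is purely illustrative: the `Ship` class, the bounding-circle simplification and the `collides` function are hypothetical names invented for this example, not anything taken from Sora or Unreal.

```python
import math
from dataclasses import dataclass


@dataclass
class Ship:
    """Hypothetical ship, approximated by a bounding circle on the water plane."""
    x: float
    y: float
    radius: float


def collides(a: Ship, b: Ship) -> bool:
    """Return True if the two bounding circles overlap."""
    distance = math.hypot(a.x - b.x, a.y - b.y)
    return distance < a.radius + b.radius


# Two ships in the cup: far apart, then close enough to overlap.
far_apart = collides(Ship(0.0, 0.0, 1.0), Ship(3.0, 0.0, 1.0))
overlapping = collides(Ship(0.0, 0.0, 1.0), Ship(1.5, 0.0, 1.0))
```

A diffusion model contains no rule like this. If the ships in the video never collide, it is presumably because the model has seen a great deal of footage in which rules of this kind were enforced.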

There are suggestions that this could have been trained using data from Unreal, the games engine, in particular synthetic data from that engine. YouTube and other sources are also clearly in there. This means it is trained on a combination of real and virtually created worlds. There also seems to be a time component. This is interesting, as that variable is missing in other modes.

If they have created such a thing, this is far more than just video creation. It is a step towards the idea of creating 3D worlds using AI, something I mentioned in my book on Learning in the Metaverse. Being able to create any 3D world is a far bigger deal than video, as it opens the way for another revolution in media and learning. We are nowhere near that yet.

In truth there are two opposing routes to solving this problem, and both were released this week - OpenAI/Sora v Meta/V-JEPA. OpenAI has developed Sora, recognised for its text-to-video and video-to-video modelling capabilities, aiming ultimately to create a world simulator. However, Meta's AI chief, Yann LeCun, criticises this method, considering it impractical and likely to be unsuccessful. He contends that generative models are not suitable for processing sensory inputs due to the high level of prediction uncertainty associated with high-dimensional continuous sensory data.

In response, LeCun has introduced his own AI model, V-JEPA. This model uses a non-generative approach and is designed to predict and interpret complex interactions. Its primary function is to understand the dynamics of objects and their interactions, thereby enhancing the AI's comprehension of these elements.

We are 3D people, living in a 3D world doing 3D things with other 3D people and 3D things. Yet, bizarrely, most teaching and training is from the flat, 2D page: text, books, graphics, PowerPoints and screens of e-learning. This has always been largely suboptimal and prevents the actual learning of skills and their transfer.

In the beginning was the word and now we, like small Gods, can use the word to create new worlds. We are in dialogue with the world to create new ones. The simple act of saying something can make it appear and breathe life into that world. I find that more than interesting, it is staggering.

We may have, in this tool, the ability to create worlds, any world, on any scale, in 3D, simply by asking for it. If so, a threshold has been crossed. We will be able to create worlds in which we work, interact and get things done. Also worlds in which we teach, train and learn. Even worlds in which we socialise and get entertained. We may be doing what has only ever been done on a limited scale in incredibly expensive simulators and computer games: understanding and creating new worlds. Multimodal may now mean a grand convergence.
