Sora, OpenAI's bet on creating videos from text

by Times News CR

2024-04-04 12:13:58

On February 15, OpenAI, the hybrid company that counts personalities such as Sam Altman and Elon Musk among its founders, announced Sora, its model for creating video from text. It is a launch that follows its other star products such as ChatGPT, DALL-E and GPT-4.

"Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user's instructions," indicates the section on the model's capabilities. "(...) it is only available to red teamers to assess critical areas for harms or risks." However, they add, they are also granting access to visual artists, filmmakers and designers to gather critical feedback on how to advance the model.

As the published videos show, always accompanied by their prompts, that is, the descriptions of what the model is asked to produce, Sora generates "complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background." To do this, after extensive training, the model understands not only "what the user has asked for in the prompt, but also how those things exist in the physical world."


OpenAI’s research on this model

The report, which can be read and studied in full on OpenAI's website, focuses on two main aspects: the company's method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and the qualitative evaluation of Sora's capabilities and limitations.

Although the model and implementation details are not disclosed, the report does indicate the kinds of methods that have been explored for this problem, for example: recurrent networks, generative adversarial networks, autoregressive transformers and diffusion models.

They also emphasize that, whereas LLMs (Large Language Models) have text tokens, Sora has visual patches, as they found that patches "are an effective and highly scalable representation for training generative models on diverse types of videos and images."
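To make the idea of visual patches more concrete, here is a minimal sketch, not OpenAI's code, that cuts a video tensor into non-overlapping spacetime patches, the visual analogue of text tokens. The patch sizes, tensor shapes and function name are illustrative assumptions.

```python
import numpy as np

def video_to_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Illustrative sketch: split a video (frames, height, width, channels)
    into non-overlapping spacetime patches, flattened into "visual tokens".
    The sizes are assumptions, not values published by OpenAI."""
    T, H, W, C = video.shape
    # Trim so each dimension divides evenly into whole patches.
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    video = video[:T, :H, :W]
    patches = (
        video.reshape(T // patch_t, patch_t, H // patch_h, patch_h, W // patch_w, patch_w, C)
        .transpose(0, 2, 4, 1, 3, 5, 6)                 # group by (t, y, x) patch index
        .reshape(-1, patch_t * patch_h * patch_w * C)   # one flat vector per patch
    )
    return patches  # shape: (num_patches, patch_dim)

# A 64-frame, 256x256 RGB clip becomes a sequence of patch "tokens".
clip = np.random.rand(64, 256, 256, 3).astype(np.float32)
tokens = video_to_spacetime_patches(clip)
print(tokens.shape)  # (4096, 3072): 16*16*16 patches, each 4*16*16*3 values
```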


Likewise, describing the training of a network that reduces the dimensionality of visual data, they explain that the "network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially." It is here that Sora does its work: once trained, the model "generates videos within this compressed latent space."
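As a rough illustration of what compressing video "both temporally and spatially" can look like, below is a sketch of a small convolutional encoder; it is not the architecture from the report, and the strides and channel counts are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    """Illustrative encoder that compresses a video in time and space.
    Strides and channel counts are assumptions made for this sketch."""
    def __init__(self, in_channels=3, latent_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            # Each strided Conv3d halves the temporal and spatial resolution.
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, video):  # video: (batch, channels, frames, height, width)
        return self.encoder(video)

x = torch.randn(1, 3, 64, 256, 256)   # 64-frame 256x256 RGB clip
z = VideoCompressor()(x)
print(z.shape)  # torch.Size([1, 16, 8, 32, 32]) -- compressed latent video
```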

The research also covers latent spacetime patches, scaling transformers for video generation (one of the most striking aspects of how it works), variable durations, resolutions and aspect ratios, sampling flexibility, improved framing and composition, and language understanding, among other topics.
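The report describes Sora as a diffusion transformer: given noisy latent patches, it is trained to predict the original "clean" patches. The following is a deliberately crude sketch of that training intuition, with a single noise level, no text conditioning, and made-up dimensions (reusing the illustrative patch size from the sketch above); it is not the actual training recipe.

```python
import torch
import torch.nn as nn

class PatchDenoiser(nn.Module):
    """Illustrative diffusion-style transformer over latent spacetime patches.
    Dimensions, depth and the noise handling are assumptions for this sketch."""
    def __init__(self, patch_dim=3072, model_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        self.embed = nn.Linear(patch_dim, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.unembed = nn.Linear(model_dim, patch_dim)

    def forward(self, noisy_patches):  # (batch, num_patches, patch_dim)
        return self.unembed(self.backbone(self.embed(noisy_patches)))

# Training intuition: corrupt clean patches with noise, predict the clean ones back.
clean = torch.randn(2, 256, 3072)   # latent patches from a compressed video
noisy = clean + torch.randn_like(clean)   # a single, crude noise level
predicted_clean = PatchDenoiser()(noisy)
loss = nn.functional.mse_loss(predicted_clean, clean)
print(loss.item())
```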

Another relevant aspect is the animation of DALL-E images, since Sora can generate videos from an image and a prompt. The examples shown by the platform resemble short introductory animations, perhaps something like a GIF or an animated splash screen.


The videos section highlights three notable capabilities of the tool. First, extending generated videos, that is, extending a clip backward or forward in time from the same segment. Second, video-to-video editing, applied through SDEdit, which "enables Sora to transform the styles and environments of input videos zero-shot." Finally, connecting videos, meaning a gradual interpolation between two input videos, "creating seamless transitions between videos with entirely different subjects and scene compositions."
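For the video-connecting capability, the underlying idea is a weight that gradually shifts from one clip to the other. The sketch below is a much simpler stand-in, a raw-pixel crossfade with assumed shapes and frame counts, since Sora performs the interpolation in its learned latent space and regenerates the video rather than blending pixels.

```python
import numpy as np

def connect_videos(video_a, video_b, transition_frames=24):
    """Illustrative sketch of gradually interpolating between two clips.
    A real system like Sora works in a learned latent space; here we simply
    crossfade raw pixels to show the idea of a weight ramping from A to B."""
    T, H, W, C = video_a.shape
    assert video_b.shape == (T, H, W, C), "clips must share a shape for this sketch"
    out = []
    start = T - transition_frames
    for t in range(T):
        # Weight ramps from 0 (all A) to 1 (all B) over the transition window.
        w = 0.0 if t < start else (t - start + 1) / transition_frames
        out.append((1 - w) * video_a[t] + w * video_b[t])
    return np.stack(out)

a = np.random.rand(48, 64, 64, 3)
b = np.random.rand(48, 64, 64, 3)
blended = connect_videos(a, b)
print(blended.shape)  # (48, 64, 64, 3)
```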

Looking toward television, video games and cinema, perhaps not as a starting point but as a destination, the research also details emerging simulation capabilities, broadly speaking the "capabilities (that) allow Sora to simulate some aspects of people, animals, and environments from the physical world." It highlights how these "properties emerge without any explicit inductive biases for 3D, objects," as they are "purely phenomena of scale." The capabilities described in the research are: 3D consistency, long-range coherence and object permanence, interacting with the world, and simulating digital worlds.


"We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and of the objects, animals and people that live within them," they conclude.

