Alibaba Cloud Invests $290 Million in ShengShu to Develop AI World Models

by Mark Thompson

Alibaba Cloud is placing a significant bet on the next evolution of artificial intelligence, moving beyond the text-heavy logic of chatbots to systems that can actually “understand” the physical world. In a move to capture this shift, the company led a 2 billion yuan ($290 million) investment in ShengShu, the startup responsible for the Vidu AI video generation tool.

This strategic move marks a pivot toward “world models”—AI architectures designed to replicate real-world physics and scenarios using video and multimodal data rather than relying primarily on the text-based training used by Large Language Models (LLMs) like OpenAI’s ChatGPT. The Series B funding round also saw participation from Baidu Ventures and TAL Education.

The investment follows a rapid acceleration in ShengShu's capitalization. Just two months prior, the three-year-old startup secured 600 million yuan (over $80 million) in a Series A round backed by Qiming Venture Partners. While the company has declined to disclose its current valuation, the frequency and scale of these infusions suggest a high-stakes race to dominate the "embodied AI" sector.

A mechanical hand on display at the Robot Mall, billed as the world's first embodied intelligent robot 4S store, on August 13, 2025 in Beijing, China. (VCG | Visual China Group | Getty Images)

Bridging the Gap Between Digital and Physical AI

For the past few years, the AI gold rush has been defined by LLMs. These systems are exceptional at reasoning and language, but they often struggle with "common sense" regarding the physical world—such as how an object falls or how a robot should navigate a crowded room. This is where Alibaba's $290 million investment in ShengShu's world-model initiative comes into play.

ShengShu is developing a “general world model” that integrates vision, audio, and touch. According to a statement from the startup, this multimodal approach more naturally captures the mechanics of the physical world than traditional language models. The goal is to create a seamless bridge between the digital realm—including AI-generated video and gaming—and the physical realm of autonomous driving and robotics.

Zhu Jun, the founder of ShengShu, emphasized that the objective is to “connect perception and action.” By doing so, AI systems can better model and predict real-world behavior consistently, which is a prerequisite for any machine that must operate safely and efficiently in a human environment.

The Competitive Landscape of AI Video

ShengShu’s technical capabilities are already being benchmarked against the world’s leading models. Its Vidu Q3 Pro model, released in January, currently ranks among the top 10 AI models for generating video from text and images according to Artificial Analysis.

The timing of Vidu’s global launch was notable, arriving months before OpenAI made its Sora tool widely available (a tool that OpenAI has since shuttered). This puts ShengShu in direct competition with other Chinese giants, including ByteDance and Kuaishou, both of which have released competing AI video generation tools.

Alibaba’s Broader Strategy for Embodied AI

This investment is not an isolated event but part of a broader pattern of portfolio expansion by Alibaba. The e-commerce giant is aggressively diversifying its AI bets to ensure it isn’t just a provider of cloud computing, but a leader in the physical application of AI.

In recent months, Alibaba has led several other key investments in the “spatial AI” sector:

  • Tripo AI: Alibaba and Baidu Ventures led a $50 million investment in this platform, which generates 3D models from photographs and is moving toward AI grounded in physical space.
  • PixVerse: In September, Alibaba led a $60 million investment in PixVerse, which allows users to direct the unfolding of a video in real-time.
  • Open Source Robotics: In February, Alibaba released a specific AI model designed to power robots, complementing its free, open-source video generation models.

Alibaba's Recent Spatial AI Investments

Company   | Investment Amount  | Primary Focus
ShengShu  | $290 Million (Led) | General World Models & Vidu Video AI
PixVerse  | $60 Million (Led)  | Real-time Video Direction
Tripo AI  | $50 Million (Led)  | 3D Model Generation

Why World Models Matter for Robotics

The industry shift toward world models is driven by a fundamental limitation in current AI. Kevin Kelly, co-founder of Wired, has noted that replicating human intelligence requires three distinct pillars: reasoning, continuous learning, and an understanding of the physical world.

While LLMs have largely solved the “knowledge” and “reasoning” elements, the “physical understanding” remains the final frontier. Without a world model, a humanoid robot cannot truly understand the consequence of a physical action—it can only predict the next most likely word or pixel. By investing in ShengShu, Alibaba is betting that the path to true artificial general intelligence (AGI) runs through the physical world.

ShengShu has already established strategic partnerships with companies developing "embodied AI"—humanoid robots designed for use in industrial, commercial, and home settings. These robots require the predictive power of a world model to interact with unpredictable human environments without causing damage or failure.

As these technologies move from the lab to the factory floor, the next critical checkpoint will be the integration of these general world models into commercial robotics hardware. Further updates on the deployment of Vidu’s technology in autonomous systems are expected as the Series B funding is utilized for development.

This article is for informational purposes only and does not constitute financial advice.
