AI video models try to mimic real-world physics — but they don't understand it

AI tiger walking in snow
(Image credit: Haiper)

AI video generators can’t understand the laws of physics solely by watching videos, scientists have found.

Coming hot on the heels of chatbots and image generators, AI video generators like Sora and Runway have already been delivering impressive results. But a team of scientists from Bytedance Research, Tsinghua University, and Technion were curious to learn if such models could discover physical laws from visual data without any additional human input.

In the real world, we understand physics through math. In the world of video generation, an AI model that understands physics should be able to watch a sequence of frames and then predict which ones come next — both for scenes resembling those it has seen before and for unfamiliar ones.

To find out whether this understanding exists, the scientists built a 2D simulation using simple shapes and movements and generated hundreds of thousands of mini videos for their model to train and be tested on. They found that the models could 'mimic' physics but not understand it.

Is SORA really a world model? - YouTube

The three fundamental physical laws they chose to simulate were the uniform linear motion of a ball, the perfectly elastic collision between two balls, and the parabolic motion of a ball.
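These three laws are simple enough to express in a few lines of code each. As a rough illustration (this is not the researchers' actual simulation code, just a minimal sketch of the underlying equations), each can be written as an update rule a simulator would apply frame by frame:

```python
# Minimal sketch of the three physical laws studied, written as
# per-timestep update rules. Hypothetical helper names, not from the paper.

def uniform_motion(x, v, dt):
    """Uniform linear motion: position advances at constant velocity."""
    return x + v * dt

def elastic_collision(m1, v1, m2, v2):
    """Perfectly elastic 1D collision: conserves momentum and kinetic
    energy; returns the post-collision velocities of both balls."""
    v1_new = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    v2_new = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return v1_new, v2_new

def parabolic_motion(x, y, vx, vy, dt, g=9.81):
    """Parabolic (projectile) motion: constant horizontal velocity,
    constant downward gravitational acceleration on the vertical axis."""
    x_new = x + vx * dt
    y_new = y + vy * dt - 0.5 * g * dt**2
    return x_new, y_new, vx, vy - g * dt
```

The point of choosing such simple laws is that every frame of the training videos is fully determined by a handful of numbers, so any failure to predict the next frame reflects the model, not ambiguity in the data.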

According to the team's pre-print paper, the shapes behaved as they should in simulations resembling the data the models were trained on, but failed to behave properly in new, unseen scenarios. At best, the models tried to mimic the closest training example they could find.

During the course of their experiments, the scientists also observed that the video generator often changed one shape into another (e.g. a square randomly turns into a ball) or made other nonsensical adjustments. The model's priorities appeared to follow a clear hierarchy, with color holding the highest importance, followed by size, and then velocity. Shape received the least emphasis.

Have they found a solution?

“It is challenging to determine whether a video model has learned a law instead of merely memorizing the data,” the researchers said. They explained that since the model’s internal knowledge is inaccessible, they could only infer the model’s understanding by examining its predictions on unseen scenarios.

“Our in-depth analysis suggests that video model generalization relies more on referencing similar training examples rather than learning universal rules,” they said, highlighting this happens regardless of the amount of data a model trains on.

Have they found a solution? Not yet, lead author Bingyi Kang wrote on X. “Actually, this is probably the mission of the whole AI community,” he added.

Christoph Schwaiger

Christoph Schwaiger is a journalist who mainly covers technology, science, and current affairs. His stories have appeared in Tom's Guide, New Scientist, Live Science, and other established publications. Always up for joining a good discussion, Christoph enjoys speaking at events or to other journalists and has appeared on LBC and Times Radio among other outlets. He believes in giving back to the community and has served on different consultative councils. He was also a National President for Junior Chamber International (JCI), a global organization founded in the USA. You can follow him on Twitter @cschwaigermt.