A Functional Taxonomy of World Models
Dr. Fei-Fei Li
Systems called "world models" today fall into three kinds: renderers that output pixels, simulators that output structurally faithful state, and planners that output actions. All three are projections of the same POMDP loop and draw on the same underlying knowledge of geometry, physics, and dynamics.
The taxonomy matters because the field uses one phrase for three different output contracts, and conflating them lets visually impressive demos pass as physical understanding. The simulator — the one whose contract is structural rather than visual — gets the least attention and carries the most weight for anything beyond watching: it is what programs and planners actually compute on. Sorting the category cleanly is the precondition for arguing about which systems are close to general world modeling and which are just good at pixels.
Geometry, physics, and dynamics sit beneath rendering, simulation, and planning alike. A model that can render a cup from any angle ought, in principle, to simulate it being pushed and plan a hand to pick it up.
Of the three categories, the simulator gets the least public attention but matters the most. The essay sets out to correct this asymmetry.
The agent-action-state-observation loop from reinforcement learning, formalized as the partially observable Markov decision process, is where "world model" got its modern technical meaning. The different things called world models today are projections of this loop, each outputting a different piece.
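The loop described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the essay's formalism: the world is a hypothetical 1-D point mass, the renderer is reduced to a function from state to observation, and the names `State`, `simulate`, `render`, `plan`, and `run_loop` are invented for this example. It only shows how the three roles each emit a different piece of the same loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hidden ground truth of the toy world: a 1-D point mass (assumption)."""
    position: float
    velocity: float


Observation = float  # what the agent sees: position only, velocity stays hidden
Action = float       # what the agent does: a force applied to the mass


def simulate(state: State, action: Action, dt: float = 0.1, mass: float = 1.0) -> State:
    """Simulator role: (state, action) -> next state, via semi-implicit Euler."""
    velocity = state.velocity + (action / mass) * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def render(state: State) -> Observation:
    """Renderer role: state -> observation (a partial view of the state)."""
    return state.position


def plan(observation: Observation, goal: float, gain: float = 1.0) -> Action:
    """Planner role: (observation, goal) -> action, here a proportional push."""
    return gain * (goal - observation)


def run_loop(state: State, goal: float, steps: int) -> list[Observation]:
    """Run the state -> observation -> action -> state loop for `steps` ticks."""
    observations = []
    for _ in range(steps):
        obs = render(state)              # the agent never touches `state` directly
        observations.append(obs)
        action = plan(obs, goal)
        state = simulate(state, action)
    return observations


if __name__ == "__main__":
    print(run_loop(State(position=0.0, velocity=0.0), goal=1.0, steps=5))
```

Because the planner only receives `render(state)`, it never sees velocity, which is the partial observability the POMDP framing refers to. The proportional controller is a placeholder and makes no claim to reach the goal stably.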
A renderer turns inputs into pixels, and its contract is visual fidelity. Text-to-video systems and interactive frame generators such as Genie 3 and RTFM fit here: they produce what a viewer would see, not what is actually there.
A simulator produces a geometrically, physically, and dynamically faithful representation that both humans and programs can compute on. Its contract is structural — geometry that holds up, physics that respects Newton's laws — not merely visual.
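One way to make the difference between a structural and a visual contract concrete is to test a trajectory against Newton's second law rather than against how it looks. The sketch below is an illustrative assumption, not a method from the essay: `newton_residual` is a hypothetical helper that estimates acceleration from sampled 1-D positions with a central second difference and reports the worst violation of F = m·a.

```python
def newton_residual(positions: list[float], dt: float, force: float, mass: float) -> float:
    """Return max |mass * a - force| over the trajectory.

    `a` is estimated from three consecutive position samples with a central
    second difference. A small residual means the motion is consistent with
    a constant applied force; a large one means it only looks like motion.
    """
    if len(positions) < 3:
        raise ValueError("need at least three samples to estimate acceleration")
    worst = 0.0
    for i in range(1, len(positions) - 1):
        accel = (positions[i + 1] - 2.0 * positions[i] + positions[i - 1]) / dt**2
        worst = max(worst, abs(mass * accel - force))
    return worst


if __name__ == "__main__":
    g, dt = 9.81, 0.1
    # Free fall under gravity: structurally faithful.
    falling = [-0.5 * g * (i * dt) ** 2 for i in range(11)]
    # Constant-speed descent: may look plausible on screen, but ignores gravity.
    drifting = [-1.0 * (i * dt) for i in range(11)]
    print(newton_residual(falling, dt, force=-g, mass=1.0))
    print(newton_residual(drifting, dt, force=-g, mass=1.0))
```

The free-fall trajectory yields a residual near zero, while the constant-speed descent is off by roughly the full gravitational force, even though both show an object moving downward. A real check would need 3-D state, contact forces, and noise tolerance, none of which are modeled here.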
Open
- If geometry, physics, and dynamics are the shared substrate, why don't strong renderers automatically yield strong simulators in practice?
- What benchmark would distinguish structural fidelity from visual fidelity in current systems?
- Can a single model be trained end-to-end to serve all three roles, or are the contracts in tension?
Pipeline
- source kind:
- url:
- generated by: anthropic+voyage
- candidates: 24 (selected 5)
- embeddings: voyage-3.5
Coverage
100% covered
Each block is one paragraph of the source. Darker means the decomposition captures it well; lighter means it was left out — the part of the document the summary doesn’t cover.
Considered candidates (19)
Below top-k · 15
- claim: Planners output actions and close the perception-action loop (c 0.85)
Given an observation and a goal, a planner decides what to do next, the inverse of the renderer. VLAs, model-based systems, and the new World Action Models are all attempts at planners for unstructured environments.
- implication: The endpoint is a single unified world model (c 0.85)
The logical destination is one foundation model that renders photorealistic views, produces physically accurate structure, and plans action sequences, switching output modalities to fit the downstream consumer.
- context: "World model" has become one of AI's most overloaded terms (c 0.70)
Computer vision, robotics, reinforcement learning, and generative AI all claim to build world models, but each means something quite different. A video model producing impossible flames, a language model improvising a game, and a physics engine simulating combustion all share the label.
- caveat: Unifying the three creates internal tensions (c 0.70)
Renderers are awash in data while simulators and planners face acute shortages of 3D assets and robot demonstrations. Optimizing for visual beauty can sacrifice the precision a robot or high-fidelity simulation needs, and reconciling this inside one architecture is the defining open problem.
- implication: Three industries are converging into one bet on spatial intelligence (c 0.70)
Three threads, each already driving multi-billion-dollar industries, are beginning to behave like one. Their collapse will reshape the relationship between machine intelligence and the physical world.
- claim: World models learn the statistical structure of space and time (c 0.60)
Where language models capture the statistical structure of text, world models learn how light falls on surfaces, how scenes look from unseen angles, and how objects respond to physical forces. The physical world runs on a different substrate than language.
- caveat: Planner demos overstate what robots can actually do (c 0.60)
The impressive robotic demos of the last two years have almost all been confined to laboratory setups with narrow object sets and short task horizons. None have been validated at the complexity and duration real deployment demands.
- caveat: Simulation's hardest problems are data and physical fidelity (c 0.60)
3D data with explicit geometry and physical annotations is orders of magnitude scarcer than internet video. The sim-to-real gap persists, generative geometry can hide self-intersections or wrong scale, and multi-physics interactions remain enormously expensive.
- example: Marble outputs splats and collision meshes from one model (c 0.55)
World Labs' Marble takes multimodal prompts and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes a physics engine can operate on. It dissolves the boundary between renderer and simulator.
- context: State, observation, and action are distinct in the roboticist's sense (c 0.50)
State is the complete underlying reality of the world (every position, velocity, and property) and is never directly visible. Observations are an agent's partial view of state, and actions are what it does in response.
- caveat: Renderers carry no explicit 3D understanding (c 0.50)
A drone shot may look flawless from above, but try to drive through the city below and the buildings fall apart. The model produces appearance, not structure.
- context: Simulators serve two consumers at once (c 0.50)
Architects, designers, and filmmakers need accuracy beyond visual plausibility, while RL agents, robot controllers, and autonomous vehicles use simulators as training grounds for scenarios too dangerous or expensive to run in reality.
- evidence: Simulation's commercial surface is enormous (c 0.45)
NVIDIA's Omniverse alone targets what the company estimates as more than a trillion dollars of addressable market across factories, warehouses, digital twins, and supply chains. Robotics training, AV testing, architecture, engineering, and drug discovery all depend on something simulation-shaped.
- example: Video renderers can be repurposed as planning backbones (c 0.40)
Recent robotics work has shown that a pretrained video renderer can serve as the backbone for joint world-and-action prediction, letting one model imagine what will happen and what to do next.
- context: Commercial bets on planners are large despite the gap (c 0.35)
Well-funded entrants are racing to ship general-purpose planning systems while infrastructure players position planning atop broader simulation stacks. A robot that can plan is a robot that can work.
Redundant with selected · 4
- claim: Geometry, physics, and dynamics are the world itself, not an abstraction of it (c 0.85 · sim 0.88)
Language is an abstraction of the world and pixels are a projection of it, but geometry, physics, and dynamics are the world. A simulator must work at that level: the structural backbone from which both appearance and action consequences can be derived.
overlapped with: Simulators output state with structural fidelity
- claim: The three categories are starting to blend (c 0.85 · sim 0.82)
The most important pattern in the field right now is convergence. Renderers are becoming action-conditioned, simulators are becoming controllable and editable, and planners are deliberating rather than reacting.
overlapped with: The three categories share one underlying knowledge of the world
- implication: Simulation mastery generalizes downward to rendering and planning (c 0.80 · sim 0.86)
A model that masters simulation can project its understanding into pixels for humans and into action predictions for embodied agents. A model that masters only rendering or only planning cannot do either.
overlapped with: The three categories share one underlying knowledge of the world
- evidence: Renderers are commercially mature but bounded by their visual contract (c 0.55 · sim 0.85)
Image- and text-to-video products are scaling rapidly, and Nano Banana has put renderer-quality generation in the hands of hundreds of millions of users. But these systems optimize for plausibility, not accuracy: their outputs are beautiful but cannot be trusted to design a building or train a robot.
overlapped with: Renderers output observations optimized for human eyes
Janitor
Non-content spans (acknowledgements, references, footnotes, headers, boilerplate) are dropped before the decomposition runs.
- total spans: 28
- kept: 26
- dropped: 2
- content · 26
- noise · 2