Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

University of Freiburg, Germany

Abstract

Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We also test whether discrete token models have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less sensitive to individual design choices and more powerful than the model built on discrete tokens.

Method

Image Tokenizer: The tokenizer provides two representations, one semantic and one capturing detail. The two are concatenated and fed to the image decoder and, later, to the world model. During the fine-tuning phase, the decoder randomly receives either continuous or discrete tokens.
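
A minimal sketch of how the decoder input could be assembled, under stated assumptions: the discrete path is modeled here as nearest-neighbor lookup in a codebook, and the switch between continuous and discrete tokens is a coin flip with probability `p_discrete`. The names `quantize`, `decoder_input`, `p_discrete`, and the toy codebooks are hypothetical and not taken from the Orbis code.

```python
import random
from typing import List, Sequence

Vector = List[float]


def quantize(vec: Sequence[float], codebook: Sequence[Vector]) -> Vector:
    """Return the codebook entry closest to vec (squared Euclidean distance).

    Assumption: nearest-neighbor quantization stands in for the discrete path.
    """
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)))


def decoder_input(semantic: Sequence[float], detail: Sequence[float],
                  sem_codebook: Sequence[Vector], det_codebook: Sequence[Vector],
                  rng: random.Random, p_discrete: float = 0.5) -> Vector:
    """Concatenate semantic and detail tokens, quantizing both at random."""
    if rng.random() < p_discrete:  # hypothetical switch between the two modes
        semantic = quantize(semantic, sem_codebook)
        detail = quantize(detail, det_codebook)
    return list(semantic) + list(detail)


# Toy codebooks and features, for illustration only.
sem_codebook = [[0.0, 0.0], [1.0, 1.0]]
det_codebook = [[0.0], [2.0]]
rng = random.Random(0)

continuous = decoder_input([0.9, 0.8], [0.4], sem_codebook, det_codebook, rng, p_discrete=0.0)
discrete = decoder_input([0.9, 0.8], [0.4], sem_codebook, det_codebook, rng, p_discrete=1.0)
```

Setting `p_discrete` to 0.0 or 1.0 forces one mode, which makes the two paths easy to compare on the same input.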


World Model: To generate the next frame, the model receives the encoded context frames together with a target frame initialized as either sampled Gaussian noise or fully masked tokens. It then progressively denoises or unmasks the target frame, repeating this iterative sampling step until the target frame is generated.
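
For the continuous variant, the iterative sampling can be sketched as numerical integration of a velocity field from noise to data, as is common for flow matching. This is an illustration under assumptions, not the authors' sampler: the fixed-step Euler integrator, the function names, and the toy velocity field (which simply pulls the sample toward the last context frame) are all hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_frame(velocity: Callable[[Vector, float, List[Vector]], Vector],
                 context: List[Vector], dim: int, num_steps: int,
                 rng: random.Random) -> Vector:
    """Denoise a target frame with fixed-step Euler integration from t=0 to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t, context)  # a trained network would be called here
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, context: List[Vector]) -> Vector:
    """Hypothetical stand-in for the model: straight-line flow toward the last context frame."""
    target = context[-1]
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


target_frame = [0.5, -0.25]
sample = sample_frame(toy_velocity, context=[target_frame], dim=2,
                      num_steps=10, rng=random.Random(0))
```

With this toy field the sample lands on the target after the final step. The discrete variant would follow the same loop structure but replace the Euler update with progressive unmasking of tokens.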


Inference Rollout: During inference, the world model generates the next frame autoregressively. This process repeats for the desired number of frames in the rollout sequence.
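
The rollout loop can be sketched as follows. This is a minimal illustration, assuming a fixed-length sliding context window; the window policy, the `rollout` signature, and the toy predictor are assumptions and are not described in the text above.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an encoded frame (hypothetical representation)


def rollout(predict_next: Callable[[Sequence[Frame]], Frame],
            context: Sequence[Frame], num_frames: int,
            context_len: int) -> List[Frame]:
    """Generate num_frames frames one by one, feeding each prediction back as context."""
    window = deque(context, maxlen=context_len)  # oldest frames drop out automatically
    generated: List[Frame] = []
    for _ in range(num_frames):
        nxt = predict_next(list(window))
        generated.append(nxt)
        window.append(nxt)  # the new frame becomes context for the next step
    return generated


def toy_predictor(ctx: Sequence[Frame]) -> Frame:
    """Hypothetical stand-in for the world model: last frame plus one."""
    return [v + 1.0 for v in ctx[-1]]


frames = rollout(toy_predictor, context=[[0.0, 0.0]], num_frames=3, context_len=2)
```

Because every generated frame re-enters the context, errors can compound over the rollout, which is why long-horizon stability is the central concern here.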

Turning Scenes

Turning scenes are challenging because they require generating rapidly changing new content. While other methods often fail on such scenes (see below), Orbis generates realistic scenarios even after sharp turns, unlocking long-horizon generation.

Urban Driving

Urban driving requires realistic generation of other agents and of the ego vehicle's interactions with them and with the surrounding environment.

Diverse Generation

Orbis generates diverse yet realistic scenarios for different random seeds.

Comparison to the state of the art

In challenging scenarios, other approaches often generate unrealistic videos or trajectories. Orbis handles sharp turns along realistic trajectories and continues driving after them.

Compared panels: estimated trajectories, Orbis (ours), GEM, Vista, Cosmos.

BibTeX

@article{orbis2025,
  author    = {Mousakhan, Arian and Mittal, Sudhanshu and Galesso, Silvio and Farid, Karim and Brox, Thomas},
  title     = {Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models},
  journal   = {},
  year      = {2025},
}