ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling

A training-free method for time-synchronized multi-view video generation

Beihang University

Abstract

Although prevailing camera-controlled video generation models can produce cinematic results, directly lifting them to generate 3D-consistent, high-fidelity, time-synchronized multi-view videos remains challenging, and this capability is pivotal for taming 4D worlds. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method comprising a World State Hyperspace, which represents the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling, which uses the hyperspace to synchronize the diffusion sampling trajectories of multiple views. Experimental results demonstrate that our method generates high-fidelity, 3D-consistent, time-synchronized multi-view videos without any training or fine-tuning of the diffusion models.
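The abstract describes Hyperspace Guided Sampling only at a conceptual level. Below is a minimal sketch of what synchronizing several diffusion sampling trajectories through a shared state could look like, assuming a generic per-view denoising step. All names here (`denoise_step`, `anchor`, the guidance weight `w`) are hypothetical illustrations rather than the authors' implementation, and the cross-view mean of predicted clean latents merely stands in for the World State Hyperspace.

```python
import torch

def hyperspace_guided_sampling(views_x_T, denoise_step, num_steps, w=0.3):
    """Sketch: synchronize per-view diffusion trajectories via a shared anchor.

    views_x_T    -- list of per-view noisy latents, each of shape (C, T, H, W)
    denoise_step -- callable (x_t, t) -> (x_{t-1}, x0_pred) for a single view
    w            -- guidance strength pulling each view toward the shared state
    """
    xs = [x.clone() for x in views_x_T]
    for t in reversed(range(num_steps)):
        stepped, x0_preds = [], []
        for x in xs:  # denoise every view independently
            x_prev, x0 = denoise_step(x, t)
            stepped.append(x_prev)
            x0_preds.append(x0)
        # Stand-in for the World State Hyperspace: the cross-view mean of the
        # predicted clean latents acts as a shared spatiotemporal anchor.
        anchor = torch.stack(x0_preds).mean(dim=0)
        # Nudge each trajectory toward the anchor (training-free guidance).
        xs = [xp + w * (anchor - x0) for xp, x0 in zip(stepped, x0_preds)]
    return xs

# Toy usage with a dummy denoiser (a pure noise-shrinking stand-in).
def dummy_step(x, t):
    x0_pred = 0.9 * x  # fake clean-latent estimate
    return x0_pred + 0.05 * torch.randn_like(x), x0_pred

views = [torch.randn(4, 8, 16, 16) for _ in range(3)]  # 3 camera views
synced = hyperspace_guided_sampling(views, dummy_step, num_steps=10)
```

The key design point the sketch illustrates is that synchronization happens inside the sampling loop, at every denoising step, rather than as a post-hoc optimization over finished videos; this is what makes the approach training-free.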

Comparison Results

Qualitative Comparisons

Ablations

Side-by-side video comparisons of each ablated variant against the full method: w/o I.S.H. vs. Ours, w/o H.G.S. vs. Ours, and w/o I.S.H. and w/o H.G.S. vs. Ours.

More Visualization Results

Citation

@article{wang2025chronosobserver,
  title={ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling},
  author={Wang, Qisen and Zhao, Yifan and Shen, Peisen and Li, Jialu and Li, Jia},
  journal={arXiv preprint arXiv:2512.01481},
  year={2025}
}