Abstract
Although prevailing camera-controlled video generation models can produce cinematic results, directly extending them to generate 3D-consistent, high-fidelity, time-synchronized multi-view videos remains challenging, and this capability is pivotal for taming 4D worlds. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To address these limitations, we propose ChronosObserver, a training-free method comprising a World State Hyperspace, which represents the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling, which uses this hyperspace to synchronize the diffusion sampling trajectories of multiple views. Experimental results demonstrate that our method achieves high-fidelity, 3D-consistent, time-synchronized multi-view video generation without any training or fine-tuning of the diffusion models.