AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
TL;DR: We propose AdaViewPlanner, a framework that adapts pre-trained text-to-video models for automatic viewpoint planning in 4D scenes. Given 4D content and text prompts describing the scene context and desired camera motion, our model can generate coordinate-aligned camera pose sequences along with corresponding video visualizations. Leveraging the priors of video generation models, AdaViewPlanner demonstrates strong capability for smooth, diverse, instruction-following, and human-centric viewpoint planning in 4D scenes.

Demos
(Left) 4D human motion input in canonical space; (Middle) generated cinematic video in which the human pose follows the 4D motion condition and the camera viewpoints and trajectories are planned by the model; (Right) generated coordinate-aligned camera pose sequence.
Effect of Random Seeds and Prompts on Camera Generation
Random Seed
Scene Context Prompt
Camera Movement Prompt
Comparisons

Compared with other methods (E.T. [1], DanceCam* [2]), our model generates smoother trajectories that better follow instructions, while also exhibiting a cinematographic style centered on human actions.
[1] Courant, Robin, et al. "E.T. the Exceptional Trajectories: Text-to-Camera-Trajectory Generation with Character Awareness." European Conference on Computer Vision (ECCV), 2024.
[2] Wang, Zixuan, et al. "DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Additional results

The results demonstrate the diversity of the generated trajectories and the model’s ability to follow camera text instructions.
Ablation results

Columns 1–4 illustrate the reprojection of 4D human skeletons using the estimated camera parameters, while Column 5 displays the corresponding rendered results in 3D space. We present two ablation variants. w/o Motion refers to the removal of the motion condition in Stage II. Without skeletal references, the model generates smooth but misfocused trajectories, resulting in erroneous viewpoints. Relative Cam denotes a variant trained without motion conditioning that estimates only relative camera poses, requiring post-processing to align camera and human motion coordinates. While this approach alleviates the viewpoint misalignment, it lacks motion scale awareness, ultimately causing noticeable temporal and spatial inconsistencies.
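For reference, the reprojection shown in Columns 1–4 amounts to a standard pinhole projection of the 4D skeleton joints with the estimated per-frame world-to-camera extrinsics. The sketch below is a minimal illustration of that step; the function name, intrinsics, and toy values are assumptions for illustration, not taken from our code.

import numpy as np

def reproject_skeleton(joints_world, R, t, K):
    """Project 3D skeleton joints (world frame) into the image plane.

    joints_world: (J, 3) joint positions in the shared world/canonical frame.
    R, t:         (3, 3) rotation and (3,) translation of the estimated
                  world-to-camera extrinsic for one frame.
    K:            (3, 3) camera intrinsic matrix.
    Returns (J, 2) pixel coordinates.
    """
    joints_cam = joints_world @ R.T + t   # world -> camera coordinates
    uvw = joints_cam @ K.T                # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]       # perspective division

# Toy example: one frame, 24 joints.
K = np.array([[1000.0, 0.0, 512.0],
              [0.0, 1000.0, 512.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 3.0])   # camera 3 m in front of the origin
joints = np.random.randn(24, 3) * 0.5         # dummy skeleton
uv = reproject_skeleton(joints, R, t, K)
print(uv.shape)  # (24, 2)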
Abstract
Recent Text-to-Video (T2V) models have demonstrated a powerful capability for visually simulating real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning in given 4D scenes, since videos inherently pair dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm that adapts pre-trained T2V models for viewpoint prediction in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is introduced on top of the pre-trained T2V model, taking the generated video and the 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work demonstrates the potential of video generation models for 4D interaction in the real world.
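To give a rough sense of the camera extrinsic denoising process before the method details, the sketch below shows a DDIM-style sampler over per-frame extrinsics conditioned on video and motion tokens. The pose parameterization, noise schedule, and the `denoiser` callable are illustrative assumptions, not the paper's implementation.

import torch

@torch.no_grad()
def sample_camera_extrinsics(denoiser, video_tokens, motion_tokens,
                             num_frames=81, steps=50):
    """Toy DDIM-style sampler over per-frame camera extrinsics.

    Each frame's pose is parameterized here as 12 numbers (a flattened
    3x4 [R|t] matrix); `denoiser` stands for any network that predicts the
    clean poses from the noisy ones plus the hybrid conditions
    (video tokens and 4D motion tokens). Illustrative sketch only.
    """
    B = video_tokens.shape[0]
    x = torch.randn(B, num_frames, 12)              # start from pure noise
    a_bar = torch.linspace(1e-3, 0.999, steps + 1)  # toy alpha-bar schedule
    for i in range(steps):                          # noisy -> clean
        a_t, a_next = a_bar[i], a_bar[i + 1]
        x0_pred = denoiser(x, a_t, video_tokens, motion_tokens)
        eps = (x - a_t.sqrt() * x0_pred) / (1.0 - a_t).sqrt()
        x = a_next.sqrt() * x0_pred + (1.0 - a_next).sqrt() * eps
    return x.view(B, num_frames, 3, 4)              # per-frame [R|t]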
Method
Trained on vast amounts of film footage, video generation models can synthesize diverse dynamic scenes with rich cinematic skill. Based on this observation, we aim to leverage this capability by repurposing these models as virtual cinematographers that design professional camera trajectories for given 4D scenes. For simplicity, we explore and validate this concept by considering only a moving human in 4D scenes, which serves as the primary context of interest in applications. An overview of the model is depicted below.

(a) Stage I model for motion-conditioned cinematic video generation: a pose encoder processes human motion data M from the 4D scene and integrates it with video tokens via spatial motion attention, producing videos with cinematic camera movements; camera parameters used for guidance are denoted C. (b) Stage II model: video, camera, and human motion branches are combined in an MMDiT framework to extract the camera pose sequence.
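To make the Stage I fusion concrete, here is a minimal sketch of how motion tokens could be injected into the video stream via cross-attention with a zero-initialized output projection, so that the pre-trained T2V weights are untouched at initialization. Module and argument names are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F
from torch import nn

class SpatialMotionAttention(nn.Module):
    """Illustrative fusion of motion tokens with video tokens via cross-attention.

    Minimal sketch of the Stage I idea described in (a), not the authors'
    implementation: video tokens attend to encoded human-motion tokens, and the
    result is added back through a zero-initialized projection so the adapted
    model initially reproduces the base T2V model's output.
    """
    def __init__(self, dim: int, motion_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(motion_dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # adaptive branch starts as a no-op
        nn.init.zeros_(self.proj.bias)

    def forward(self, video_tokens, motion_tokens):
        # video_tokens: (B, N_video, dim), motion_tokens: (B, N_motion, motion_dim)
        B, N, D = video_tokens.shape
        q = self.to_q(video_tokens)
        k, v = self.to_kv(motion_tokens).chunk(2, dim=-1)

        def split(x):  # (B, tokens, D) -> (B, heads, tokens, D // heads)
            return x.view(B, -1, self.heads, D // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(B, N, D)
        return video_tokens + self.proj(out)   # residual injection

A module like this would sit alongside the frozen T2V attention blocks; because the output projection starts at zero, training only gradually blends the motion condition into the video tokens.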
BibTeX
@article{li2025adaviewplanner,
  title={AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes},
  author={Yu Li and Menghan Xia and Gongye Liu and Jianhong Bai and Xintao Wang and Conglang Zhang and Yuxuan Lin and Ruihang Chu and Pengfei Wan and Yujiu Yang},
  journal={arXiv preprint arXiv:2510.10670},
  year={2025}
}