Overview of NAS3R. Unconstrained images are patchified into visual tokens and concatenated with a learnable camera token for camera prediction. A masked decoder regulates cross-view interactions and prevents target-to-context leakage. Refined context tokens are then processed by a Gaussian head to predict Gaussian parameters, while a depth head estimates depth maps that are lifted into 3D Gaussian centers using the predicted context poses. The predicted target poses are finally used to render novel views, providing photometric supervision for end-to-end training.
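The "lifting" step above is standard pinhole unprojection: each pixel's ray is scaled by its predicted depth and transformed into the world frame by the predicted camera pose. A minimal sketch of that step is below; the function name, the use of explicit intrinsics `K`, and a 4x4 camera-to-world matrix are assumptions for illustration, not NAS3R's actual interface.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-frame 3D points, which a model
    like NAS3R would use as Gaussian centers.

    depth        : (H, W) per-pixel depth along the camera z-axis
    K            : (3, 3) camera intrinsics (assumed known here)
    cam_to_world : (4, 4) predicted camera-to-world pose
    returns      : (H, W, 3) 3D points in the world frame
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates: u is the column, v the row.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    # Back-project pixels to camera-frame rays, then scale by depth.
    rays = np.linalg.inv(K) @ pix
    pts_cam = rays * depth.reshape(1, -1)
    # Transform camera-frame points into the world frame.
    pts_h = np.vstack([pts_cam, np.ones((1, H * W))])
    pts_world = (cam_to_world @ pts_h)[:3].T
    return pts_world.reshape(H, W, 3)
```

With identity intrinsics and pose, a unit-depth map maps pixel (u, v) to the point (u, v, 1), which is a quick sanity check for the ray geometry.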
@article{huang2026nas3r,
  title   = {From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis},
  author  = {Ranran Huang and Weixun Luo and Ye Mao and Krystian Mikolajczyk},
  journal = {arXiv preprint arXiv:2603.27455},
  year    = {2026}
}