No Pose at All

Self-Supervised Pose-Free 3D Gaussian Splatting
from Sparse Views

Imperial College London
ICCV 2025 Highlight

Qualitative comparison on DTU.

Qualitative comparison on RE10K.

Qualitative comparison on ACID.

Abstract

We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images that requires no ground-truth camera poses during either training or inference. It simultaneously predicts Gaussians and camera poses in a canonical space from unposed images in a single feed-forward step. During training, the estimated target-view poses are used to enforce a rendering loss against ground-truth images, while the estimated input-view poses enforce pixel-aligned Gaussian representations via a reprojection loss. This pose-free training paradigm and efficient one-step feed-forward inference make SPFSplat well-suited for practical applications. Despite the absence of pose supervision, our self-supervised SPFSplat achieves state-of-the-art (SOTA) performance in novel view synthesis (NVS), even under significant viewpoint changes. Furthermore, it surpasses recent methods trained with geometry priors in relative pose estimation, demonstrating its effectiveness in both 3D scene reconstruction and camera pose learning.
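Read as an objective, this training signal has two terms; the form below is our own hedged paraphrase of the description above (the exact distance functions and weight λ are not stated here):

```latex
\mathcal{L} \;=\;
\underbrace{\mathcal{L}_{\mathrm{render}}\!\left(\hat{I}\!\left(\mathcal{G},\,\hat{T}_{\mathrm{tgt}}\right),\, I_{\mathrm{tgt}}\right)}_{\text{rendering loss with estimated target poses}}
\;+\;
\lambda \,\underbrace{\sum_{v}\sum_{p}\left\lVert \pi\!\left(K_{v}\,\hat{T}_{v}\,\mu_{p}^{v}\right) - p \right\rVert_{1}}_{\text{reprojection loss with estimated input-view poses}}
```

where \(\mathcal{G}\) are the predicted Gaussians, \(\mu_{p}^{v}\) is the Gaussian center predicted at pixel \(p\) of input view \(v\) (expressed in the canonical frame of the first view), \(\hat{T}\) are the estimated poses, \(K_{v}\) the intrinsics, and \(\pi\) perspective projection.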

Methodology

SPFSplat consists of four main components: an encoder, a decoder, a pose head, and Gaussian prediction heads. These specialized heads are integrated into a shared ViT backbone, simultaneously predicting Gaussian centers, additional Gaussian parameters, and camera poses from unposed images in a canonical space, where the first input view serves as the reference. Only the context-only branch (above) is used during inference, while the context-with-target branch (below) is employed exclusively during training to estimate target poses, which are used for rendering loss supervision. Additionally, estimated context poses from both branches enforce alignment between Gaussian centers and their corresponding pixels via a reprojection loss. 3D Gaussians and poses are jointly optimized to improve geometric consistency and reconstruction quality.
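To make the reprojection constraint concrete, the following is a minimal, self-contained sketch of how such a loss could be computed. It assumes per-pixel Gaussian centers, shared pinhole intrinsics, and estimated world-to-camera poses; the shapes and names are our illustration, not the released implementation.

```python
import torch

def reprojection_loss(centers, w2c_est, K, eps=1e-6):
    """centers: (V, H, W, 3) Gaussian centers in the canonical (first-view) frame.
    w2c_est: (V, 4, 4) estimated world-to-camera transforms for the input views.
    K: (3, 3) shared pinhole intrinsics."""
    V, H, W, _ = centers.shape
    ones = torch.ones(V, H, W, 1, dtype=centers.dtype, device=centers.device)
    pts_h = torch.cat([centers, ones], dim=-1)                     # homogeneous points
    cam = torch.einsum('vij,vhwj->vhwi', w2c_est, pts_h)[..., :3]  # into each camera frame
    uv = torch.einsum('ij,vhwj->vhwi', K, cam)                     # pinhole projection
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=eps)
    # target: the pixel each Gaussian center was predicted from
    ys, xs = torch.meshgrid(torch.arange(H, device=centers.device),
                            torch.arange(W, device=centers.device), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).to(centers.dtype)
    return (uv - grid).abs().mean()                                # L1 reprojection error
```

Minimizing this term alongside the target-view rendering loss is what couples the predicted Gaussian centers to the estimated input-view poses.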

Quantitative Results

| Method | Small (PSNR↑ / SSIM↑ / LPIPS↓) | Medium (PSNR↑ / SSIM↑ / LPIPS↓) | Large (PSNR↑ / SSIM↑ / LPIPS↓) | Average (PSNR↑ / SSIM↑ / LPIPS↓) | Time (s) |
|---|---|---|---|---|---|
| Pose-Required | | | | | |
| pixelSplat | 20.277 / 0.719 / 0.265 | 23.726 / 0.811 / 0.180 | 27.152 / 0.880 / 0.121 | 23.859 / 0.808 / 0.184 | 0.152 |
| MVSplat | 20.371 / 0.725 / 0.250 | 23.808 / 0.814 / 0.172 | 27.466 / 0.885 / 0.115 | 24.012 / 0.812 / 0.175 | 0.059 |
| Supervised Pose-Free | | | | | |
| CoPoNeRF | 17.393 / 0.585 / 0.462 | 18.813 / 0.616 / 0.392 | 20.464 / 0.652 / 0.318 | 18.938 / 0.619 / 0.388 | - |
| Splatt3R | 17.789 / 0.582 / 0.375 | 18.828 / 0.607 / 0.330 | 19.243 / 0.593 / 0.317 | 18.688 / 0.337 / 0.596 | 0.042 |
| NoPoSplat* | 22.514 / 0.784 / 0.210 | 24.899 / 0.839 / 0.160 | 27.411 / 0.883 / 0.119 | 25.033 / 0.838 / 0.160 | 0.042 |
| Self-supervised Pose-Free | | | | | |
| SelfSplat | 14.828 / 0.543 / 0.469 | 18.857 / 0.679 / 0.328 | 23.338 / 0.798 / 0.208 | 19.152 / 0.680 / 0.328 | 0.101 |
| PF3plat | 18.358 / 0.668 / 0.298 | 20.953 / 0.741 / 0.231 | 23.491 / 0.795 / 0.179 | 21.042 / 0.739 / 0.233 | 0.848 |
| SPFSplat | 22.897 / 0.792 / 0.201 | 25.334 / 0.847 / 0.153 | 27.947 / 0.894 / 0.110 | 25.484 / 0.847 / 0.153 | 0.044 |
| SPFSplat* | 23.178 / 0.796 / 0.200 | 25.695 / 0.853 / 0.151 | 28.377 / 0.899 / 0.111 | 25.845 / 0.852 / 0.152 | 0.044 |

Table 1. Performance comparison of NVS on RE10K. The best and second-best results are highlighted. * denotes the evaluation-time pose alignment (EPA) strategy.

Only the time for 3D Gaussian reconstruction is reported.
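For reference, evaluation-time pose alignment is commonly realized as a brief test-time optimization of the target pose under a photometric loss, so that NVS metrics are not dominated by small pose errors. The sketch below shows one such form under our own assumptions; the paper's exact EPA procedure, optimizer, and step count may differ, and `render` stands in for any differentiable Gaussian rasterizer.

```python
import torch
import torch.nn.functional as F

def align_target_pose(render, gaussians, pose_init, target_img, steps=100, lr=1e-3):
    """Refine an estimated target pose by photometric alignment before evaluation.
    `render(gaussians, pose)` must be differentiable w.r.t. the pose parameters."""
    pose = pose_init.clone().requires_grad_(True)   # e.g. an se(3) or quaternion+translation vector
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.l1_loss(render(gaussians, pose), target_img)
        loss.backward()
        opt.step()
    return pose.detach()
```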

| Method | Small (PSNR↑ / SSIM↑ / LPIPS↓) | Medium (PSNR↑ / SSIM↑ / LPIPS↓) | Large (PSNR↑ / SSIM↑ / LPIPS↓) | Average (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|---|---|
| Pose-Required | | | | |
| pixelSplat | 22.088 / 0.655 / 0.284 | 25.525 / 0.777 / 0.197 | 28.527 / 0.854 / 0.139 | 25.889 / 0.780 / 0.194 |
| MVSplat | 21.412 / 0.640 / 0.290 | 25.150 / 0.772 / 0.198 | 28.457 / 0.854 / 0.137 | 25.561 / 0.775 / 0.195 |
| Supervised Pose-Free | | | | |
| CoPoNeRF | 18.651 / 0.551 / 0.485 | 20.654 / 0.595 / 0.418 | 22.654 / 0.652 / 0.343 | 20.950 / 0.606 / 0.406 |
| Splatt3R | 17.419 / 0.501 / 0.434 | 18.257 / 0.514 / 0.405 | 18.134 / 0.508 / 0.395 | 18.060 / 0.510 / 0.407 |
| NoPoSplat* | 23.087 / 0.685 / 0.258 | 25.624 / 0.777 / 0.193 | 28.043 / 0.841 / 0.144 | 25.961 / 0.781 / 0.189 |
| Self-supervised Pose-Free | | | | |
| SelfSplat | 18.301 / 0.568 / 0.408 | 21.375 / 0.676 / 0.314 | 25.219 / 0.792 / 0.214 | 22.089 / 0.694 / 0.298 |
| PF3plat | 18.112 / 0.537 / 0.376 | 20.732 / 0.615 / 0.307 | 23.607 / 0.710 / 0.228 | 21.206 / 0.632 / 0.293 |
| SPFSplat | 22.667 / 0.665 / 0.262 | 25.620 / 0.773 / 0.192 | 28.607 / 0.856 / 0.136 | 26.070 / 0.781 / 0.186 |
| SPFSplat* | 23.676 / 0.708 / 0.243 | 26.351 / 0.801 / 0.182 | 29.170 / 0.870 / 0.131 | 26.796 / 0.807 / 0.176 |

Table 2. Performance comparison of NVS on the ACID dataset. The best and second-best results are highlighted. * denotes the evaluation-time pose alignment (EPA) strategy.

| Method | ACID (PSNR↑ / SSIM↑ / LPIPS↓) | DTU (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|
| pixelSplat | 25.477 / 0.770 / 0.207 | 15.067 / 0.539 / 0.341 |
| MVSplat | 25.525 / 0.773 / 0.199 | 14.542 / 0.537 / 0.324 |
| NoPoSplat* | 25.764 / 0.776 / 0.199 | 17.899 / 0.629 / 0.279 |
| SelfSplat | 22.204 / 0.686 / 0.316 | 13.249 / 0.434 / 0.441 |
| PF3plat | 20.726 / 0.610 / 0.308 | 12.972 / 0.407 / 0.464 |
| SPFSplat | 25.965 / 0.781 / 0.190 | 16.550 / 0.579 / 0.270 |
| SPFSplat* | 26.697 / 0.806 / 0.181 | 18.297 / 0.660 / 0.255 |

Table 3. Cross-dataset generalization on the NVS task. All methods are trained on RE10K and evaluated in a zero-shot setting on ACID and DTU. The best and second-best results are highlighted. * denotes the evaluation-time pose alignment (EPA) strategy.

| Method | RE10K (5°↑ / 10°↑ / 20°↑) | ACID (5°↑ / 10°↑ / 20°↑) |
|---|---|---|
| SP + SG | 0.234 / 0.406 / 0.569 | 0.228 / 0.363 / 0.500 |
| DUSt3R | 0.336 / 0.541 / 0.702 | 0.118 / 0.279 / 0.470 |
| MASt3R | 0.281 / 0.494 / 0.671 | 0.138 / 0.312 / 0.507 |
| NoPoSplat | 0.571 / 0.728 / 0.833 | 0.335 / 0.497 / 0.645 |
| SelfSplat | 0.207 / 0.392 / 0.576 | 0.205 / 0.363 / 0.531 |
| PF3plat | 0.187 / 0.398 / 0.636 | 0.203 / 0.353 / 0.541 |
| SPFSplat (PnP) | 0.613 / 0.754 / 0.845 | 0.355 / 0.516 / 0.658 |
| SPFSplat | 0.617 / 0.755 / 0.845 | 0.364 / 0.520 / 0.662 |

Table 4. Pose estimation performance on RE10K and ACID. For all splat-based methods, evaluation on ACID uses the model trained only on RE10K. The best and second-best results are highlighted.
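The "SPFSplat (PnP)" row recovers the relative pose without relying on the pose head's output directly at test time. One plausible reading, sketched below under our own assumptions, is to solve PnP-RANSAC between the second view's pixel-aligned Gaussian centers (3D points expressed in the first view's canonical frame) and their own pixel coordinates; the function and variable names are illustrative.

```python
import cv2
import numpy as np

def relative_pose_from_pnp(centers_view2, K):
    """centers_view2: (H, W, 3) predicted Gaussian centers of view 2, in view 1's frame.
    K: (3, 3) pinhole intrinsics. Returns the 4x4 view-1-to-view-2 transform."""
    H, W, _ = centers_view2.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))                 # each center's own pixel
    pix = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)
    pts = centers_view2.reshape(-1, 3).astype(np.float64)
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts, pix, K.astype(np.float64), None,
                                           flags=cv2.SOLVEPNP_ITERATIVE)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)                                       # rotation vector -> matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T
```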

Qualitative Results

Qualitative comparison on RE10K (top three rows) and ACID (bottom row).

Qualitative comparison on RE10K → ACID (top two rows) and RE10K → DTU (bottom two rows).

Qualitative comparisons of 3D Gaussians and rendered results.

BibTeX

@inproceedings{huang2025spfsplat,
  title={No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views},
  author={Huang, Ranran and Mikolajczyk, Krystian},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}