SPFSplatV2

Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Imperial College London

Qualitative comparison on ACID.

Qualitative comparison on RE10K.

Qualitative comparison on RE10K -> ACID.

Qualitative comparison on RE10K -> DL3DV.

Qualitative comparison on RE10K -> ScanNet++.

Qualitative comparison on RE10K -> DTU.

Abstract

We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training and inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. A masked attention mechanism is introduced to efficiently estimate target poses during training, while a reprojection loss enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap, and surpasses recent methods that rely on geometric supervision for relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets.

Methodology

SPFSplatV2 employs a shared backbone with three specialized heads that simultaneously predict Gaussian centers, additional Gaussian parameters, and camera poses from unposed images in a canonical space, with the first input view as the reference. Encoder tokens, concatenated with a learnable pose token and an optional embedding of ground-truth intrinsics, are fed into the decoder, which employs masked attention to prevent context tokens from attending to target tokens, ensuring that Gaussian reconstruction remains independent of target-view information. The 3D Gaussians are optimized via a rendering loss using the predicted target poses, while a reprojection loss enforces alignment between Gaussian centers and their corresponding pixels using the predicted context poses. By jointly optimizing Gaussians and camera poses, the pipeline enhances geometric consistency and improves reconstruction quality.
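The two mechanisms described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The token ordering (context tokens first, then target tokens), the choice to let target tokens attend to every token, the world-to-camera pose convention, the pinhole intrinsics, and the mean pixel-distance form of the loss are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def build_attention_mask(num_context_tokens, num_target_tokens):
    """Return allowed[q][k]: True if query token q may attend to key token k.

    Assumed layout: context tokens come first, target tokens last.
    Context queries are blocked from target keys; everything else is allowed
    (letting target tokens see all tokens is an assumption of this sketch).
    """
    total = num_context_tokens + num_target_tokens
    allowed = []
    for q in range(total):
        query_is_context = q < num_context_tokens
        row = []
        for k in range(total):
            key_is_target = k >= num_context_tokens
            row.append(not (query_is_context and key_is_target))
        allowed.append(row)
    return allowed


def project(point, rotation, translation, fx, fy, cx, cy):
    """Project a canonical-space 3D point with an assumed world-to-camera pose."""
    cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    x, y, z = cam
    # Pinhole projection; assumes the point lies in front of the camera (z > 0).
    return fx * x / z + cx, fy * y / z + cy


def reprojection_loss(centers, pixels, rotation, translation, fx, fy, cx, cy):
    """Mean pixel distance between projected Gaussian centers and their pixels."""
    total = 0.0
    for center, (u, v) in zip(centers, pixels):
        pu, pv = project(center, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(centers)
```

With two context tokens and one target token, `build_attention_mask(2, 1)` blocks only the entries where a context query meets the target key, so the Gaussian branch never sees target-view information. The loss is zero exactly when every Gaussian center projects onto its own pixel under the given context pose.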

Quantitative Results

Method          Small                     Medium                    Large                     Average
                PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓
Pose-Required
pixelSplat      20.277  0.719   0.265     23.726  0.811   0.180     27.152  0.880   0.121     23.859  0.808   0.184
MVSplat         20.371  0.725   0.250     23.808  0.814   0.172     27.466  0.885   0.115     24.012  0.812   0.175
Supervised Pose-Free
CoPoNeRF        17.393  0.585   0.462     18.813  0.616   0.392     20.464  0.652   0.318     18.938  0.619   0.388
Splatt3R        17.789  0.582   0.375     18.828  0.607   0.330     19.243  0.593   0.317     18.688  0.337   0.596
NoPoSplat*      22.514  0.784   0.210     24.899  0.839   0.160     27.411  0.883   0.119     25.033  0.838   0.160
Self-Supervised Pose-Free
SelfSplat       14.828  0.543   0.469     18.857  0.679   0.328     23.338  0.798   0.208     19.152  0.680   0.328
PF3plat         18.358  0.668   0.298     20.953  0.741   0.231     23.491  0.795   0.179     21.042  0.739   0.233
SPFSplat        22.897  0.792   0.201     25.334  0.847   0.153     27.947  0.894   0.110     25.484  0.847   0.153
SPFSplat*       23.178  0.796   0.200     25.695  0.853   0.151     28.377  0.899   0.111     25.845  0.852   0.152
SPFSplatV2      23.123  0.800   0.195     25.542  0.853   0.149     28.143  0.897   0.110     25.693  0.853   0.149
SPFSplatV2*     23.456  0.806   0.193     26.030  0.862   0.145     28.682  0.905   0.107     26.157  0.861   0.146
SPFSplatV2-L    23.138  0.804   0.184     25.518  0.856   0.136     28.081  0.899   0.099     25.668  0.855   0.137
SPFSplatV2-L*   23.329  0.804   0.183     25.863  0.861   0.134     28.456  0.903   0.098     25.983  0.859   0.136

Table 1. Performance comparison of NVS on RE10K. The best and second-best results are highlighted. * denotes evaluation-time pose alignment (EPA) strategy.
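For readers unfamiliar with the first metric in each column group, PSNR is derived from the mean squared error between a rendered image and its ground-truth counterpart. The snippet below is a minimal sketch, assuming pixel values normalized to [0, 1] and images flattened into equal-length sequences; the function name is illustrative, and the exact evaluation code behind the table is not given here. SSIM and LPIPS require dedicated implementations and are not shown.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, halving the root-mean-square error raises PSNR by about 6 dB, which helps put the sub-decibel gaps between neighbouring rows into perspective.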

Only the time for 3D Gaussian reconstruction is reported.

Method          Small                     Medium                    Large                     Average
                PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓
Pose-Required
pixelSplat      22.088  0.655   0.284     25.525  0.777   0.197     28.527  0.854   0.139     25.889  0.780   0.194
MVSplat         21.412  0.640   0.290     25.150  0.772   0.198     28.457  0.854   0.137     25.561  0.775   0.195
Supervised Pose-Free
CoPoNeRF        18.651  0.551   0.485     20.654  0.595   0.418     22.654  0.652   0.343     20.950  0.606   0.406
Splatt3R        17.419  0.501   0.434     18.257  0.514   0.405     18.134  0.508   0.395     18.060  0.510   0.407
NoPoSplat*      23.087  0.685   0.258     25.624  0.777   0.193     28.043  0.841   0.144     25.961  0.781   0.189
Self-Supervised Pose-Free
SelfSplat       18.301  0.568   0.408     21.375  0.676   0.314     25.219  0.792   0.214     22.089  0.694   0.298
PF3plat         18.112  0.537   0.376     20.732  0.615   0.307     23.607  0.710   0.228     21.206  0.632   0.293
SPFSplat        22.667  0.665   0.262     25.620  0.773   0.192     28.607  0.856   0.136     26.070  0.781   0.186
SPFSplat*       23.676  0.708   0.243     26.351  0.801   0.182     29.170  0.870   0.131     26.796  0.807   0.176
SPFSplatV2      22.944  0.679   0.255     25.849  0.784   0.187     28.766  0.862   0.133     26.284  0.791   0.182
SPFSplatV2*     23.635  0.700   0.247     26.356  0.798   0.182     29.223  0.871   0.129     26.809  0.804   0.176
SPFSplatV2-L    23.640  0.706   0.225     26.272  0.801   0.166     28.938  0.868   0.120     26.674  0.806   0.162
SPFSplatV2-L*   23.937  0.710   0.224     26.489  0.803   0.165     29.188  0.871   0.118     26.917  0.809   0.160

Table 2. Performance comparison of NVS on the ACID dataset. The best and second best results are highlighted. * denotes evaluation-time pose alignment (EPA) strategy.

Method          ACID                      DTU                       DL3DV                     ScanNet++
                PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓
Pose-Required
pixelSplat      25.477  0.770   0.207     15.067  0.539   0.341     18.688  0.582   0.354     18.422  0.720   0.278
MVSplat         25.525  0.773   0.199     14.542  0.537   0.324     17.786  0.545   0.357     17.138  0.687   0.297
Supervised Pose-Free
NoPoSplat*      25.764  0.776   0.199     17.899  0.629   0.279     19.031  0.560   0.365     22.136  0.798   0.232
Self-Supervised Pose-Free
SelfSplat       22.204  0.686   0.316     13.249  0.434   0.441     15.047  0.410   0.498     13.277  0.538   0.534
PF3plat         20.726  0.610   0.308     12.972  0.407   0.464     15.773  0.458   0.417     16.471  0.688   0.303
SPFSplat        25.965  0.781   0.190     16.550  0.579   0.270     19.172  0.573   0.315     19.971  0.738   0.265
SPFSplat*       26.697  0.806   0.181     18.297  0.660   0.255     19.494  0.574   0.319     22.312  0.793   0.243
SPFSplatV2      26.220  0.789   0.185     16.793  0.584   0.265     19.439  0.584   0.304     20.919  0.771   0.243
SPFSplatV2*     26.802  0.805   0.179     18.506  0.663   0.246     19.978  0.607   0.302     22.776  0.812   0.227
SPFSplatV2-L    26.361  0.796   0.169     17.739  0.653   0.228     19.743  0.613   0.277     21.796  0.811   0.200
SPFSplatV2-L*   26.680  0.802   0.166     19.316  0.671   0.229     20.108  0.615   0.279     23.072  0.820   0.199

Table 3. Cross-dataset generalization on the NVS task. All methods are trained on RE10K and evaluated in a zero-shot setting on ACID, DTU, DL3DV, and ScanNet++. The best and second-best results are highlighted. * denotes evaluation-time pose alignment (EPA) strategy.

Method              RE10K                     ACID                      DL3DV                     ScanNet++
                    5°↑     10°↑    20°↑      5°↑     10°↑    20°↑      5°↑     10°↑    20°↑      5°↑     10°↑    20°↑
SfM
SP + SG             0.234   0.406   0.569     0.228   0.363   0.500     0.224   0.372   0.492     0.087   0.151   0.248
DUSt3R              0.336   0.541   0.702     0.118   0.279   0.470     0.275   0.490   0.686     0.109   0.284   0.500
MASt3R              0.281   0.494   0.672     0.138   0.312   0.507     0.332   0.593   0.772     0.139   0.336   0.549
VGGT                0.257   0.474   0.658     0.142   0.304   0.486     0.356   0.609   0.784     0.156   0.311   0.514
Pose-Free View Synthesis
NoPoSplat           0.571   0.727   0.833     0.335   0.496   0.644     0.470   0.646   0.762     0.207   0.403   0.641
PF3plat             0.187   0.398   0.613     0.060   0.165   0.340     0.118   0.281   0.479     0.058   0.204   0.415
SPFSplat            0.617   0.755   0.845     0.364   0.520   0.662     0.283   0.461   0.622     0.098   0.188   0.374
SPFSplat (PnP)      0.613   0.754   0.845     0.355   0.516   0.658     0.279   0.464   0.626     0.120   0.226   0.408
SPFSplatV2          0.638   0.776   0.863     0.387   0.541   0.672     0.369   0.534   0.694     0.144   0.281   0.487
SPFSplatV2 (PnP)    0.641   0.777   0.864     0.374   0.533   0.667     0.375   0.542   0.700     0.111   0.250   0.463
SPFSplatV2-L        0.645   0.780   0.864     0.379   0.539   0.671     0.420   0.582   0.711     0.184   0.400   0.630
SPFSplatV2-L (PnP)  0.657   0.786   0.867     0.375   0.535   0.668     0.429   0.587   0.716     0.183   0.400   0.627

Table 4. Pose estimation performance on RE10K, ACID, DL3DV, and ScanNet++ datasets. We use the model trained only on RE10K for all splat-based methods. The best and second-best results are highlighted.
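The table header lists only the angular thresholds and does not spell out the metric. A common choice in relative-pose evaluation is the area under the cumulative error curve (AUC) up to each threshold, normalized to [0, 1]. The sketch below works under that assumption and covers only the rotation part of the error; the function names are illustrative, and the actual evaluation protocol may differ (for example, by also taking the translation direction error into account).

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T r_gt) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))


def pose_auc(errors_deg, threshold_deg):
    """Normalized area under the recall-vs-error curve up to threshold_deg."""
    if not errors_deg:
        raise ValueError("at least one error value is required")
    n = len(errors_deg)
    xs, ys = [0.0], [0.0]
    for rank, err in enumerate(sorted(errors_deg), start=1):
        if err >= threshold_deg:
            break
        xs.append(err)
        ys.append(rank / n)  # fraction of pairs with error <= err
    # Extend the curve horizontally to the threshold.
    xs.append(threshold_deg)
    ys.append(ys[-1])
    # Trapezoidal integration, then normalize by the threshold.
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
        for i in range(len(xs) - 1)
    )
    return area / threshold_deg
```

Under this definition a score of 1.0 means every image pair has zero error, and a pair whose error exceeds the threshold contributes nothing, which is why the scores grow from the 5° to the 20° column.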

Qualitative Results

Qualitative comparison on RE10K (top three rows) and ACID (bottom three rows).

All methods are trained on RE10K and evaluated on ACID, DTU, DL3DV, and ScanNet++ (from top to bottom).

Qualitative comparisons of 3D Gaussians and rendered results.

BibTeX

@article{huang2025spfsplat,
  title={SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views},
  author={Huang, Ranran and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2509.17246},
  year={2025}
}