SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Qualitative comparison on ACID.

Qualitative comparison on RE10K.

Qualitative comparison on RE10K -> ACID.

Qualitative comparison on RE10K -> DL3DV.

Qualitative comparison on RE10K -> ScanNet++.

Qualitative comparison on RE10K -> DTU.

Abstract

We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training and inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. A masked attention mechanism is introduced to efficiently estimate target poses during training, while a reprojection loss enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap, and surpasses recent methods that rely on geometric supervision for relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets.

Methodology

SPFSplatV2 employs a shared backbone with three specialized heads that simultaneously predict Gaussian centers, additional Gaussian parameters, and camera poses from unposed images in a canonical space, with the first input view as the reference. Encoder tokens, concatenated with a learnable pose token and an optional embedding of ground-truth intrinsics, are fed into the decoder, which employs masked attention to prevent context tokens from attending to target tokens, so that Gaussian reconstruction remains independent of target-view information. The 3D Gaussians are optimized via a rendering loss using the predicted target poses, while a reprojection loss enforces alignment between Gaussian centers and their corresponding pixels using the predicted context poses. By jointly optimizing Gaussians and camera poses, the pipeline enhances geometric consistency and improves reconstruction quality.
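The two mechanisms described above, the decoder attention mask and the reprojection loss, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The token ordering (context tokens first, then target tokens), the choice to let target tokens attend to every token, the canonical-to-camera pose convention, the mean Euclidean pixel error, and all function names (`build_decoder_mask`, `masked_softmax`, `reprojection_loss`) are assumptions made for illustration.

```python
import numpy as np


def build_decoder_mask(n_context: int, n_target: int) -> np.ndarray:
    """Return a boolean (N, N) mask; True means the query row may attend to the key column.

    Assumed token order: [context tokens | target tokens].
    """
    n = n_context + n_target
    allowed = np.ones((n, n), dtype=bool)
    # Context queries must not see target keys, so the Gaussians predicted
    # from context tokens cannot depend on target-view information.
    allowed[:n_context, n_context:] = False
    return allowed


def masked_softmax(scores: np.ndarray, allowed: np.ndarray) -> np.ndarray:
    """Row-wise softmax over attention scores, with disallowed entries set to zero weight."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def reprojection_loss(centers, pixels, K, R, t) -> float:
    """Mean pixel distance between projected Gaussian centers and their source pixels.

    centers: (N, 3) Gaussian centers in the canonical (reference-view) frame.
    pixels:  (N, 2) pixel coordinates each center was predicted from.
    K:       (3, 3) camera intrinsics.
    R, t:    predicted context pose, assumed here to map canonical -> camera.
    """
    cam = centers @ R.T + t          # canonical frame -> camera frame
    proj = cam @ K.T                 # apply intrinsics
    uv = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return float(np.mean(np.linalg.norm(uv - pixels, axis=1)))


# Example: 2 context tokens and 3 target tokens.
mask = build_decoder_mask(2, 3)
weights = masked_softmax(np.zeros((5, 5)), mask)
```

With this mask, the rows belonging to context tokens put zero attention weight on target columns, while target rows can still read from the context. The reprojection term is zero exactly when every Gaussian center projects back onto the pixel it was predicted from under the predicted context pose.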

Quantitative Results

Method	Small			Medium			Large			Average
Method	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
Pose-Required
pixelSplat	20.277	0.719	0.265	23.726	0.811	0.180	27.152	0.880	0.121	23.859	0.808	0.184
MVSplat	20.371	0.725	0.250	23.808	0.814	0.172	27.466	0.885	0.115	24.012	0.812	0.175
Supervised Pose-Free
CoPoNeRF	17.393	0.585	0.462	18.813	0.616	0.392	20.464	0.652	0.318	18.938	0.619	0.388
Splatt3R	17.789	0.582	0.375	18.828	0.607	0.330	19.243	0.593	0.317	18.688	0.596	0.337
NoPoSplat^*	22.514	0.784	0.210	24.899	0.839	0.160	27.411	0.883	0.119	25.033	0.838	0.160
Self-Supervised Pose-Free
SelfSplat	14.828	0.543	0.469	18.857	0.679	0.328	23.338	0.798	0.208	19.152	0.680	0.328
PF3plat	18.358	0.668	0.298	20.953	0.741	0.231	23.491	0.795	0.179	21.042	0.739	0.233
SPFSplat	22.897	0.792	0.201	25.334	0.847	0.153	27.947	0.894	0.110	25.484	0.847	0.153
SPFSplat^*	23.178	0.796	0.200	25.695	0.853	0.151	28.377	0.899	0.111	25.845	0.852	0.152
SPFSplatV2	23.123	0.800	0.195	25.542	0.853	0.149	28.143	0.897	0.110	25.693	0.853	0.149
SPFSplatV2^*	23.456	0.806	0.193	26.030	0.862	0.145	28.682	0.905	0.107	26.157	0.861	0.146
SPFSplatV2-L	23.138	0.804	0.184	25.518	0.856	0.136	28.081	0.899	0.099	25.668	0.855	0.137
SPFSplatV2-L^*	23.329	0.804	0.183	25.863	0.861	0.134	28.456	0.903	0.098	25.983	0.859	0.136

Table 1. Performance comparison of NVS on RE10K. The best and second-best results are highlighted. * denotes evaluation-time pose alignment (EPA) strategy.

Only the time for 3D Gaussian reconstruction is reported.

Method	Small			Medium			Large			Average
Method	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
Pose-Required
pixelSplat	22.088	0.655	0.284	25.525	0.777	0.197	28.527	0.854	0.139	25.889	0.780	0.194
MVSplat	21.412	0.640	0.290	25.150	0.772	0.198	28.457	0.854	0.137	25.561	0.775	0.195
Supervised Pose-Free
CoPoNeRF	18.651	0.551	0.485	20.654	0.595	0.418	22.654	0.652	0.343	20.950	0.606	0.406
Splatt3R	17.419	0.501	0.434	18.257	0.514	0.405	18.134	0.508	0.395	18.060	0.510	0.407
NoPoSplat^*	23.087	0.685	0.258	25.624	0.777	0.193	28.043	0.841	0.144	25.961	0.781	0.189
Self-Supervised Pose-Free
SelfSplat	18.301	0.568	0.408	21.375	0.676	0.314	25.219	0.792	0.214	22.089	0.694	0.298
PF3plat	18.112	0.537	0.376	20.732	0.615	0.307	23.607	0.710	0.228	21.206	0.632	0.293
SPFSplat	22.667	0.665	0.262	25.620	0.773	0.192	28.607	0.856	0.136	26.070	0.781	0.186
SPFSplat^*	23.676	0.708	0.243	26.351	0.801	0.182	29.170	0.870	0.131	26.796	0.807	0.176
SPFSplatV2	22.944	0.679	0.255	25.849	0.784	0.187	28.766	0.862	0.133	26.284	0.791	0.182
SPFSplatV2^*	23.635	0.700	0.247	26.356	0.798	0.182	29.223	0.871	0.129	26.809	0.804	0.176
SPFSplatV2-L	23.640	0.706	0.225	26.272	0.801	0.166	28.938	0.868	0.120	26.674	0.806	0.162
SPFSplatV2-L^*	23.937	0.710	0.224	26.489	0.803	0.165	29.188	0.871	0.118	26.917	0.809	0.160

Table 2. Performance comparison of NVS on the ACID dataset. The best and second-best results are highlighted. * denotes evaluation-time pose alignment (EPA) strategy.

Method	ACID			DTU			DL3DV			ScanNet++
Method	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
Pose-Required
pixelSplat	25.477	0.770	0.207	15.067	0.539	0.341	18.688	0.582	0.354	18.422	0.720	0.278
MVSplat	25.525	0.773	0.199	14.542	0.537	0.324	17.786	0.545	0.357	17.138	0.687	0.297
Supervised Pose-Free
NoPoSplat^*	25.764	0.776	0.199	17.899	0.629	0.279	19.031	0.560	0.365	22.136	0.798	0.232
Self-Supervised Pose-Free
SelfSplat	22.204	0.686	0.316	13.249	0.434	0.441	15.047	0.410	0.498	13.277	0.538	0.534
PF3plat	20.726	0.610	0.308	12.972	0.407	0.464	15.773	0.458	0.417	16.471	0.688	0.303
SPFSplat	25.965	0.781	0.190	16.550	0.579	0.270	19.172	0.573	0.315	19.971	0.738	0.265
SPFSplat^*	26.697	0.806	0.181	18.297	0.660	0.255	19.494	0.574	0.319	22.312	0.793	0.243
SPFSplatV2	26.220	0.789	0.185	16.793	0.584	0.265	19.439	0.584	0.304	20.919	0.771	0.243
SPFSplatV2^*	26.802	0.805	0.179	18.506	0.663	0.246	19.978	0.607	0.302	22.776	0.812	0.227
SPFSplatV2-L	26.361	0.796	0.169	17.739	0.653	0.228	19.743	0.613	0.277	21.796	0.811	0.200
SPFSplatV2-L^*	26.680	0.802	0.166	19.316	0.671	0.229	20.108	0.615	0.279	23.072	0.820	0.199

Table 3. Cross-dataset generalization on the NVS task. All methods are trained on RE10K and evaluated in a zero-shot setting on ACID, DTU, DL3DV, and ScanNet++. The best and second-best results are highlighted. * denotes evaluation-time pose alignment (EPA) strategy.

Method	RE10K			ACID			DL3DV			ScanNet++
Method	5°↑	10°↑	20°↑	5°↑	10°↑	20°↑	5°↑	10°↑	20°↑	5°↑	10°↑	20°↑
SfM
SP + SG	0.234	0.406	0.569	0.228	0.363	0.500	0.224	0.372	0.492	0.087	0.151	0.248
DUSt3R	0.336	0.541	0.702	0.118	0.279	0.470	0.275	0.490	0.686	0.109	0.284	0.500
MASt3R	0.281	0.494	0.672	0.138	0.312	0.507	0.332	0.593	0.772	0.139	0.336	0.549
VGGT	0.257	0.474	0.658	0.142	0.304	0.486	0.356	0.609	0.784	0.156	0.311	0.514
Pose-Free View Synthesis
NoPoSplat	0.571	0.727	0.833	0.335	0.496	0.644	0.470	0.646	0.762	0.207	0.403	0.641
PF3plat	0.187	0.398	0.613	0.060	0.165	0.340	0.118	0.281	0.479	0.058	0.204	0.415
SPFSplat	0.617	0.755	0.845	0.364	0.520	0.662	0.283	0.461	0.622	0.098	0.188	0.374
SPFSplat (PnP)	0.613	0.754	0.845	0.355	0.516	0.658	0.279	0.464	0.626	0.120	0.226	0.408
SPFSplatV2	0.638	0.776	0.863	0.387	0.541	0.672	0.369	0.534	0.694	0.144	0.281	0.487
SPFSplatV2 (PnP)	0.641	0.777	0.864	0.374	0.533	0.667	0.375	0.542	0.700	0.111	0.250	0.463
SPFSplatV2-L	0.645	0.780	0.864	0.379	0.539	0.671	0.420	0.582	0.711	0.184	0.400	0.630
SPFSplatV2-L (PnP)	0.657	0.786	0.867	0.375	0.535	0.668	0.429	0.587	0.716	0.183	0.400	0.627

Table 4. Pose estimation performance on RE10K, ACID, DL3DV, and ScanNet++ datasets. We use the model trained only on RE10K for all splat-based methods. The best and second-best results are highlighted.
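The 5°, 10°, and 20° columns in the pose table are angular error thresholds. As a rough illustration of how such an error can be measured, the sketch below computes the geodesic angle between a predicted and a ground-truth rotation matrix, plus the share of errors under a threshold. The page does not state how the table aggregates errors (for example, accuracy or area under the curve) or how translation error is handled, so `fraction_within` is only one hypothetical aggregation and both function names are illustrative.

```python
import numpy as np


def rotation_angle_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_gt
    # For a rotation by angle theta, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def fraction_within(errors_deg, threshold_deg: float) -> float:
    """Share of angular errors at or below the threshold (hypothetical aggregation)."""
    errors = np.asarray(errors_deg, dtype=float)
    return float(np.mean(errors <= threshold_deg))


# Example: a 30-degree rotation about the z-axis compared with the identity.
theta = np.radians(30.0)
R_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
error = rotation_angle_deg(R_z, np.eye(3))
```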

Qualitative Results

Qualitative comparison on RE10K (top three rows) and ACID (bottom three rows).

All methods are trained on RE10K and evaluated on ACID, DTU, DL3DV, and ScanNet++ (from top to bottom).

Qualitative comparisons of 3D Gaussians and rendered results.

BibTeX

@article{huang2025spfsplat,
  title={SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views},
  author={Huang, Ranran and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2509.17246},
  year={2025}
}