Scaling View Synthesis Transformers

1MIT, 2PlayStation
* Equal contribution

TL;DR: view synthesis transformers that achieve SoTA PSNR with 3× fewer FLOPs.

Scaling laws comparison
Performance vs. compute scaling comparison between SVSM and LVSM on RealEstate10K.

Abstract

Recently, geometry-free view synthesis transformers have achieved state-of-the-art results in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. However, the specific factors that govern how their performance scales with compute remain poorly understood. In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. Most significantly, we find that an encoder–decoder architecture, which was previously found to be less scalable, can in fact be compute-optimal. We attribute the inferior performance of previous encoder–decoder methods to certain architectural choices and inconsistent training compute across comparisons. Across several compute levels, we demonstrate that our encoder–decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance–compute Pareto frontier, and outperforms the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

Method Overview

Method overview diagram
Figure 1: Architectures of the decoder-only LVSM (a), the current state of the art, and our SVSM (b). We demonstrate that an encoder-decoder transformer architecture outperforms the prior state-of-the-art model while dramatically reducing the training compute budget.

Encode once, decode many times.

Decoder-only LVSM recomputes context information for every target view it renders. SVSM instead uses an encoder-decoder design: a bidirectional encoder processes the context images once into a set of latent tokens, then a cross-attention decoder renders each target view from this fixed representation. For $T$ target views, this reduces rendering cost from $T$ full passes over the context tokens to a single encoder pass plus $T$ lightweight decoder passes, a significant saving when rendering many views. The tradeoff: unlike LVSM, the encoder cannot discard target-irrelevant information, since it does not know which views will be queried. But SVSM's compute efficiency lets us scale up model size and training steps such that, normalized by compute budget, SVSM significantly outperforms LVSM.
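As a rough illustration, here is a minimal PyTorch sketch of the two rendering loops; it is not the released implementation, and the token counts, dimensions, and module choices are assumptions made only for the example.

import torch
import torch.nn as nn

D, N_CTX, N_TGT, T = 256, 512, 64, 8   # hidden dim, context tokens, tokens per target view, target views

block = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(block, num_layers=4)          # stands in for either model's self-attention stack
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

context_tokens = torch.randn(1, N_CTX, D)                      # tokenized, posed context images
target_queries = [torch.randn(1, N_TGT, D) for _ in range(T)]  # pose/ray tokens, one set per target view

# Decoder-only (LVSM-style): the context is re-processed for every target view.
lvsm_renders = [backbone(torch.cat([context_tokens, q], dim=1))[:, N_CTX:] for q in target_queries]

# Encoder-decoder (SVSM-style): encode the context once, then render each
# target view by cross-attending into the fixed latent tokens.
latents = backbone(context_tokens)                             # computed once per scene
svsm_renders = [cross_attn(q, latents, latents)[0] for q in target_queries]

A real decoder would interleave self-attention over the target tokens with cross-attention into the latents; the single cross-attention call above only sketches why the expensive context pass through the backbone runs once per scene instead of once per target view.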

Why is this better? Effective Batch Size.

Training cost scales with both the number of scenes per batch (batch size $B$) and the number of target views rendered per scene ($T$). We find empirically that what matters is their product, the effective batch size $B \cdot T$: configurations with the same $B \cdot T$ achieve nearly identical performance (within ±0.2 PSNR).

Effective batch size scaling law
Effective batch size scaling: configurations with the same $B \cdot T$ achieve nearly identical performance.
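As a quick sanity check (a hypothetical snippet; the specific $(B, T)$ values are chosen only for illustration), a step over 8 scenes with 4 target views each supervises exactly as many rendered views as one over 4 scenes with 8 target views each:

def effective_batch_size(num_scenes: int, targets_per_scene: int) -> int:
    # Number of per-view rendering losses that contribute to one optimizer step.
    return num_scenes * targets_per_scene

# Different (B, T) splits, same effective batch size of 32 rendered views per step.
assert effective_batch_size(8, 4) == effective_batch_size(4, 8) == 32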

For decoder-only LVSM, per-step training compute scales as

$C_{\text{LVSM}} \propto B \cdot T,$

since every target view requires a full forward pass over the context and target tokens together. So there is no advantage to tuning $T$: all configurations at a fixed $B \cdot T$ cost the same. In contrast, SVSM scales as

$C_{\text{SVSM}} \propto B \cdot (c_{\text{enc}} + T \cdot c_{\text{dec}}),$

where $c_{\text{enc}}$ is the per-scene encoder cost and $c_{\text{dec}}$ is the per-target decoder cost. By reducing $B$ and increasing $T$ at a fixed effective batch size, we achieve the same performance with lower compute. This justifies our encoder-decoder design, which decodes multiple targets efficiently.
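To make the scaling difference concrete, here is a toy cost model in Python; the constants c_enc = 0.7 and c_dec = 0.3 are illustrative assumptions, not measured costs.

def lvsm_compute(B, T, c_pass=1.0):
    # Decoder-only: every target view pays a full pass over context + target tokens.
    return B * T * c_pass

def svsm_compute(B, T, c_enc=0.7, c_dec=0.3):
    # Encoder-decoder: one encoder pass per scene, one cheaper decoder pass per target view.
    return B * (c_enc + T * c_dec)

# Sweep (B, T) splits at a fixed effective batch size B * T = 64.
for B, T in [(64, 1), (16, 4), (4, 16)]:
    print(f"B={B:2d} T={T:2d}  LVSM={lvsm_compute(B, T):5.1f}  SVSM={svsm_compute(B, T):5.1f}")
# LVSM stays at 64.0 for every split, while SVSM drops to 30.4 and then 22.0,
# which is why trading batch size for more targets per scene saves compute.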

Results

Scaling Laws: compute-efficient Pareto frontier.

We evaluate our architecture rigorously by training both models across a range of compute budgets. We also test on several datasets with different numbers of context views: RE10K (2 context views), DL3DV (4 context views), and Objaverse (8 context views). In all cases, SVSM (in blue) consistently requires far less training compute to match the performance of LVSM.

Scaling law: RE10K experiment
RE10K: 2 context views. SVSM achieves equal performance with 3× less compute.
Scaling law: DL3DV experiment
DL3DV: 4 context views. PRoPE enables SVSM to match LVSM in scaling with a lower compute cost.
Scaling law: Objaverse experiment
Objaverse: 8 context views. SVSM has a substantial compute advantage. Bottlenecked variants scale identically.

Qualitative Results: RE10K, DL3DV, Objaverse.

RE10K qualitative results
Qualitative comparison on RE10K dataset (2 context views).
DL3DV qualitative results
Qualitative comparison on DL3DV dataset (4 context views).
Objaverse qualitative results
Multiview consistency of SVSM outputs on Objaverse.
Novel view synthesis on RE10K dataset with full compute
SVSM (ours) on Objaverse with limited compute
LVSM (decoder-only) on Objaverse with limited compute

Citation

If you find this work useful, please cite:

@article{kim2026svsm,
  title={Scaling View Synthesis Transformers},
  author={Evan Kim and Hyunwoo Ryu and Thomas W. Mitchel and Vincent Sitzmann},
  journal={arXiv preprint arXiv:2602.xxxxx},
  year={2026}
}

Acknowledgements

We thank [names] for helpful discussions. This work was supported by [funding sources].