LiTo: Surface Light Field Tokenization

📝 Paper Summary

3D Generative Models, Latent 3D Representations, Neural Rendering

LiTo introduces a 3D latent representation that encodes the surface light field into compact tokens, enabling the generation of 3D objects with realistic view-dependent effects like specular highlights.

Core Problem

Most existing latent 3D representations capture only geometry or view-independent diffuse color, failing to represent realistic material effects like reflections and highlights.

Why it matters:

Realistic objects have complex materials (smooth, rough, translucent) that appear different from different viewing angles
Current generative models struggle to produce photorealistic assets because they simplify appearance to diffuse textures
Accurate light-matter interaction is essential for high-quality 3D asset creation in gaming, VR, and design

Concrete Example: When generating a shiny ceramic vase from an image, existing methods like TRELLIS produce a matte surface where reflections are 'baked in' as static texture. LiTo generates a surface where the specular highlight moves correctly as the camera rotates.

Key Novelty

Surface Light Field Tokenization (LiTo)

Encodes random subsamples of a surface light field (points + view direction + radiance) into a set of latent vectors rather than just encoding geometry or static color
Uses a specialized encoder to interpolate missing light field samples, allowing the model to learn a continuous representation of view-dependent appearance from sparse inputs
Decodes into 3D Gaussians with higher-order spherical harmonics to explicitly render complex lighting effects like Fresnel reflections
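
To make the Fresnel effect in the last item concrete, here is a minimal sketch of Schlick's approximation, a common closed-form stand-in for Fresnel reflectance. It is not taken from the paper; the function name and the default base reflectance `r0=0.04` (a typical value for dielectrics) are assumptions chosen for illustration.

```python
def schlick_reflectance(cos_theta: float, r0: float = 0.04) -> float:
    """Schlick's approximation of Fresnel reflectance.

    cos_theta: cosine of the angle between the view direction and the surface normal.
    r0: reflectance at normal incidence (assumed value, not from the paper).
    """
    return r0 + (1.0 - r0) * (1.0 - cos_theta) ** 5


# Reflectance grows as the view moves from head-on (cos = 1) to glancing (cos = 0).
head_on = schlick_reflectance(1.0)
oblique = schlick_reflectance(0.5)
glancing = schlick_reflectance(0.0)
```

A diffuse-only representation has no view direction input, so it cannot reproduce this kind of angle-dependent change in brightness.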

Architecture

The complete pipeline: Surface Light Field Sampling -> Encoder -> Latent Representation -> Decoders (Geometry & Gaussian)

Evaluation Highlights

Outperforms TRELLIS and TripoSR on visual quality metrics (LPIPS, PSNR) for single-image reconstruction
Achieves higher input fidelity than state-of-the-art methods while maintaining geometric accuracy
Demonstrates capability to generate view-dependent effects (specular highlights) that move consistently with camera viewpoint, unlike baseline methods

Breakthrough Assessment

8/10

Significant step forward in 3D generative modeling by successfully integrating view-dependent appearance into a latent space, addressing a major limitation of current geometry-focused or diffuse-only methods.

⚙️ Technical Details

Problem Definition

Setting: Learning a compact latent representation for object-centric 3D scenes that captures both geometry and surface light fields

Inputs: A set of samples from the surface light field X = {(x_i, d_i, c_i)} containing surface position, view direction, and color

Outputs: A set of latent tokens S used to reconstruct geometry via flow matching and appearance via 3D Gaussian Splatting
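
The input set X can be pictured as a list of (position, direction, color) records from which a random subset is drawn. The sketch below is illustrative only: the class name, the helper `toy_samples`, and uniform sampling without replacement are assumptions, since the summary does not specify the exact sampling strategy.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class LightFieldSample:
    position: Vec3  # x_i: point on the object surface
    view_dir: Vec3  # d_i: unit direction from which the point is observed
    color: Vec3     # c_i: RGB radiance seen along that direction


def subsample(samples: List[LightFieldSample], n: int, seed: int = 0) -> List[LightFieldSample]:
    """Draw n samples uniformly without replacement (assumed strategy)."""
    rng = random.Random(seed)
    return rng.sample(samples, n)


def toy_samples(count: int, seed: int = 0) -> List[LightFieldSample]:
    """Hypothetical stand-in for samples extracted from multiview RGBD renders."""
    rng = random.Random(seed)
    return [
        LightFieldSample(
            position=(rng.random(), rng.random(), rng.random()),
            view_dir=(0.0, 0.0, 1.0),
            color=(0.5, 0.5, 0.5),
        )
        for _ in range(count)
    ]


pool = toy_samples(1000)
subset = subsample(pool, 64)
```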

Pipeline Flow

Surface Light Field Sampling (multiview rendering)
Encoder (Perceiver IO with K-NN patchification)
Latent Flow Matching (Generative Model)
Decoders (Geometry Flow Matcher + Gaussian Renderer)

System Modules

Surface Light Field Encoder

Compresses dense light field samples into compact latent tokens

Model or implementation: Perceiver IO with custom K-NN patchification
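
The compression step in a Perceiver-style encoder comes from cross-attention: a fixed set of latent queries attends over all inputs, so the output size depends on the number of queries rather than the number of inputs. The following is a dependency-free sketch of that idea only. It uses a single head and omits the learned query, key, and value projections, so it should not be read as the paper's encoder.

```python
import math
from typing import List

Vector = List[float]


def cross_attend(queries: List[Vector], inputs: List[Vector]) -> List[Vector]:
    """Single-head cross-attention without learned projections (simplification).

    Cost is len(queries) * len(inputs) dot products, i.e. linear in the input count.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, x)) / math.sqrt(dim) for x in inputs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append(
            [sum(w * x[j] for w, x in zip(weights, inputs)) / norm for j in range(dim)]
        )
    return outputs


# Two latent queries summarize five inputs; the output always has two rows.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
input_features = [[0.2, 0.4]] * 5
compressed = cross_attend(latent_queries, input_features)
```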

Latent Flow Matching Model

Generates the distribution of 3D latents conditioned on a single input image

Model or implementation: Diffusion Transformer (DiT) initialized with zero positional encoding

Geometry Decoder (Decoding)

Reconstructs 3D surface geometry from latents

Model or implementation: Flow-matching velocity decoder (MLP-based)
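
At inference time, a velocity decoder of this kind is typically used by integrating the learned velocity field from t = 0 to t = 1. The sketch below shows plain Euler integration with a hand-written toy velocity field. How the paper's decoder is conditioned on the latent tokens, and which solver it uses, is not stated in this summary, so `toy_velocity` and the step count are assumptions.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [1.0, 2.0, 3.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Toy field that points straight at TARGET; stands in for a trained MLP."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


result = euler_sample(toy_velocity, [0.0, 0.0, 0.0], steps=10)
```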

Gaussian Decoder (Decoding)

Reconstructs view-dependent appearance

Model or implementation: Perceiver IO decoding to 3D Gaussian parameters
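
View-dependent color in a Gaussian representation is usually stored as spherical harmonic coefficients per Gaussian and evaluated for the current view direction. The sketch below evaluates only degrees 0 and 1 of the real spherical harmonic basis. The coefficient layout, the sign convention, and the example values are assumptions; the paper uses higher orders whose exact degree is not given in this summary.

```python
import math
from typing import List, Sequence

Y00 = 0.5 * math.sqrt(1.0 / math.pi)       # degree-0 basis constant
Y1 = math.sqrt(3.0 / (4.0 * math.pi))      # shared degree-1 basis constant


def sh_color(coeffs: Sequence[Sequence[float]], view_dir: Sequence[float]) -> List[float]:
    """Evaluate RGB color for a unit view direction.

    coeffs: four RGB triples ordered as (constant, y, z, x) terms (assumed layout).
    """
    x, y, z = view_dir
    basis = [Y00, Y1 * y, Y1 * z, Y1 * x]
    return [sum(b * c[ch] for b, c in zip(basis, coeffs)) for ch in range(3)]


diffuse_only = [[1.0, 1.0, 1.0], [0.0] * 3, [0.0] * 3, [0.0] * 3]
with_highlight = [[1.0, 1.0, 1.0], [0.0] * 3, [0.5, 0.5, 0.5], [0.0] * 3]

front = sh_color(with_highlight, (0.0, 0.0, 1.0))
back = sh_color(with_highlight, (0.0, 0.0, -1.0))
flat_front = sh_color(diffuse_only, (0.0, 0.0, 1.0))
flat_back = sh_color(diffuse_only, (0.0, 0.0, -1.0))
```

With only the constant term the color is identical from every direction, which corresponds to the diffuse case. The degree-1 term makes the same Gaussian brighter from one side than the other.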

Novel Architectural Elements

Joint encoding of geometry and view-dependent appearance (surface light field) into a unified latent space via random subsampling
K-Nearest Neighbor 'patchification' strategy to apply transformer attention to unstructured 3D point cloud data
Decoding pipeline that splits into a flow-matching geometry branch and a higher-order spherical harmonic Gaussian branch for appearance
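
The K-NN patchification idea can be sketched as follows: pick a set of center points, then group each center with its k nearest neighbors so that every group acts like an image patch for the transformer. This is a brute-force illustration under stated assumptions. The function name, random center selection, and the patch sizes are hypothetical, and the paper's actual center selection and neighbor search are not described in this summary.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def knn_patchify(points: List[Point], num_patches: int, k: int, seed: int = 0) -> Tuple[List[int], List[List[int]]]:
    """Return center indices and, for each center, the indices of its k nearest points."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(points)), num_patches)  # assumed: random centers
    patches = []
    for c in centers:
        # Brute-force neighbor search; fine for a toy example, too slow for 2^20 points.
        order = sorted(range(len(points)), key=lambda j: math.dist(points[c], points[j]))
        patches.append(order[:k])
    return centers, patches


rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
centers, patches = knn_patchify(cloud, num_patches=8, k=16)
```

Each patch is a fixed-size group of indices, so the per-patch features can be stacked into a regular array even though the underlying point cloud is unstructured.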

Modeling

Base Model: Custom Perceiver IO and DiT architectures

Training Method: Two-stage training: (1) Autoencoder training, (2) Latent Flow Matching training

Objective Functions:

Purpose: Ensure the latent representation captures accurate 3D geometry.

Formally: Flow matching loss on velocity field estimation L_geo = E[||v_t - V_theta(x_t, t)||^2].

Purpose: Ensure the latent representation captures view-dependent appearance.

Formally: Rendering loss comparing rendered Gaussians to ground truth images L_rgb = ||I_est - I_gt||_1 + lambda * L_LPIPS(I_est, I_gt).
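
The two objectives above can be sketched in a few lines. The flow matching example assumes a linear interpolation path x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0, which is one common choice and may differ from the paper's. It also omits conditioning on the latent tokens. The image loss covers only the L1 term, averaged over pixels, because the LPIPS term requires a pretrained network.

```python
import random
from typing import Callable, List

Vector = List[float]


def flow_matching_loss(model: Callable[[Vector, float], Vector], data: List[Vector], seed: int = 0) -> float:
    """Monte Carlo estimate of E[||v_t - V_theta(x_t, t)||^2] on a linear path (assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for x1 in data:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]             # sample from the prior
        t = rng.random()                                   # sample a time in [0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        v_target = [b - a for a, b in zip(x0, x1)]         # velocity of the linear path
        v_pred = model(x_t, t)
        total += sum((p - q) ** 2 for p, q in zip(v_pred, v_target))
    return total / len(data)


def l1_image_loss(img_est: Vector, img_gt: Vector) -> float:
    """Mean absolute error over flattened pixel values (L1 term only, no LPIPS)."""
    return sum(abs(a - b) for a, b in zip(img_est, img_gt)) / len(img_gt)


def zero_model(x_t: Vector, t: float) -> Vector:
    """Untrained placeholder that always predicts zero velocity."""
    return [0.0 for _ in x_t]


geo_loss = flow_matching_loss(zero_model, [[1.0, 2.0], [0.5, -0.5]])
rgb_l1 = l1_image_loss([0.0, 1.0], [0.5, 0.5])
```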

Training Data:

Synthetic object datasets rendered into 150 multiview RGBD images per object
160 million light field samples per object, subsampled to N=2^20 for encoder input

Key Hyperparameters:

latent_tokens_k: 8192
latent_dim_d: 32
input_samples_N: 1048576
batch_size: 256
encoder_params: 59.2 million
flow_model_params: 623 million
training_iterations_tokenizer: 90000
training_iterations_flow: 600000
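
A quick back-of-the-envelope calculation shows how much the tokenizer compresses its input under these settings. The count of 9 scalars per sample (3 for position, 3 for direction, 3 for RGB color) is an assumption about how each sample is stored, not a figure from the paper.

```python
latent_tokens_k = 8192
latent_dim_d = 32
input_samples_N = 2 ** 20          # 1048576, as listed above

SCALARS_PER_SAMPLE = 9             # assumed: position (3) + direction (3) + RGB (3)

latent_scalars = latent_tokens_k * latent_dim_d
input_scalars = input_samples_N * SCALARS_PER_SAMPLE
samples_per_token = input_samples_N // latent_tokens_k
compression_ratio = input_scalars / latent_scalars

print(latent_scalars)      # 262144
print(samples_per_token)   # 128
print(compression_ratio)   # 36.0
```

Under that assumption, each latent token summarizes 128 input samples on average, and the latent set is 36 times smaller than the raw encoder input.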

Compute: Tokenizer: 9 days on 64 GPUs. Flow Model: 20 days on 128 H100 GPUs.

Comparison to Prior Work

vs. TRELLIS: LiTo models view-dependent appearance (highlights) whereas TRELLIS averages features into view-independent diffuse color. LiTo also supports continuous coordinate transformations during training.
vs. TripoSR: LiTo provides a generative latent space for creating new variations, whereas TripoSR is strictly a reconstruction model.
vs. 3DTopia-XL: LiTo learns directly from RGBD images without requiring watertight meshes or expensive pre-optimization steps.
vs. Shap-E [not cited in paper]: Shap-E encodes objects into implicit function parameters; LiTo encodes explicit surface light field samples for higher fidelity textures.

Limitations

High computational cost for training (20 days on 128 H100 GPUs for the flow model, in addition to 9 days on 64 GPUs for the tokenizer)
Relies on depth maps for surface light field sampling, which may be noisy in real-world data
Geometry and appearance are decoded by separate heads, potentially leading to minor inconsistencies if not perfectly synchronized

Reproducibility

Code availability is not stated. The paper describes architectures (Perceiver IO, DiT) and hyperparameters in detail, but relies on large-scale proprietary or unspecified training datasets ('synthetic object datasets'), making exact reproduction difficult without access to the data.

📊 Experiments & Results

Evaluation Setup

Single-view 3D reconstruction and generation

Benchmarks:

Google Scanned Objects (GSO) (3D Reconstruction)

Metrics:

LPIPS (Perceptual similarity)
PSNR (Pixel-level accuracy)
Chamfer Distance (Geometry accuracy)
F-Score

Statistical methodology: Not explicitly reported in the paper
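
For readers unfamiliar with the two geometry metrics, the sketch below computes a symmetric Chamfer Distance and an F-Score between two small point sets. Definitions vary between papers (squared vs. unsquared distances, sum vs. mean, choice of threshold), and this summary does not say which variant the paper uses, so the formulas and the threshold `tau` here are assumptions.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_distances(src: List[Point], dst: List[Point]) -> List[float]:
    """Distance from each source point to its closest destination point (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer Distance: mean nearest distance in both directions, summed."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)


def f_score(pred: List[Point], gt: List[Point], tau: float) -> float:
    """Harmonic mean of precision and recall at distance threshold tau."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred_points, gt_points)
fs = f_score(pred_points, gt_points, tau=0.5)
```

Lower Chamfer Distance and higher F-Score both indicate a closer match to the ground-truth shape.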

Key Results

LiTo demonstrates superior visual reconstruction quality compared to baselines on the Google Scanned Objects dataset, particularly in perceptual metrics.

Benchmark	Metric	Baseline	This Paper	Δ
Google Scanned Objects	LPIPS	0.085	0.062	-0.023
Google Scanned Objects	PSNR	24.12	26.45	+2.33
Google Scanned Objects	Chamfer Distance	0.042	0.045	+0.003
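
The Δ column is easier to interpret once each metric's direction is taken into account: LPIPS and Chamfer Distance are better when lower, PSNR is better when higher. The short script below recomputes the deltas from the reported values and adds relative changes. The relative percentages are derived here and do not appear in the source.

```python
rows = [
    # (metric, baseline, this paper, which direction is better)
    ("LPIPS", 0.085, 0.062, "lower"),
    ("PSNR", 24.12, 26.45, "higher"),
    ("Chamfer Distance", 0.042, 0.045, "lower"),
]


def summarize(baseline: float, ours: float, better: str):
    """Return (absolute delta, relative delta in percent, whether the paper improves)."""
    delta = ours - baseline
    improved = delta < 0 if better == "lower" else delta > 0
    return round(delta, 3), round(100.0 * delta / baseline, 1), improved


summary = {name: summarize(base, ours, better) for name, base, ours, better in rows}
for name, (delta, rel, improved) in summary.items():
    print(f"{name}: delta={delta:+}, relative={rel:+}%, improved={improved}")
```

The two image metrics improve, while Chamfer Distance is slightly worse than the baseline.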

Experiment Figures

Qualitative comparison of generated objects against baselines like TRELLIS and TripoSR

Main Takeaways

Incorporating view-dependent light field information improves visual quality over diffuse-only baselines: in the reported Google Scanned Objects comparison, LPIPS drops from 0.085 to 0.062 and PSNR rises from 24.12 to 26.45.
The method successfully disentangles geometry and appearance, allowing for high-fidelity rendering of specular highlights.
Geometric accuracy (Chamfer Distance) is comparable to state-of-the-art geometry-focused methods, though slightly worse in the reported comparison (0.045 vs. 0.042), suggesting that adding complex appearance costs little in shape accuracy.
The generative model can synthesize consistent 3D assets from single images that respect the lighting conditions of the input.

📚 Prerequisite Knowledge

Prerequisites

Understanding of 3D Gaussian Splatting and Spherical Harmonics
Familiarity with Flow Matching for generative modeling
Knowledge of Transformer architectures (specifically Perceiver IO)

Key Terms

surface light field: A function describing the radiance (light color/intensity) leaving every point on a surface in every possible direction

spherical harmonics: A set of basis functions used to represent functions on a sphere; used here to model how color changes with viewing angle

flow matching: A generative modeling technique that learns a velocity field to transform a simple prior distribution into a complex data distribution

Perceiver IO: A transformer architecture designed to handle arbitrary input and output arrays, scaling linearly with input size

Gaussian Splatting: A rendering technique representing scenes as a cloud of 3D Gaussians that are rasterized to form an image

Fresnel reflection: A physical phenomenon where surface reflectivity increases at glancing angles

LPIPS: Learned Perceptual Image Patch Similarity, a metric that estimates how similar two images appear to a human observer by comparing deep network features (lower is better)

PSNR: Peak Signal-to-Noise Ratio, a standard pixel-level metric of image reconstruction quality, reported in decibels (higher is better)
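
As a concrete reference for the last term, PSNR is computed from the mean squared error as 10 * log10(MAX^2 / MSE). The sketch below assumes pixel values in [0, 1], so MAX = 1. The function name and the flattened-list image format are illustrative choices.

```python
import math
from typing import List


def psnr(img_est: List[float], img_gt: List[float], max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels for flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(img_est, img_gt)) / len(img_gt)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 per pixel gives MSE = 0.01, which is about 20 dB.
score = psnr([0.6, 0.6, 0.6, 0.6], [0.5, 0.5, 0.5, 0.5])
```

Because the scale is logarithmic, the 2.33 dB gain reported in the results corresponds to a reduction in mean squared error of roughly 40 percent.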