Scalable Normalizing Flows

The Idea

Normalizing flows that scale

One invertible network, one likelihood objective — a single backbone for images, video, language, and unified multimodality.

Built on the Transformer Autoregressive Flow (TARFlow), this line of work shows that normalizing flows — long overshadowed by diffusion models — are in fact capable generative models that scale. By keeping the model an exact, invertible, maximum-likelihood normalizing flow while borrowing the architecture of autoregressive Transformers, these invertible models reach high-resolution image synthesis, video world models, continuous-space language modeling, and unified multimodal generation.

A normalizing flow in one idea

x ⇄ z:  z = f(x),  x = f^-1(z),  z ~ 𝒩(0, I)

A normalizing flow is a single invertible network f that maps data x to simple Gaussian noise z — and runs backward to map it home.


Generate. Draw a z, push it back through f^-1 to a sample x.

Score. Map x to z and read off its exact likelihood — the same network, one objective.

p(x) = p₀(f(x)) · |det ∂f/∂x|
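As a minimal sketch of the formula above, assume f is the simplest possible flow: an elementwise affine map with a hypothetical per-dimension log_scale and shift. These names and values are illustrative and not taken from TARFlow. The same parameters serve both directions, and the log-determinant is just the negative sum of the log scales.

```python
import numpy as np

def forward(x, log_scale, shift):
    """Data -> noise: z = (x - shift) * exp(-log_scale), plus log|det df/dx| per sample."""
    z = (x - shift) * np.exp(-log_scale)
    # The Jacobian is diagonal with entries exp(-log_scale), identical for every sample.
    log_det = -np.sum(log_scale) * np.ones(x.shape[0])
    return z, log_det

def inverse(z, log_scale, shift):
    """Noise -> data: exactly undoes forward() with the same parameters."""
    return z * np.exp(log_scale) + shift

def log_prob(x, log_scale, shift):
    """log p(x) = log p0(f(x)) + log|det df/dx|, with p0 a standard Gaussian."""
    z, log_det = forward(x, log_scale, shift)
    d = x.shape[1]
    log_p0 = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)
    return log_p0 + log_det

# Illustrative parameters (assumed values, not learned).
rng = np.random.default_rng(0)
log_scale = np.array([0.3, -0.5, 1.0])
shift = np.array([1.0, -2.0, 0.5])

x = rng.normal(size=(4, 3))
z, _ = forward(x, log_scale, shift)      # score direction
x_rec = inverse(z, log_scale, shift)     # generate direction
ll = log_prob(x, log_scale, shift)       # exact log-likelihood per sample
```

Here x_rec matches x up to floating-point error, and ll agrees with the closed-form log-density of an independent Gaussian with mean shift and standard deviation exp(log_scale), since that is exactly the density this toy flow represents.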

Exact likelihood

Trained by exact maximum likelihood — one clean objective. No ELBO, no noise schedule, no discretization.
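For reference, the training loss implied by the change-of-variables formula can be written out. This is the standard derivation, assuming p₀ is the standard Gaussian 𝒩(0, I) on ℝ^d and 𝒟 denotes the training set (a symbol introduced here for illustration):

```latex
\begin{aligned}
\log p(x) &= \log p_0\bigl(f(x)\bigr) + \log\left|\det \frac{\partial f}{\partial x}\right| \\
-\log p(x) &= \tfrac{1}{2}\,\lVert f(x)\rVert_2^2 + \tfrac{d}{2}\log 2\pi
              - \log\left|\det \frac{\partial f}{\partial x}\right| \\
\mathcal{L} &= \mathbb{E}_{x \sim \mathcal{D}}\bigl[-\log p(x)\bigr]
\end{aligned}
```

The first term pulls the encoded data toward the origin, and the log-determinant term keeps f from collapsing its inputs to a point; the constant can be dropped during optimization.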

Invertible & lossless

x ↔ z is exactly reversible — encoding and generation share the very same network and weights.

Continuous, end-to-end

x stays in ℝ^d throughout, with no codebook and no quantization. The backbone is the same Transformer machinery that LLMs already run at scale.

Normalizing flows were always there — RealNVP, Glow, MAF/IAF, Flow++ — and always kept exact likelihood, but lost ground to GANs and diffusion on sample quality. The Transformer revival changes the verdict: TARFlow gets diffusion-level samples from a stand-alone flow, and the work below scales that one backbone to new modalities.

Why now

Same principle, three ingredients that finally make it scale

Classical flows leaned on hand-designed coupling layers — expressive enough for densities, but their samples stayed behind GANs and diffusion. The architecture, not the principle, was the bottleneck. Three ingredients close the gap:

Deep–shallow Transformer flow

Then: shallow stacks of affine coupling / 1×1 convs — limited capacity, hard to scale.
Now: one deep autoregressive Transformer block carries most of the capacity (acting like a language model over tokens), plus a few cheap shallow blocks with alternating scan direction for local detail — parallelizable in the inverse.
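The block structure can be illustrated with a toy autoregressive affine flow. In this sketch the causal Transformer is replaced by a hypothetical strictly lower-triangular linear map (causal_params), and every block has the same size, so it shows only the causal affine transform and the alternating scan direction, not the deep–shallow capacity split. All names and shapes are assumptions for illustration.

```python
import numpy as np

def causal_params(x, W_mu, W_alpha):
    """Toy stand-in for a causal Transformer: token t only sees tokens < t."""
    T = x.shape[0]
    mask = np.tril(np.ones((T, T)), k=-1)      # strictly lower-triangular
    mu = (mask * W_mu) @ x
    alpha = np.tanh((mask * W_alpha) @ x)      # bounded log-scale for stability
    return mu, alpha

def flow_forward(x, blocks):
    """Data -> noise. Every token is transformed in one parallel pass per block."""
    log_det = 0.0
    for k, (W_mu, W_alpha) in enumerate(blocks):
        if k % 2 == 1:
            x = x[::-1]                        # alternate the scan direction
        mu, alpha = causal_params(x, W_mu, W_alpha)
        x = (x - mu) * np.exp(-alpha)
        log_det += -alpha.sum()                # triangular Jacobian, flips have |det| = 1
    return x, log_det

def flow_inverse(z, blocks):
    """Noise -> data. Each block is undone token by token, in causal order."""
    T = z.shape[0]
    for k in reversed(range(len(blocks))):
        W_mu, W_alpha = blocks[k]
        x = np.zeros_like(z)
        for t in range(T):
            mu, alpha = causal_params(x, W_mu, W_alpha)  # row t depends on x[:t] only
            x[t] = z[t] * np.exp(alpha[t]) + mu[t]
        if k % 2 == 1:
            x = x[::-1]                        # undo the flip applied in forward
        z = x
    return z

rng = np.random.default_rng(0)
T, D = 6, 2                                    # tokens, channels per token
blocks = [(rng.normal(scale=0.5, size=(T, T)),
           rng.normal(scale=0.5, size=(T, T))) for _ in range(3)]

x = rng.normal(size=(T, D))
z, logdet = flow_forward(x, blocks)
x_rec = flow_inverse(z, blocks)
```

Because each token's shift and log-scale depend only on earlier tokens, the Jacobian of every block is triangular and its log-determinant is the negative sum of the log-scales. Reversing the token order between blocks lets later blocks condition on context that earlier blocks could not see.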

Noise-augmented training

Then: exact MLE on clean data overfits high-frequency detail and yields noisy samples.
Now: Gaussian noise augmentation during training, paired with a small post-hoc denoiser at sampling. Together they let the flow produce clean, sharp images while staying an exact-likelihood model.
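The page does not specify which denoiser is used. One standard choice for a model trained on Gaussian-noised data is Tweedie's formula, x̂ = y + σ² ∇_y log p(y), which uses the score of the noisy density. The sketch below assumes that choice and checks it on a 1-D Gaussian toy problem where the score is known in closed form. In a real flow the score would come from differentiating the model's log-likelihood. The function names are hypothetical.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Training-time augmentation: y = x + sigma * eps, eps ~ N(0, 1)."""
    return x + sigma * rng.normal(size=x.shape)

def score_gaussian(y, var_y):
    """Score of a zero-mean Gaussian with variance var_y (toy stand-in for the model score)."""
    return -y / var_y

def tweedie_denoise(y, sigma, score_fn):
    """Posterior-mean estimate of the clean sample given the noisy one."""
    return y + sigma ** 2 * score_fn(y)

rng = np.random.default_rng(0)
data_std, sigma = 2.0, 0.5                       # assumed toy values

x = data_std * rng.normal(size=100_000)          # clean data
y = add_noise(x, sigma, rng)                     # what the flow would be trained on
var_y = data_std ** 2 + sigma ** 2               # variance of the noisy density

x_hat = tweedie_denoise(y, sigma, lambda v: score_gaussian(v, var_y))

mse_noisy = np.mean((y - x) ** 2)
mse_denoised = np.mean((x_hat - x) ** 2)
```

In this toy setting the denoiser reduces to shrinking y by the factor data_std² / (data_std² + σ²), which is the exact posterior mean, so mse_denoised comes out below mse_noisy.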

Classifier-free guidance

Then: no quality/diversity knob — flows generated unconditionally from the prior.
Now: guidance applied in the deep block (a new recipe for flows) trades diversity for fidelity, just like diffusion — pushing samples to diffusion-level quality from one MLE objective.
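The exact guidance recipe is not given on this page. As a rough sketch, assume the generic classifier-free guidance extrapolation is applied to the per-token affine parameters (shift mu and log-scale alpha) that the autoregressive block predicts with and without the condition. The function names and the weight convention (w = 0 means no guidance) are assumptions for illustration.

```python
import numpy as np

def guide(cond, uncond, w):
    """Linear extrapolation: w = 0 returns the conditional prediction unchanged;
    larger w pushes further away from the unconditional one."""
    return cond + w * (cond - uncond)

def guided_inverse_step(z_t, mu_c, alpha_c, mu_u, alpha_u, w):
    """One sampling step for a single token, using guided affine parameters."""
    mu = guide(mu_c, mu_u, w)
    alpha = guide(alpha_c, alpha_u, w)
    return z_t * np.exp(alpha) + mu

# Illustrative values for one token with two channels.
z_t = np.array([0.5, -1.0])
mu_c, alpha_c = np.array([1.0, 0.0]), np.array([0.1, -0.2])   # with condition
mu_u, alpha_u = np.array([0.0, 0.0]), np.array([0.0, 0.0])    # without condition

x_plain = guided_inverse_step(z_t, mu_c, alpha_c, mu_u, alpha_u, w=0.0)
x_guided = guided_inverse_step(z_t, mu_c, alpha_c, mu_u, alpha_u, w=1.0)
```

With w = 0 the step is ordinary conditional sampling. Raising w moves the shift and scale further in the direction the condition suggests, which typically trades diversity for fidelity.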

TARFlow

The core architecture — a Transformer autoregressive flow that models images directly, with diffusion-level samples from a stand-alone flow.

STARFlow

A deep–shallow design scales the flow in latent space to high-resolution, text-conditional image synthesis.

STARFlow-V

Causal roll-out extends the flow to video — an end-to-end, likelihood-based world model.

STARFlow2

One causal stream unifies the flow with a language model for multimodal understanding and generation.
