Next-Generation Generative Models

Scalable Normalizing Flows

A research program reviving normalizing flows as scalable, end-to-end, likelihood-based generators — across images, video, language, and unified multimodality.

Projects

The TARFlow family

A connected line of work taking normalizing flows from a single architecture to a scalable, general-purpose generative paradigm.

The Idea

Normalizing flows that scale

One invertible network, one likelihood objective — a single backbone for images, video, language, and unified multimodality.

Built on the Transformer Autoregressive Flow (TARFlow), this line of work shows that normalizing flows — long overshadowed by diffusion models — are in fact capable generative models that scale. By keeping the model an exact, invertible, maximum-likelihood normalizing flow while borrowing the architecture of autoregressive Transformers, these invertible models reach high-resolution image synthesis, video world models, continuous-space language modeling, and unified multimodal generation.

A normalizing flow in one idea
x ⇄ z ~ 𝒩(0, I)   (f: x → z, f⁻¹: z → x)

A normalizing flow is a single invertible network f that maps data x to simple Gaussian noise z — and runs backward to map it home.

f⁻¹
Generate. Draw a z and push it back through f⁻¹ to get a sample x.
f
Score. Map x to z and read off its exact likelihood, using the same network and one objective.
p(x) = p₀(f(x)) · |det ∂f/∂x|, with p₀ = 𝒩(0, I)
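
In log form, the change-of-variables formula above becomes a sum, which is the quantity maximized during training. The block below is a standard restatement rather than a formula taken from the papers; d (the data dimension), N (the number of training examples), and θ (the network parameters) are notation introduced here for illustration.

```latex
\begin{aligned}
\log p_\theta(x) &= \log p_0\big(f_\theta(x)\big)
                  + \log\left|\det \frac{\partial f_\theta(x)}{\partial x}\right|, \\
\log p_0(z) &= -\tfrac{1}{2}\lVert z \rVert^2 - \tfrac{d}{2}\log(2\pi), \\
\theta^\star &= \arg\max_\theta \; \frac{1}{N}\sum_{n=1}^{N} \log p_\theta(x_n).
\end{aligned}
```

The first term pulls each z toward the origin, and the log-determinant term rewards f for expanding volume around the data, which keeps the model from collapsing every input to zero.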
Exact likelihood

Trained by exact maximum likelihood — one clean objective. No ELBO, no noise schedule, no discretization.

Invertible & lossless

x ↔ z is exactly reversible — encoding and generation share the very same network and weights.

Continuous, end-to-end

x stays in ℝᵈ throughout: no codebook, no quantization. The backbone is the same Transformer machinery that LLMs already run at scale.
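
The three properties above can be checked numerically on a toy flow. The sketch below is not any of the models on this page: it assumes a hypothetical elementwise affine map with fixed parameters `mu` and `log_sigma`, chosen only because its exact density is known in closed form. It uses the same function pair for scoring and generation, and compares the change-of-variables likelihood against the analytic Gaussian density.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# Hypothetical fixed parameters of a toy elementwise affine flow.
mu = np.array([1.0, -2.0, 0.5])
log_sigma = np.array([0.3, -0.1, 0.7])

def f(x):
    """Data -> noise. Returns z and log|det df/dx| for each row of x."""
    z = (x - mu) * np.exp(-log_sigma)
    log_det = np.full(x.shape[0], -np.sum(log_sigma))
    return z, log_det

def f_inv(z):
    """Noise -> data, using the same parameters as f."""
    return z * np.exp(log_sigma) + mu

def log_prob(x):
    """Exact log-likelihood: log p0(f(x)) + log|det df/dx|."""
    z, log_det = f(x)
    log_p0 = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    return log_p0 + log_det

# Generate: draw z from the prior and push it back through f_inv.
z = rng.standard_normal((5, d))
x = f_inv(z)

# Invertibility: mapping the samples forward again recovers z.
z_back, _ = f(x)
round_trip_error = float(np.max(np.abs(z - z_back)))

# Exactness: the flow likelihood matches the analytic Gaussian density.
sigma = np.exp(log_sigma)
reference = np.sum(
    -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi),
    axis=1,
)
likelihood_gap = float(np.max(np.abs(log_prob(x) - reference)))
```

Both `round_trip_error` and `likelihood_gap` should sit at floating-point precision. A real flow replaces the fixed affine map with a learned network but keeps the same two entry points.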

Normalizing flows were always there — RealNVP, Glow, MAF/IAF, Flow++ — and always kept exact likelihood, but lost ground to GANs and diffusion on sample quality. The Transformer revival changes the verdict: TARFlow gets diffusion-level samples from a stand-alone flow, and the work below scales that one backbone to new modalities.

Why now

Same principle, three ingredients that finally make it scale

Classical flows leaned on hand-designed coupling layers — expressive enough for densities, but their samples stayed behind GANs and diffusion. The architecture, not the principle, was the bottleneck. Three ingredients close the gap:

01
Deep–shallow Transformer flow

Then: shallow stacks of affine coupling / 1×1 convs — limited capacity, hard to scale.
Now: one deep autoregressive Transformer block carries most of the capacity (acting like a language model over tokens), plus a few cheap shallow blocks with alternating scan direction for local detail — parallelizable in the inverse.
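
A minimal sketch of an autoregressive affine flow with alternating scan direction, under loud assumptions: tokens are scalars, and a strictly lower-triangular linear map stands in for the causal Transformer. The names `make_block`, `block_forward`, `block_inverse`, `flow_forward`, and `flow_inverse` are hypothetical. In this toy version the data-to-noise pass is a single matrix product per block, while the noise-to-data pass fills in one token at a time.

```python
import numpy as np

T = 6  # number of tokens (hypothetical; each token is a scalar here)

def make_block(seed):
    """Strictly lower-triangular weights: token t sees only tokens < t."""
    r = np.random.default_rng(seed)
    W_mu = np.tril(0.5 * r.normal(size=(T, T)), k=-1)
    W_alpha = np.tril(0.1 * r.normal(size=(T, T)), k=-1)
    return W_mu, W_alpha

def block_forward(x, block):
    """Data -> noise for one block, all tokens at once."""
    W_mu, W_alpha = block
    mu, alpha = W_mu @ x, W_alpha @ x
    z = (x - mu) * np.exp(-alpha)
    return z, -alpha.sum()  # the Jacobian is triangular, so log|det| is a sum

def block_inverse(z, block):
    """Noise -> data for one block, token by token."""
    W_mu, W_alpha = block
    x = np.zeros_like(z)
    for t in range(len(z)):
        # Row t has non-zero weights only on already-filled positions < t.
        x[t] = z[t] * np.exp(W_alpha[t] @ x) + W_mu[t] @ x
    return x

blocks = [make_block(1), make_block(2)]

def flow_forward(x):
    h, total_log_det = x, 0.0
    for i, block in enumerate(blocks):
        if i % 2 == 1:
            h = h[::-1]  # alternate the scan direction; a flip has |det| = 1
        h, log_det = block_forward(h, block)
        total_log_det += log_det
    return h, total_log_det

def flow_inverse(z):
    h = z
    for i, block in reversed(list(enumerate(blocks))):
        h = block_inverse(h, block)
        if i % 2 == 1:
            h = h[::-1]  # undo the flip applied before this block
    return h

x = np.random.default_rng(0).standard_normal(T)
z, log_det = flow_forward(x)
x_rec = flow_inverse(z)
round_trip_error = float(np.max(np.abs(x - x_rec)))
```

Flipping the sequence between blocks lets the second block condition each token on the ones that came after it in the first block, so no position is left with an empty context in every layer.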

02
Noise-augmented training

Then: exact MLE on clean data overfits high-frequency detail and yields noisy samples.
Now: Gaussian noise augmentation during training, paired with a small post-hoc denoiser at sampling. Together they let the flow produce clean, sharp images while it stays an exact-likelihood model.
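
The train-noisy, denoise-after idea can be illustrated in one dimension. This sketch makes several assumptions that are not from the page: the clean data is Gaussian, the "flow" is a 1-D affine map whose maximum-likelihood fit is available in closed form, and the denoiser is a single Tweedie-style step that uses the model's own score. The denoiser described above may be a separate learned module; the score step is only a stand-in. All constants (`m`, `s`, `sigma`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 2.0, 0.5   # hypothetical mean and std of the clean data
sigma = 0.3       # hypothetical augmentation noise level

x = m + s * rng.standard_normal(100_000)        # clean data
y = x + sigma * rng.standard_normal(x.shape)    # noise-augmented training data

# Maximum-likelihood fit of a 1-D affine flow z = (y - mu_hat) / std_hat
# to the noisy data: the sample mean and the (biased) sample std.
mu_hat, std_hat = y.mean(), y.std()

def score(y):
    """d/dy log p(y) under the fitted model N(mu_hat, std_hat**2)."""
    return -(y - mu_hat) / std_hat ** 2

def denoise(y):
    """One Tweedie-style step: y + sigma**2 * score(y)."""
    return y + sigma ** 2 * score(y)

x_hat = denoise(y)
mse_noisy = float(np.mean((y - x) ** 2))
mse_denoised = float(np.mean((x_hat - x) ** 2))
```

In this Gaussian case the step moves each noisy point toward the posterior mean of the clean value, so `mse_denoised` comes out below `mse_noisy`, which is close to `sigma**2`.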

03
Classifier-free guidance

Then: no quality/diversity knob — flows generated unconditionally from the prior.
Now: guidance applied in the deep block (a new recipe for flows) trades diversity for fidelity, just like diffusion — pushing samples to diffusion-level quality from one MLE objective.
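
One way to picture guidance in an autoregressive affine flow is to extrapolate the per-token affine parameters away from the unconditional prediction. The sketch below uses the common linear-extrapolation form of classifier-free guidance; it is an assumption for illustration, and the recipe used in the papers may differ. The parameter values and the helper names `guide` and `sample_token` are hypothetical.

```python
import numpy as np

def guide(cond, uncond, w):
    """Linear extrapolation; w = 0 returns the conditional prediction."""
    return cond + w * (cond - uncond)

# Hypothetical per-token outputs of the conditional and unconditional passes.
mu_c, mu_u = np.array([0.8, -0.2]), np.array([0.5, 0.0])
log_sigma_c, log_sigma_u = np.array([-0.5, -0.3]), np.array([-0.1, -0.1])

z = np.array([1.0, -1.0])  # noise for this token

def sample_token(z, w):
    """Inverse affine step x = z * exp(log_sigma) + mu with guided parameters."""
    mu = guide(mu_c, mu_u, w)
    log_sigma = guide(log_sigma_c, log_sigma_u, w)
    return z * np.exp(log_sigma) + mu

x_plain = sample_token(z, 0.0)    # no guidance
x_guided = sample_token(z, 1.5)   # stronger conditioning
guided_scale = np.exp(guide(log_sigma_c, log_sigma_u, 1.5))
```

With these numbers the conditional scale is already smaller than the unconditional one, so guidance shrinks it further: samples concentrate around the conditional mean, trading diversity for fidelity.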

TARFlow: The core architecture, a Transformer autoregressive flow that models images directly, with diffusion-level samples from a stand-alone flow.
STARFlow: A deep–shallow design scales the flow in latent space to high-resolution, text-conditional image synthesis.
STARFlow-V: Causal roll-out extends the flow to video, giving an end-to-end, likelihood-based world model.
STARFlow2: One causal stream unifies the flow with a language model for multimodal understanding and generation.

Foundations

Standing on classical flows

The normalizing-flow lineage this program builds on — exact-likelihood models that stayed behind diffusion on sample quality, until the Transformer turn.

Timeline

All work

Every paper in the program, newest first. Entries tagged Related are work from the wider community.

NF-CoT arXiv 2026
Latent Reasoning with Normalizing Flows
SRC-Flow Related arXiv 2026
SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation
NTM arXiv 2026
Normalizing Trajectory Models
STARFlow2 arXiv 2026
Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
iTARFlow ICML 2026
Normalizing Flows with Iterative Denoising
NFM arXiv 2026
The Coupling Within: Flow Matching via Distilled Normalizing Flows
Bidirectional NF Related arXiv 2025
Bidirectional Normalizing Flow: From Data to Noise and Back
FAE CVPR 2026 · Findings
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
SimFlow Related arXiv 2025
SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows
Flowing Backwards Related arXiv 2025
Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
STARFlow-V CVPR 2026 · Highlight
End-to-End Video Generative Modeling with Normalizing Flows
FARMER Related arXiv 2025
FARMER: Flow AutoRegressive Transformer over Pixels
TarFlowLM NeurIPS 2025
Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
STARFlow NeurIPS 2025 · Spotlight, top 3%
Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Selective Jacobi Related arXiv 2025
Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding
GS-Jacobi Related arXiv 2025
Accelerate TarFlow Sampling with GS-Jacobi Iteration
Jet Related arXiv 2024
Jet: A Modern Transformer-Based Normalizing Flow
TARFlow ICML 2025 · Oral, top 1%
Normalizing Flows are Capable Generative Models
JetFormer Related arXiv 2024
JetFormer: An Autoregressive Generative Model of Raw Images and Text