NextStep-1: The Future of Image Generation

A 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a single next-token prediction objective for state-of-the-art text-to-image generation.

NextStep-1 Image Generation Example

Transforming Autoregressive Image Generation

NextStep-1 represents a significant advancement in autoregressive image generation, introducing continuous token processing that matches or surpasses traditional diffusion methods without requiring computationally heavy decoders.

Continuous Token Processing

Direct processing of continuous image tokens without vector quantization

Flow Matching Architecture

Lightweight 157M flow matching head paired with 14B transformer model

High-Fidelity Generation

State-of-the-art results in text-to-image synthesis and image editing

Research Innovation

The NextStep-1 model introduces a paradigm shift in autoregressive image generation by processing continuous tokens directly through a flow matching head, eliminating the vector quantization step that traditionally introduces information loss. This approach enables state-of-the-art performance in text-to-image generation while maintaining strong capabilities in instruction-based image editing. The research demonstrates that continuous image tokens paired with flow matching enable autoregressive transformers to rival diffusion systems in fidelity, while the unified next-token training objective provides robust performance across multiple image manipulation tasks.

Technical Architecture

Causal Transformer Design

The NextStep-1 architecture employs a causal transformer that processes mixed sequences of discrete text tokens and continuous image tokens. This unified approach allows the model to understand both textual descriptions and visual content within a single framework, enabling sophisticated text-to-image generation capabilities. The transformer reads these mixed token sequences and predicts the next element using specialized prediction heads optimized for each token type.
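As a rough picture of this design, the following minimal PyTorch sketch mixes both token types in one causal sequence. All module names, layer counts, and dimensions here are illustrative stand-ins, not the released implementation.

import torch
import torch.nn as nn

class MixedTokenBackbone(nn.Module):
    def __init__(self, vocab_size=32000, d_model=1024, image_token_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)    # discrete text tokens
        self.image_proj = nn.Linear(image_token_dim, d_model)  # continuous image tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)          # predicts the next text token

    def forward(self, text_ids, image_tokens):
        # Concatenate embedded text and projected image tokens into one sequence.
        seq = torch.cat([self.text_embed(text_ids), self.image_proj(image_tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.transformer(seq, mask=causal, is_causal=True)
        # Text positions are read out through the LM head; image positions
        # would instead feed the flow matching head (see the next section).
        return hidden, self.lm_head(hidden)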

Flow Matching Integration

The flow matching head is a key innovation in the model design: a lightweight 157M-parameter component that steers continuous image patches from noise toward their target representations. The head predicts velocity fields that guide this transformation and is trained with a mean squared error objective to match the target flow from noise samples to the desired image patches. This design maintains high-quality generation while reducing computational overhead.
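A hedged sketch of what such a training step could look like under a rectified-flow-style parameterization follows; the head signature and the exact interpolation path are assumptions, not details confirmed by the paper.

import torch
import torch.nn as nn

def flow_matching_loss(head: nn.Module, hidden: torch.Tensor, target: torch.Tensor):
    """hidden: transformer state for this position; target: the clean image patch."""
    noise = torch.randn_like(target)                          # x_0 ~ N(0, I)
    t = torch.rand(target.size(0), 1, device=target.device)   # random time in [0, 1]
    x_t = (1.0 - t) * noise + t * target                      # linear path from noise to patch
    v_target = target - noise                                 # constant velocity along that path
    v_pred = head(x_t, t, hidden)                             # head predicts the velocity field
    return nn.functional.mse_loss(v_pred, v_target)           # mean squared error objective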

Training Methodology

The training process follows a comprehensive five-stage approach: initial training, secondary training, annealing, supervised fine-tuning, and direct preference optimization. The learning rate transitions from a constant to a cosine schedule, with the loss terms carefully balanced. Training progresses from 256-pixel resolution to mixed 256- and 512-pixel images, incorporating text-only data, image-text pairs, and image-to-image editing examples to build comprehensive capabilities.
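The curriculum can be summarized as a config sketch. The stage names follow the text above; every other value is a placeholder, not a reported hyperparameter.

# Illustrative outline of the five-stage curriculum; only the stage names
# come from the text, all other values are placeholders.
TRAINING_STAGES = [
    {"name": "initial_training",    "resolution": 256,        "lr_schedule": "constant"},
    {"name": "secondary_training",  "resolution": [256, 512], "lr_schedule": "cosine"},
    {"name": "annealing",           "resolution": [256, 512], "lr_schedule": "cosine"},
    {"name": "supervised_finetune", "resolution": [256, 512], "lr_schedule": "cosine"},
    {"name": "dpo",                 "resolution": [256, 512], "lr_schedule": "cosine"},
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: res={stage['resolution']}, lr={stage['lr_schedule']}")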

Model Specifications

Total Parameters: 14B + 157M
Architecture Type: Autoregressive
Token Processing: Continuous
Training Stages: 5
Maximum Resolution: 512x512
Flow Matching Head: 157M params

Comprehensive Generation Capabilities

Image Generation Excellence

NextStep-1 demonstrates exceptional performance in generating high-fidelity images across diverse categories including portraits, objects, animals, and complex scenes. The model excels at understanding detailed text descriptions and translating them into visually coherent and realistic images. The continuous token approach enables fine-grained control over visual elements, resulting in images that maintain consistency across multiple generations while preserving the specific details requested in text prompts.

The autoregressive nature of the model allows for progressive image construction, where each token contributes to the overall composition in a coherent manner. This approach differs significantly from traditional diffusion models by building images sequentially rather than through iterative denoising processes. The result is more predictable and controllable image generation that maintains high visual quality across different subject matters and artistic styles.
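To make the sequential construction concrete, here is a hedged sketch of decoding a single image token: the transformer state at the current position conditions the flow head, which is integrated from noise with a few Euler steps. It reuses the illustrative head signature from the earlier sketch; dimensions and step count are assumptions.

import torch

@torch.no_grad()
def sample_next_patch(head, hidden, token_dim=16, num_steps=20):
    """hidden: transformer state at the position being generated."""
    x = torch.randn(hidden.size(0), token_dim, device=hidden.device)  # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.size(0), 1), i * dt, device=x.device)
        x = x + dt * head(x, t, hidden)  # Euler step along the predicted velocity field
    return x  # the next continuous image token, appended to the sequence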

Advanced Editing Capabilities

The model provides sophisticated image editing capabilities through instruction-based commands, enabling users to modify existing images with natural language descriptions. This includes adding or removing objects, changing backgrounds, modifying materials and textures, applying motion effects, and performing style transfers. The editing process maintains consistency with the original image while applying the requested modifications accurately and naturally.

Free-form manipulation capabilities allow for nuanced editing where captions guide specific actions and locations within images. The model demonstrates consistent identity preservation and scene control across multiple edits, making it suitable for complex image modification workflows. This editing approach represents a significant advancement in making image manipulation accessible through natural language interfaces.
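As a rough illustration only: if the released pipeline accepts a reference image alongside the instruction, an editing call might look like the following. The images argument is a guess, not a confirmed part of the API.

from PIL import Image

# Hypothetical instruction-based edit; the keyword name is an assumption.
source = Image.open("portrait.png")
edited = pipeline.generate_image(
    "Change the background to a snowy mountain, keep the subject unchanged",
    images=[source],  # assumed keyword for the reference image
)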

Key Capabilities

Text-to-Image Generation

Generate high-quality images from detailed text descriptions

Image Editing

Modify existing images with instruction-based editing commands

Style Transfer

Apply different artistic styles and visual transformations

Object Manipulation

Add, remove, or modify objects within generated images

Background Changes

Transform backgrounds while preserving subject integrity

Motion Effects

Create dynamic visual effects and movement illusions

State-of-the-Art Performance

NextStep-1 achieves competitive results across multiple benchmarks, demonstrating its effectiveness in both image generation and editing tasks when compared to existing autoregressive and diffusion models.

Benchmark Results

The model demonstrates strong performance across alignment benchmarks including GenEval, GenAI-Bench, and DPG-Bench. On GenEval, NextStep-1 achieves a score of 0.63, while on GenAI-Bench it scores 0.88 on basic prompts and 0.67 on advanced prompts. On DPG-Bench it scores 85.28. Chain-of-thought prompting further improves the GenEval, GenAI-Bench basic, and GenAI-Bench advanced scores to 0.73, 0.90, and 0.74 respectively.

Image editing performance is evaluated on GEdit-Bench in both English and Chinese, as well as on ImgEdit-Bench. In English evaluations the model scores 7.15 for semantic consistency, 7.01 for perceptual quality, and 6.58 overall; Chinese evaluations yield 6.88, 7.02, and 6.40 respectively. An ImgEdit-Bench score of 3.71 positions the model among the strongest open-source editing systems.

Performance Metrics

GenEval Score: 0.63
GenAI-Bench Basic: 0.88
GenAI-Bench Advanced: 0.67
DPG-Bench: 85.28
Semantic Consistency: 7.15
Perceptual Quality: 7.01

Technical Advantages

1. No Quantization Loss

Continuous token processing eliminates information loss from vector quantization

2. Unified Training

Single next-token prediction objective for both generation and editing

3. Lightweight Head

157M parameter flow matching head reduces computational requirements

Getting Started with NextStep-1

Follow these steps to install and run NextStep-1 in your environment. The model is available through Hugging Face and requires specific dependencies for optimal performance.

Environment Setup

conda create -n nextstep python=3.11 -y
conda activate nextstep
pip install uv # optional

Create a dedicated Python environment to avoid conflicts with existing packages. Python 3.11 is recommended for optimal compatibility with the model dependencies.

Model Download

GIT_LFS_SKIP_SMUDGE=1 git clone \
https://huggingface.co/stepfun-ai/NextStep-1-Large
cd NextStep-1-Large
uv pip install -r requirements.txt

Clone the model repository and install dependencies. The GIT_LFS_SKIP_SMUDGE flag prevents automatic download of large files during cloning.

VAE Checkpoint

hf download stepfun-ai/NextStep-1-Large \
"vae/checkpoint.pt" --local-dir ./

Download the VAE checkpoint separately for image encoding and decoding functionality.

Basic Usage Example

import torch
from transformers import AutoTokenizer, AutoModel
from models.gen_pipeline import NextStepPipeline

HF_HUB = "stepfun-ai/NextStep-1-Large"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(HF_HUB)
# trust_remote_code is typically required for custom architectures hosted on the Hub
model = AutoModel.from_pretrained(HF_HUB, trust_remote_code=True)

Import the necessary modules and load the model components. The tokenizer handles text processing while the model performs the actual generation.

Image Generation

pipeline = NextStepPipeline(
    tokenizer=tokenizer, model=model
).to(device="cuda", dtype=torch.bfloat16)

prompt = "A realistic photograph of a cat"
image = pipeline.generate_image(prompt)

Create a generation pipeline and generate images from text descriptions. The pipeline handles the complete process from text encoding to image synthesis.
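Generation can usually be tuned further. The keyword arguments below are assumptions modeled on common text-to-image APIs, not confirmed parameters of generate_image.

# All keyword names below are assumptions, not confirmed API parameters.
image = pipeline.generate_image(
    prompt,
    hw=(512, 512),  # assumed: output resolution, matching the 512x512 maximum
    cfg=7.5,        # assumed: classifier-free guidance strength
    seed=42,        # assumed: fixed seed for reproducible outputs
)
image.save("cat.png")  # persist the generated PIL image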

System Requirements

GPU Memory: 16GB+ VRAM
System RAM: 32GB+
Python Version: 3.11+
CUDA Version: 11.8+

Research Impact and Applications

Academic Research

NextStep-1 contributes significantly to the computer vision and machine learning research community by demonstrating the viability of continuous token processing in autoregressive models. The research provides valuable insights into alternative approaches to image generation that may inform future model architectures and training methodologies. The work challenges conventional wisdom about the necessity of vector quantization in autoregressive vision models.

Industry Applications

The model's capabilities make it suitable for various commercial applications including content creation, digital art generation, marketing material production, and interactive design tools. The instruction-based editing capabilities enable workflow integration for graphic designers, content creators, and digital artists. The unified approach to generation and editing simplifies tool development and user interfaces.

Technical Innovation

The flow matching architecture represents a novel approach to continuous token processing that could influence future model designs. The unified training objective that handles both discrete text and continuous image tokens provides a template for multimodal model development. The lightweight head design demonstrates efficient ways to add specialized capabilities to large transformer models.

Future Directions

The success of NextStep-1 opens several promising research directions including scaling to higher resolutions, extending to video generation, improving computational efficiency, and exploring additional modalities. The continuous token approach may prove applicable to other domains beyond image generation, potentially influencing developments in audio processing, 3D modeling, and other creative AI applications. The model architecture provides a foundation for future research into unified multimodal generation systems.