A 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective for state-of-the-art text-to-image generation.
NextStep-1 represents a significant advance in autoregressive image generation, introducing continuous-token processing that matches or surpasses traditional diffusion methods without requiring computationally heavy decoders.
Direct processing of continuous image tokens without vector quantization
Lightweight 157M flow matching head paired with 14B transformer model
State-of-the-art results in text-to-image synthesis and image editing
The NextStep-1 model introduces a paradigm shift in autoregressive image generation by processing continuous tokens directly through a flow matching head, eliminating the need for vector quantization, which traditionally introduces information loss. This approach enables the model to achieve state-of-the-art performance in text-to-image generation while maintaining strong capabilities in instruction-based image editing. The research demonstrates that continuous image tokens paired with flow matching can enable autoregressive transformers to rival diffusion systems in fidelity, while the unified next-token training approach provides robust performance across multiple image manipulation tasks.
The NextStep-1 architecture employs a causal transformer that processes mixed sequences of discrete text tokens and continuous image tokens. This unified approach allows the model to understand both textual descriptions and visual content within a single framework, enabling sophisticated text-to-image generation capabilities. The transformer reads these mixed token sequences and predicts the next element using specialized prediction heads optimized for each token type.
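To make this concrete, the sketch below shows one way such a per-token-type loss dispatch could look in PyTorch. It is an illustration of the idea rather than the model's actual implementation; the `lm_head` and `flow_matching_head` objects are assumed interfaces, and the flow matching loss itself is sketched after the next paragraph.

```python
import torch.nn.functional as F

def mixed_sequence_loss(hidden_states, is_text_next, text_targets, image_targets,
                        lm_head, flow_matching_head):
    """Route each position's hidden state to the prediction head matching its token type.

    hidden_states: (seq_len, d_model) transformer outputs, already shifted so that
                   position i predicts element i + 1 of the mixed sequence.
    is_text_next:  (seq_len,) boolean mask, True where the next element is a discrete text token.
    """
    text_mask = is_text_next
    image_mask = ~is_text_next

    # Discrete text tokens: standard next-token cross-entropy over the vocabulary.
    text_logits = lm_head(hidden_states[text_mask])            # (n_text, vocab_size)
    text_loss = F.cross_entropy(text_logits, text_targets)

    # Continuous image tokens: the flow matching head regresses a velocity field
    # conditioned on the transformer's hidden state (see the sketch below).
    image_loss = flow_matching_head.loss(hidden_states[image_mask], image_targets)

    return text_loss + image_loss
```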
The flow matching head represents a significant innovation in the model design, providing a lightweight 157M-parameter component that steers continuous image patches from noise toward target representations. The head predicts velocity fields that guide this transformation and is trained with a mean squared error loss to match the target flow from noise samples to the desired image patches. This approach maintains high-quality generation while reducing computational overhead.
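The following is a minimal sketch of such a head, assuming a small conditioning MLP and a linear (rectified-flow-style) interpolation path from noise to the target patch. The real 157M-parameter head is considerably larger, so treat the architecture and path choice here as assumptions rather than the published design.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Toy flow matching head: an MLP that predicts a velocity field for one
    continuous image token, conditioned on the transformer hidden state."""

    def __init__(self, d_model: int, token_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model + token_dim + 1, hidden),  # condition + noisy token + timestep
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def velocity(self, cond, x_t, t):
        # Predict the velocity at timestep t for the noisy token x_t, given the condition.
        return self.net(torch.cat([cond, x_t, t[:, None]], dim=-1))

    def loss(self, cond, x1):
        """Flow matching objective with a linear path x_t = (1 - t) * x0 + t * x1,
        whose target velocity is x1 - x0 (mean squared error on the predicted flow)."""
        x0 = torch.randn_like(x1)                        # noise sample
        t = torch.rand(x1.shape[0], device=x1.device)    # random timestep per token
        x_t = (1 - t[:, None]) * x0 + t[:, None] * x1
        v_target = x1 - x0
        v_pred = self.velocity(cond, x_t, t)
        return ((v_pred - v_target) ** 2).mean()
```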
The training process follows a five-stage approach: two pre-training stages, annealing, supervised fine-tuning, and direct preference optimization. Learning rates transition from constant to cosine schedules with carefully balanced loss functions. Training progresses from 256-pixel resolution to mixed 256- and 512-pixel images, incorporating text-only data, image-text pairs, and image-to-image editing examples to build comprehensive capabilities.
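For reference, the recipe above can be summarized as plain data. Only facts stated here are encoded; concrete hyperparameters (learning rates, step counts, data mix ratios) are not given in this overview and are deliberately omitted.

```python
# Compact summary of the training recipe described above (no numeric hyperparameters).
TRAINING_STAGES = [
    "pre-training (stage 1)",
    "pre-training (stage 2)",
    "annealing",
    "supervised fine-tuning",
    "direct preference optimization",
]
RESOLUTION_PROGRESSION = [(256,), (256, 512)]   # 256px first, then mixed 256/512
LR_SCHEDULES = ["constant", "cosine"]           # transitions from constant to cosine
DATA_SOURCES = ["text-only", "image-text pairs", "image-to-image editing examples"]
```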
NextStep-1 demonstrates exceptional performance in generating high-fidelity images across diverse categories including portraits, objects, animals, and complex scenes. The model excels at understanding detailed text descriptions and translating them into visually coherent and realistic images. The continuous token approach enables fine-grained control over visual elements, resulting in images that maintain consistency across multiple generations while preserving the specific details requested in text prompts.
The autoregressive nature of the model allows for progressive image construction, where each token contributes to the overall composition in a coherent manner. This approach differs significantly from traditional diffusion models by building images token by token rather than refining the entire image through iterative denoising. The result is more predictable and controllable image generation that maintains high visual quality across different subject matter and artistic styles.
The model provides sophisticated image editing capabilities through instruction-based commands, enabling users to modify existing images with natural language descriptions. This includes adding or removing objects, changing backgrounds, modifying materials and textures, applying motion effects, and performing style transfers. The editing process maintains consistency with the original image while applying the requested modifications accurately and naturally.
Free-form manipulation capabilities allow for nuanced editing where captions guide specific actions and locations within images. The model demonstrates consistent identity preservation and scene control across multiple edits, making it suitable for complex image modification workflows. This editing approach represents a significant advancement in making image manipulation accessible through natural language interfaces.
Generate high-quality images from detailed text descriptions
Modify existing images with instruction-based editing commands
Apply different artistic styles and visual transformations
Add, remove, or modify objects within generated images
Transform backgrounds while preserving subject integrity
Create dynamic visual effects and movement illusions
NextStep-1 achieves competitive results across multiple benchmarks, demonstrating its effectiveness in both image generation and editing tasks when compared to existing autoregressive and diffusion models.
The model demonstrates strong performance across alignment benchmarks including GenEval, GenAI-Bench, and DPG-Bench. NextStep-1 scores 0.63 on GenEval and, on GenAI-Bench, 0.88 for basic prompts and 0.67 for advanced prompts; on DPG-Bench it scores 85.28. Chain-of-thought prompting further improves the GenEval score to 0.73 and the GenAI-Bench scores to 0.90 (basic) and 0.74 (advanced).
Image editing performance is evaluated on GEdit-Bench in both English and Chinese, as well as on ImgEdit-Bench. In the English evaluation the model achieves 7.15 for semantic consistency, 7.01 for perceptual quality, and 6.58 overall; the Chinese evaluation yields 6.88, 7.02, and 6.40 respectively. An ImgEdit-Bench score of 3.71 positions the model among the strongest open-source editing systems.
Continuous token processing eliminates information loss from vector quantization
Single next-token prediction objective for both generation and editing
157M parameter flow matching head reduces computational requirements
Follow these steps to install and run NextStep-1 in your environment. The model is available through Hugging Face and requires specific dependencies for optimal performance.
Create a dedicated Python environment to avoid conflicts with existing packages. Python 3.11 is recommended for optimal compatibility with the model dependencies.
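One common way to set this up, assuming conda is installed (the environment name is arbitrary):

```bash
# Create and activate an isolated environment with the recommended Python version.
conda create -n nextstep python=3.11 -y
conda activate nextstep
```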
Clone the model repository and install dependencies. Setting the GIT_LFS_SKIP_SMUDGE environment variable prevents automatic download of large files during cloning.
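For example, assuming the weights are hosted in a Hugging Face repository such as stepfun-ai/NextStep-1-Large (check the official model card for the exact path):

```bash
# Clone without pulling large LFS files immediately, then install Python dependencies.
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/stepfun-ai/NextStep-1-Large
cd NextStep-1-Large
pip install -r requirements.txt
```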
Download the VAE checkpoint separately for image encoding and decoding functionality.
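A sketch using huggingface-cli; the VAE repository id below is a placeholder to be replaced with the one listed in the official documentation:

```bash
# Fetch the image tokenizer (VAE) weights into a local directory.
huggingface-cli download <vae-checkpoint-repo-id> --local-dir ./vae
```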
Import the necessary modules and load the model components. The tokenizer handles text processing while the model performs the actual generation.
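A minimal loading sketch in standard Hugging Face style; the repository id is illustrative, and `trust_remote_code=True` is assumed to be required because the model ships custom code:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "stepfun-ai/NextStep-1-Large"  # illustrative repository id

# The tokenizer handles text; the model holds the 14B transformer plus the flow matching head.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
```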
Create a generation pipeline and generate images from text descriptions. The pipeline handles the complete process from text encoding to image synthesis.
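A hypothetical end-to-end call; the pipeline class, its import path, and the generation arguments are assumptions about the repository's interface rather than a documented API, so consult the official examples for the exact names:

```python
# Hypothetical pipeline wrapper from the cloned repository (class name and path assumed).
from models.gen_pipeline import NextStepPipeline

pipeline = NextStepPipeline(tokenizer=tokenizer, model=model)

prompt = "A watercolor painting of a lighthouse on a rocky coast at sunset"
images = pipeline.generate_image(
    prompt,
    hw=(512, 512),             # output resolution (assumed argument)
    num_images_per_caption=1,  # number of samples to draw (assumed argument)
    cfg=7.5,                   # classifier-free guidance strength (assumed argument)
)
images[0].save("lighthouse.png")
```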
NextStep-1 contributes significantly to the computer vision and machine learning research community by demonstrating the viability of continuous token processing in autoregressive models. The research provides valuable insights into alternative approaches to image generation that may inform future model architectures and training methodologies. The work challenges conventional wisdom about the necessity of vector quantization in autoregressive vision models.
The model's capabilities make it suitable for various commercial applications including content creation, digital art generation, marketing material production, and interactive design tools. The instruction-based editing capabilities enable workflow integration for graphic designers, content creators, and digital artists. The unified approach to generation and editing simplifies tool development and user interfaces.
The flow matching architecture represents a novel approach to continuous token processing that could influence future model designs. The unified training objective that handles both discrete text and continuous image tokens provides a template for multimodal model development. The lightweight head design demonstrates efficient ways to add specialized capabilities to large transformer models.
The success of NextStep-1 opens several promising research directions including scaling to higher resolutions, extending to video generation, improving computational efficiency, and exploring additional modalities. The continuous token approach may prove applicable to other domains beyond image generation, potentially influencing developments in audio processing, 3D modeling, and other creative AI applications. The model architecture provides a foundation for future research into unified multimodal generation systems.