HiDream-O1开源：8B参数像素级统一Transformer-阿里云开发者社区

引言

HiDream.ai开源了HiDream-O1-Image，一个8B参数的像素级统一生成基础模型。彻底抛弃传统的VAE和分离式文本编码器，将原始图像像素、文本token和任务条件映射到单一共享token空间，通过Unified Transformer（UiT）架构实现端到端的上下文视觉生成。仅8B参数即在GenEval、DPG、CVTG-2K等多项基准上进一步刷新SOTA。本次开源包括未蒸馏版（50步）和蒸馏版Dev变体（28步），以及推理驱动的提示代理（Reasoning-Driven Prompt Agent）。

开源地址：

ModelScope：

https://modelscope.cn/models/HiDream-ai/HiDream-O1-Image

GitHub：

https://github.com/HiDream-ai/HiDream-O1-Image

模型效果

Artificial Analysis竞技场排名

HiDream-O1-Image（代号：Peanut）在Artificial Analysis文本到图像竞技场中首次亮相即排名第8，超越Z-Image Turbo、Qwen-Image和FLUX.2 [Dev]，成为当前领先的开源权重文本到图像模型（2026-5-5）。

通用文本到图像生成

支持多种电影镜头语言、多样艺术风格和多面板故事板生成，分辨率最高达2,048×2,048。原生支持15种电影镜头和视角控制，涵盖远全景到特写的7种景别、高低角度等4种机位、以及正面到四分之三视角的4种朝向。

长文本渲染与布局控制

精准的多区域、多语言文本渲染能力，在CVTG-2K和LongText-Bench上达到或领先SOTA。支持海报、演示文稿、电商直播画面、信息图等复杂图文排版场景。

指令编辑与主体个性化

单一模型同时支持指令式图像编辑和主体驱动个性化生成。在UniSubject基准上，8B模型即在2-3主体、4-8主体和9-11主体三个配置中均取得强劲表现，200B+版本进一步超越GPT Image 2和Seedream-4.0。

模型架构

统一多模态token化

HiDream-O1-Image将所有输入统一编码为三种token类型：

文本Token：经提示代理精化后的文本指令，通过骨干网络的原生词表转换为离散token
条件Token：编辑源图或参考主体等条件图像，通过SigLip-2视觉编码器提取语义token，经可学习投影对齐到共享空间
生成Token：目标图像通过扩散过程构造噪声样本，分割为patch后经可学习patch嵌入映射到共享空间

三种token拼接后送入统一Transformer骨干进行联合上下文推理，最后通过线性预测头将输出token映射回干净图像patch。

混合统一注意力

条件token和文本token采用因果注意力，仅关注序列中前面的token，保持自回归特性。生成token采用全注意力，可关注所有token以捕获全局空间依赖。这一设计在统一Transformer中同时保留了语言建模的自回归结构和图像生成的空间一致性。

骨干网络

基于decoder-only Transformer架构，继承自大语言模型。8B版本从Qwen3-VL-8B-Instruct初始化，复用其多模态预对齐能力；200B+版本将像素级统一Transformer扩展至超过2000亿参数。两个版本均采用RMSNorm、SwiGLU和RoPE，扩散时间步编码为额外的特殊token。

训练目标

采用联合优化目标：Flow Matching损失用于图像预测，LPIPS损失和感知DINO损失提供感知监督，平衡结构回归与感知对齐。

训练流程

渐进式预训练（三阶段）

阶段	分辨率	训练任务
基础对齐	512×512	T2I + 语言建模 + 多模态理解，大batch处理数十亿图文对
通才上下文学习	1024×1024	扩展至图像编辑和主体个性化，集成Prompt Agent推理
高保真精炼	2048×2048	超高分辨率子集精炼细节和感知质量

全阶段保留原始宽高比，支持灵活多分辨率生成。

后训练（两阶段）

SFT：数十万高质量样本提升视觉美学和写实感，同步微调Prompt Agent的推理轨迹，将时间步采样从Logit-Normal改为均匀采样以增强后期去噪细节
RLHF：采用GRPO，聚合OCR准确率、美学评分、指令遵循和推理质量等多维奖励信号

蒸馏加速（Dev版本）

HiDream-O1-Image-Dev通过对抗扩散蒸馏将推理步数从50步压缩至28步。采用DMD目标对齐学生与教师的轨迹分布，配合标准扩散损失和对抗损失联合优化，在保持感知保真度的同时大幅提升推理速度。

模型推理与训练

使用官方仓库推理

环境准备：

git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image
pip install -r requirements.txt

建议安装flash-attn，如果您未安装（或无法安装）flash-attn，则必须编辑 models/pipeline.py 第 291 行，将 "use_flash_attn": True 改为 "use_flash_attn": False —— 否则推理时将因无法导入内核而失败。

模型推理：

文生图脚本：

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room." \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

单张图像编辑脚本：

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect

多张图的图像编辑脚本：

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A young boy with blonde hair stands on steps wearing light blue jeans, a white t-shirt with logo, and blue and white sneakers. He wears a brown cord necklace with beads, a black wristwatch with digital display, and carries a yellow fanny pack with white zipper. In his hand is a red boxing glove with white top, a teal plastic toy car, and a plastic toy figure of Captain America. He wears a straw hat with cream band. Natural light illuminates the scene." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg assets/IP/4.jpg assets/IP/5.jpg assets/IP/6.jpg assets/IP/7.jpg assets/IP/8.jpg assets/IP/9.jpg assets/IP/10.jpg \
    --output_image results/subject.png

使用Diffsynth-Studio推理

环境安装：

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .

推理脚本（包含文生图和图生图）：

import torch
from diffsynth.pipelines.hidream_o1_image import HiDreamO1ImagePipeline
from diffsynth.core.loader.config import ModelConfig
pipe = HiDreamO1ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="model-*.safetensors"),
    ],
    processor_config=ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="./"),
)
image = pipe(
    prompt="medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room.",
    negative_prompt=" ",
    cfg_scale=4.0,
    height=2048,
    width=2048,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")
image = pipe(
    prompt="change her clothes to blue",
    negative_prompt=" ",
    cfg_scale=4.0,
    height=2048,
    width=2048,
    seed=43,
    num_inference_steps=50,
    edit_image=[image],
)
image.save("image_edit.jpg")

使用Diffsynth-Studio训练LoRA模型

文生图LoRA训练：

modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "hidream_o1_image/HiDream-O1-Image/*" --local_dir ./data/diffsynth_example_dataset
accelerate launch examples/hidream_o1_image/model_training/train.py \
  --dataset_base_path data/diffsynth_example_dataset/hidream_o1_image/HiDream-O1-Image \
  --dataset_metadata_path data/diffsynth_example_dataset/hidream_o1_image/HiDream-O1-Image/metadata.csv \
  --max_pixels 4194304 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "HiDream-ai/HiDream-O1-Image:model-*.safetensors" \
  --processor_config "HiDream-ai/HiDream-O1-Image:./" \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --lora_rank 32 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/HiDream-O1-Image_lora" \
  --lora_base_model "dit" \
  --lora_target_modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj" \
  --use_gradient_checkpointing \
  --noise_scale 8.0

图像编辑LoRA训练：

modelscope download --dataset DiffSynth-Studio/diffsynth_example_dataset --include "qwen_image/Qwen-Image-Edit-2511/*" --local_dir ./data/diffsynth_example_dataset
accelerate launch examples/hidream_o1_image/model_training/train.py \
  --dataset_base_path data/diffsynth_example_dataset/qwen_image/Qwen-Image-Edit-2511 \
  --dataset_metadata_path data/diffsynth_example_dataset/qwen_image/Qwen-Image-Edit-2511/metadata.json \
  --data_file_keys "image,edit_image" \
  --extra_inputs "edit_image" \
  --max_pixels 4194304 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "HiDream-ai/HiDream-O1-Image:model-*.safetensors" \
  --processor_config "HiDream-ai/HiDream-O1-Image:./" \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --lora_rank 32 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/HiDream-O1-Image_lora" \
  --lora_base_model "dit" \
  --lora_target_modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj" \
  --use_gradient_checkpointing \
  --noise_scale 8.0

HiDream-O1开源：8B参数像素级统一Transformer