前言
前面我们了解了Hugging Face 核心库 Transformers 的使用方式,今天我们继续探索Hugging Face核心库 Diffusers 库的使用方式。对往期内容感兴趣的小伙伴也可以看往期:
- 【Hugging Face】Hugging Face Hub与Hugging Face CLI
- 【Hugging Face】Hugging Face Space空间的基本使用方式
- 【Hugging Face】Hugging Face数据集的基本使用
- 【Hugging Face】Hugging Face模型的基本使用
- 【Hugging Face】Hugging Face Transformers的使用方式
简介
Diffusers 是Hugging Face开源的 扩散模型(Diffusion Models)一站式工具箱,把最前沿的扩散相关论文 / 权重 封装成简单、可组合的 API,让你用几行代码就能做文生图、图生图、音频生成、视频生成、3D 生成、分子生成、数据增强等任务,也支持训练 / 微调自己的模型。
Diffusers官网文档:huggingface.co/docs/diffus…
Diffusers核心API
- 管道(Pipeline):高层封装的端到端接口,目的在于方便部署和完成具体任务,可以直接调用训练好的主流扩散模型来生成样本
- 模型(Model):训练新的扩散模型时需要用到的网络结构,比如 UNet
- 调度器(Scheduler):在推理过程中使用多种不同的算法从噪声中逐步生成图像,训练时也用来给图像添加噪声
安装
安装PyTorch
安装CPU版本
$ pip install torch torchvision torchaudio
安装NVIDIA GPU版本,更多安装方式可以查看PyTorch官网:pytorch.org/
$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
安装Diffusers
# Accelerate 加速模型加载,用于推理和训练
# Transformers 是运行最受欢迎的扩散模型(如 Stable Diffusion)所必需的
$ uv add diffusers accelerate transformers

或者

$ pip install diffusers accelerate transformers
基本使用
下面是 AI 生成的 Diffusers 文生图流程示意图,有助于我们理解整个操作流程。
Pipeline管道
在 Diffusers 中,Pipeline 是把模型、调度器、前后处理等组件“粘合”起来的一条生产线。首先来看一下 Pipeline 的默认使用方式。以 sd-dreambooth-library/disco-diffusion-style 文生图模型为例,新建一个 Colab 代码块:
from diffusers import DiffusionPipeline
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载管道,会先下载训练好的模型
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
pipeline.to(device)

# prompt,最好使用英文,中文效果不太好
prompt = "A cyberpunk-style building"

# 图片生成
image = pipeline(
    prompt,
    num_inference_steps=50,  # 迭代步数
    guidance_scale=7.5,      # 引导系数
).images[0]

# 展示图片
display(image)
点击运行,效果如下
Diffusers中还包含了很多Pipeline类型,下面是一些生图的Pipeline,感兴趣的小伙伴可以自行了解更多
- Text-To-Image:文生图,StableDiffusionPipeline
- Image-To-Image:图生图,StableDiffusionImg2ImgPipeline
- In-Painting:蒙版重绘,StableDiffusionInpaintPipeline
- Upscale Image:超分辨率(放大4倍),StableDiffusionUpscalePipeline(用法可参考本列表下方的示例草图)
- Pix-To-Pix:图像画风编辑,StableDiffusionInstructPix2PixPipeline
- Depth-To-Image:深度绘图,StableDiffusionDepth2ImgPipeline
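这些 Pipeline 的用法大同小异,这里以后文没有单独演示的超分辨率管道为例,给出一个简单的示意草图。注意其中的模型仓库名 stabilityai/stable-diffusion-x4-upscaler 和本地图片路径只是示意,实际使用时请以官方文档和自己的素材为准:

from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image
import torch

# 加载 4 倍超分管道(模型仓库名仅作示意,请以官方文档为准)
pipeline = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# 假设本地有一张低分辨率图片 low_res_cat.png(路径为示意)
low_res_img = load_image("low_res_cat.png").resize((128, 128))

# prompt 用来描述图片内容,帮助模型在放大时补全细节
upscaled_image = pipeline(prompt="a white cat", image=low_res_img).images[0]
upscaled_image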
DiffusionPipeline
使用DiffusionPipeline实例化模型时,如果模型没有下载,会自动下载模型
Diffusers 中包含多种管道,DiffusionPipeline 是用预训练的扩散系统进行推理的最简单方法。它是一个包含模型和调度器的端到端系统,可以直接使用 DiffusionPipeline 完成许多任务。
Pipeline使用 from_pretrained() 方法加载模型
from diffusers import DiffusionPipeline

# 加载模型,use_safetensors 表示使用更安全的 safetensors 权重格式,建议默认加上
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
print(pipeline)
DiffusionPipeline 会下载并缓存所有建模、分词和调度组件。打印 pipeline 对象可以发现,Stable Diffusion 管道由 UNet2DConditionModel 和 PNDMScheduler 等组件组成;DiffusionPipeline 只是一个入口抽象类,底层实际实例化的是 StableDiffusionPipeline。
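加载完成后,也可以直接访问管道内部的各个组件。下面是一个简单的小草图(假设上面的 pipeline 已加载完成;components 属性会以字典形式返回全部组件):

# 查看管道内部的各个组件
print(pipeline.unet.__class__.__name__)       # UNet2DConditionModel
print(pipeline.scheduler.__class__.__name__)  # PNDMScheduler
print(pipeline.vae.__class__.__name__)        # AutoencoderKL

# components 属性以字典形式返回全部组件,可用于拼装其他管道
print(pipeline.components.keys())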
使用Pipeline生成图像
image = pipeline("An image of a squirrel in Picasso style").images[0]
display(image)
点击运行,生图效果如下:
GPU加速
和普通的 PyTorch 模块一样,也可以把管道放到 GPU 上,为推理加速:
import torch
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
# 在GPU上运行管道
pipeline.to(device)
Pipeline参数
参数1:精度
默认情况下,DiffusionPipeline 使用完整 float32 精度进行 50 步推理,为了加速生图过程我们可以选择降低精度为 float16 或减少推理步数
import torch
from diffusers import DiffusionPipeline

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
参数2:提示词
提示词是生图主体的关键内容
# 提示词
prompt = "portrait photo of a old warrior chief"
image = pipeline(prompt).images[0]
参数3:随机种子
为了保证每次生成相同的图像、便于反复对比和改进,我们使用 Generator 并设置随机种子以实现可复现性。
import torch

generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator).images[0]
参数4:迭代步数
Stable Diffusion 模型默认使用 PNDMScheduler,通常需要~50 步推理,但像 DPMSolverMultistepScheduler 这样的性能更好的调度器,只需要 20 或 25 步推理。
from diffusers import DiffusionPipeline
from diffusers import DPMSolverMultistepScheduler
import torch

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)

# 加载调度器
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

image = pipeline(
    prompt,
    num_inference_steps=25,  # 迭代步数
).images[0]
参数5:反向提示词
正向/反向提示词分别用于描述画面中想要和不想要出现的内容。
negative_prompt = "low quality, bad anatomy, deformed, blurry"

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,  # 反向提示词
    height=height,
    width=width,
    num_inference_steps=num_inference_steps,
    generator=generator
).images[0]
参数6:图片宽高
图片宽高用于控制生图的尺寸大小
# 设置生成图像的参数
height = 512
width = 512

image = pipeline(
    prompt,
    height=height,  # 图片高
    width=width,    # 图片宽
).images[0]
参数7:提示词引导系数
提示词引导系数(guidance_scale)决定了生成结果与提示词的贴合程度:系数越大,生成效果越贴近提示词;系数为 1 左右时基本相当于不使用提示词引导。
- 1.0-3.0:引导很弱,接近自由生成
- 3.0-10.0:提示词与创意平衡,7.0-7.5 为常用默认区间,官方示例多用 7.5
- 10.0-15.0:强引导,适合需要精确还原 prompt 的场景
image = pipeline(
    prompt,
    guidance_scale=7.5,  # 提示词引导系数
).images[0]
enable_attention_slicing(节省内存)
其他配置不变,只加上 enable_attention_slicing
from diffusers import DiffusionPipeline
from diffusers import DPMSolverMultistepScheduler
import torch
from diffusers.utils import make_image_grid

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载pipeline
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
pipeline.to(device)

# 开启enable_attention_slicing节省内存
pipeline.enable_attention_slicing()

# 加载调度器
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

prompt = "portrait photo of a old warrior chief"

def get_inputs(batch_size=1):
    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}

images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, 2, 4)
一次生成了8张图片且没有报 OOM 错误
VAE
VAE 是负责压缩和还原图像的组件,用于节省显存并提升速度:它把高分辨率像素空间(512×512×3 ≈ 786k 个元素)压缩成低维潜空间(64×64×4 ≈ 16k 个元素),让扩散模型在更小、更快的空间里做“加噪 / 去噪”,最后再解码回真实图像。
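为了直观看到这个压缩比例,下面给出一个小草图:用随机张量冒充一张 512×512 的图片,分别走一遍 VAE 的 encode / decode,只观察张量形状的变化(模型仓库沿用前文的 stable-diffusion-v1-5,subfolder 写法与后文有条件生图示例一致):

import torch
from diffusers import AutoencoderKL

# 加载 Stable Diffusion v1.5 自带的 VAE
vae = AutoencoderKL.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="vae")

# 用随机张量模拟一张归一化后的 512x512 RGB 图片
pixel_values = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # 编码:像素空间 -> 潜空间,形状变为 (1, 4, 64, 64)
    latents = vae.encode(pixel_values).latent_dist.sample()
    print(latents.shape)

    # 解码:潜空间 -> 像素空间,形状还原为 (1, 3, 512, 512)
    decoded = vae.decode(latents).sample
    print(decoded.shape)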
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import make_image_grid
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载pipeline
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)

# 替换为单独微调过的VAE
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
pipeline.vae = vae
pipeline.to(device)

prompt = "portrait photo of a old warrior chief"

def get_inputs(batch_size=1):
    generator = [torch.Generator(device).manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}

images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, 2, 4)
AutoPipeline
AutoPipeline 类的设计旨在简化 Diffusers 中的各种管道。它是一个以任务为先的通用管道,让你只需关注任务本身(AutoPipelineForText2Image、AutoPipelineForImage2Image 和 AutoPipelineForInpainting),而无需关心具体的管道类,AutoPipeline 会自动检测并使用正确的管道类。
AutoPipelineForText2Image文生图示例
from diffusers import AutoPipelineForText2Image
import torch

pipe_txt2img = AutoPipelineForText2Image.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

prompt = "cinematic photo of Godzilla eating sushi with a cat in a izakaya, 35mm photograph, film, professional, 4k, highly detailed"

generator = torch.Generator(device="cpu").manual_seed(37)
image = pipe_txt2img(prompt, generator=generator).images[0]
image
AutoPipelineForImage2Image图生图示例
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch

pipe_img2img = AutoPipelineForImage2Image.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipe_img2img.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipe_img2img.enable_xformers_memory_efficient_attention()

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-text2img.png")

prompt = "cinematic photo of Godzilla eating burgers with a cat in a fast food restaurant, 35mm photograph, film, professional, 4k, highly detailed"

generator = torch.Generator(device="cpu").manual_seed(53)
image = pipe_img2img(prompt, image=init_image, generator=generator).images[0]
image
AutoPipelineForInpainting图片修复示例
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-img2img.png")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-mask.png")

prompt = "cinematic photo of a owl, 35mm photograph, film, professional, 4k, highly detailed"

generator = torch.Generator(device="cpu").manual_seed(38)
image = pipeline(prompt, image=init_image, mask_image=mask_image, generator=generator, strength=0.4).images[0]
image
模型
如果不确定要使用哪个模型,可以使用 AutoModel API 自动选择模型
在 Diffusers 中,扩散模型生成图片的过程是噪声扩散的逆过程(逐步去噪),而“模型”就是真正负责学习和预测噪声、或者压缩/还原图像的可学习网络。
Diffusers中有4类最常用的模型:
- UNet(条件/无条件):学习「当前带噪图 → 噪声残差」或「直接预测原图」,典型实例 UNet2DConditionModel、UNet2DModel,是默认的主干模型
- VAE/AutoencoderKL:把高维像素空间压缩到低维潜空间,节省计算;推理时再解码,典型实例 AutoencoderKL
- Text Encoder/CLIP:把提示词编码成「条件向量」供条件扩散模型使用,典型实例 CLIPTextModel
- ControlNet/LoRA:在不改原 UNet 权重的前提下,注入额外条件或微调,典型实例 ControlNetModel、LoRA Adapter
官网图片降噪示例
下面通过官方示例看一下完整的降噪过程。我对图片展示部分做了修改,其他代码和官方保持一致。
from diffusers import UNet2DModel, DDPMScheduler
import torch
import tqdm
import PIL.Image
import numpy as np
from diffusers.utils import make_image_grid

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载模型
model_id = "google/ddpm-cat-256"
model = UNet2DModel.from_pretrained(model_id)
# 将模型放在GPU上
model.to(device)

# 加载调度器
scheduler = DDPMScheduler.from_pretrained(model_id)

# 固定随机种子
torch.manual_seed(0)

# 生成与模型输入尺寸一致的随机噪声
noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
# 将噪声张量放到GPU上
noisy_sample = noisy_sample.to(device)
sample = noisy_sample

# Helper function to convert tensor to PIL Image
def convert_to_pil_image(sample_tensor):
    image_processed = sample_tensor.cpu().permute(0, 2, 3, 1)
    image_processed = (image_processed + 1.0) * 127.5
    image_processed = image_processed.numpy().astype(np.uint8)
    image_pil = PIL.Image.fromarray(image_processed[0])
    return image_pil

# 开始生成一只猫
images = []
for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
    # 1. predict noise residual
    with torch.no_grad():
        residual = model(sample, t).sample

    # 2. compute less noisy image and set x_t -> x_t-1
    sample = scheduler.step(residual, t, sample).prev_sample

    # 3. optionally look at image
    if (i + 1) % 50 == 0:
        pil_image = convert_to_pil_image(sample)
        images.append(pil_image)  # Append the PIL image to the list

# Display the collected images after the loop
make_image_grid(images, 5, 4)
运行完成,我们将得到图片的整个去噪过程图,虽然最终生成的效果不是很好看,但是大概也能看出整个图片生成的过程了
有条件/无条件模型
- UNet2DModel:无条件 2D UNet,只接收「带噪图 + 时间步」。
- UNet2DConditionModel:有条件 2D UNet,额外接收「文本/深度/姿态等条件向量」,是 Stable Diffusion、ControlNet 等的核心网络(两者前向调用的差异见下方草图)。
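下面用一个小草图对比两者的前向调用方式,与前面官方示例和后文有条件生图示例中的用法一致;张量内容用随机数模拟,只关注形状:

import torch
from diffusers import UNet2DModel, UNet2DConditionModel

# 无条件 UNet:只需要带噪图和时间步
uncond_unet = UNet2DModel.from_pretrained("google/ddpm-cat-256")
noisy = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    residual = uncond_unet(noisy, timestep=10).sample  # 输出与输入同形状的噪声残差

# 有条件 UNet:额外需要 encoder_hidden_states(例如 CLIP 文本向量)
cond_unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
latents = torch.randn(1, 4, 64, 64)            # 潜空间中的带噪样本
text_embeddings = torch.randn(1, 77, 768)      # 用随机张量模拟文本编码结果
with torch.no_grad():
    noise_pred = cond_unet(latents, 10, encoder_hidden_states=text_embeddings).sample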
模型通过 from_pretrained() 方法初始化,该方法还会在本地缓存模型权重,因此下次加载模型时会更快。
from diffusers import DDPMScheduler, UNet2DModel

repo_id = "google/ddpm-cat-256"
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
print(model.config)
不知道使用哪个模型类,也可以使用AutoModel
from diffusers import AutoModel

repo_id = "google/ddpm-cat-256"
model = AutoModel.from_pretrained(repo_id, use_safetensors=True)
print(model.config)
模型打印结果如下
模型参数:
- sample_size:输入样本的高度和宽度尺寸
- in_channels:输入样本的通道数
- down_block_types 和 up_block_types:用于创建 U-Net 架构的下采样和上采样块的类型
- block_out_channels:下采样块的输出通道数;也以相反的顺序用作上采样块的输入通道数
- layers_per_block:每个 U-Net 块中 ResNet 块的数量
无条件降噪生图的案例可以看上面的官方示例,这里重点了解一下有条件的生图过程:
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, UniPCMultistepScheduler
from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "CompVis/stable-diffusion-v1-4"

# vae
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", use_safetensors=True)
# 分词器
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
# 文本编码器
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", use_safetensors=True)
# 模型
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", use_safetensors=True)
# 调度器
scheduler = UniPCMultistepScheduler.from_pretrained(model_name, subfolder="scheduler")

# 加速推理
vae.to(device)
text_encoder.to(device)
unet.to(device)

prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
# Seed generator to create the initial latent noise
generator = torch.Generator(device).manual_seed(0)  # Ensure generator is on the correct device
batch_size = len(prompt)

# 对提示词进行分词
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]

max_length = text_input.input_ids.shape[-1]
# 填充标记的嵌入
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]

# 将条件嵌入和无条件嵌入连接成一个批次,以避免进行两次前向传递
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

print(device)

# 之所以能被“8”整除,是因为vae模型有3次下采样
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
    device=device,
)

# 按调度器的初始噪声标准差缩放初始噪声
latents = latents * scheduler.init_noise_sigma

# 循环时间步长
scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 图片解码
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image
示例中 CLIPTextModel 是文本编码器,用于把 tokens 编码为向量,进而控制扩散模型的生成;它接收的输入是经过 CLIPTokenizer 分词器处理的提示词文本,这就和我们上期了解到的 Transformers 联系起来了。这是手动实现有条件生图的过程,使用 Pipeline 则会大大简化生图流程。最后看下生图效果:
ControlNetModel
ControlNetModel通过在边缘图、深度图、分割图和姿态检测关键点等额外输入条件下对模型进行调节,从而在文本到图像生成方面提供了更高的控制度。
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"  # can also be a local path
controlnet = ControlNetModel.from_single_file(url)

url = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors"  # can also be a local path
pipeline = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet)
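加载完成后,生图时除了 prompt 还需要传入一张条件图(例如 Canny 边缘图)。下面是一个调用思路的草图,其中的边缘图路径 canny_edges.png 和提示词只是示意,实际使用时可以用 OpenCV 的 Canny 算子从任意图片提取边缘图:

import torch
from diffusers.utils import load_image

pipeline.to("cuda")

# 条件图:一张 Canny 边缘图(文件路径仅作示意)
canny_image = load_image("canny_edges.png")

prompt = "a futuristic building, high quality"
generator = torch.Generator("cuda").manual_seed(0)

# 通过 image 参数传入条件图,ControlNet 会按边缘结构约束构图
image = pipeline(prompt, image=canny_image, generator=generator, num_inference_steps=25).images[0]
image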
调度器
使用过 Stable Diffusion 的小伙伴应该听过“采样器”这个词,调度器其实就相当于 Stable Diffusion 中的采样器。调度器是非常重要的组件,不同的调度器在去噪速度和生成质量之间有不同的权衡,Pipeline 的默认调度器是 PNDMScheduler。
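在更换调度器之前,可以先查看当前管道默认用的是哪个调度器,以及有哪些可以互换的调度器类。下面是一个小草图(模型仓库沿用前文的 stable-diffusion-v1-5;compatibles 属性会列出与当前管道兼容的调度器):

from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)

# 查看当前调度器及其配置
print(pipeline.scheduler)

# compatibles 列出可互换的调度器类,可从中挑选后用 from_config 替换
print(pipeline.scheduler.compatibles)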
Euler调度器
from diffusers import DiffusionPipeline, EulerDiscreteScheduler
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载管道,会先下载训练好的模型
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# 添加调度器
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)

# prompt,最好使用英文,中文效果不太好
prompt = "A cyberpunk-style building"

# 图片生成
image = pipeline(
    prompt,
    num_inference_steps=50,  # 迭代步数
    guidance_scale=7.5,      # 引导系数
).images[0]

# 展示图片
display(image)
EulerA调度器
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载管道,会先下载训练好的模型
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# 添加调度器
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)

# prompt,最好使用英文,中文效果不太好
prompt = "A cyberpunk-style building"

# 图片生成
image = pipeline(
    prompt,
    num_inference_steps=50,  # 迭代步数
    guidance_scale=7.5,      # 引导系数
).images[0]

# 展示图片
display(image)
DPM++2M调度器
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载管道,会先下载训练好的模型
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# 添加调度器
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)

# prompt,最好使用英文,中文效果不太好
prompt = "A cyberpunk-style building"

# 图片生成
image = pipeline(
    prompt,
    num_inference_steps=50,  # 迭代步数
    guidance_scale=7.5,      # 引导系数
).images[0]

# 展示图片
display(image)
DPM++ 2M Karras调度器(推荐)
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 加载管道,会先下载训练好的模型
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# 添加调度器,use_karras_sigmas=True 启用 Karras 噪声时间表
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)
pipeline.to(device)

# prompt,最好使用英文,中文效果不太好
prompt = "A cyberpunk-style building"

# 图片生成
image = pipeline(
    prompt,
    num_inference_steps=50,  # 迭代步数
    guidance_scale=7.5,      # 引导系数
).images[0]

# 展示图片
display(image)
训练扩散模型
第1步:训练配置
创建一个 TrainingConfig 来包含训练参数
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    image_size = 128  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    save_model_epochs = 30
    mixed_precision = "fp16"  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = "ddpm-butterflies-128"  # the model name locally and on the HF Hub

    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_model_id = "<your-username>/<my-awesome-model>"  # the name of the repository to create on the HF Hub
    hub_private_repo = None
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0

config = TrainingConfig()
第2步:加载数据集,调整训练数据
加载训练数据集
from datasets import load_dataset

config.dataset_name = "huggan/smithsonian_butterflies_subset"
dataset = load_dataset(config.dataset_name, split="train")
可视化数据集中的图片
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for i, image in enumerate(dataset[:4]["image"]):
    axs[i].imshow(image)
    axs[i].set_axis_off()
fig.show()
对图片尺寸进行调整
from torchvision import transforms

preprocess = transforms.Compose(
    [
        transforms.Resize((config.image_size, config.image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
将图像转换为 RGB 并应用上面的预处理
def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

dataset.set_transform(transform)
用 DataLoader 包装数据集,供训练时按批次读取
import torch

train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True)
第3步:创建UNet2DModel
from diffusers import UNet2DModel

model = UNet2DModel(
    sample_size=config.image_size,  # the target image resolution
    in_channels=3,  # the number of input channels, 3 for RGB images
    out_channels=3,  # the number of output channels
    layers_per_block=2,  # how many ResNet layers to use per UNet block
    block_out_channels=(128, 128, 256, 256, 512, 512),  # the number of output channels for each UNet block
    down_block_types=(
        "DownBlock2D",  # a regular ResNet downsampling block
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
        "AttnDownBlock2D",  # a ResNet downsampling block with spatial self-attention
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D",  # a regular ResNet upsampling block
        "AttnUpBlock2D",  # a ResNet upsampling block with spatial self-attention
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
    ),
)

sample_image = dataset[0]["images"].unsqueeze(0)
print("Input shape:", sample_image.shape)
print("Output shape:", model(sample_image, timestep=0).sample.shape)
第4步:创建一个调度器
import torch
from PIL import Image
from diffusers import DDPMScheduler

# 调度器
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

noise = torch.randn(sample_image.shape)
timesteps = torch.LongTensor([50])

# 调度器为图片添加噪声
noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps)

# 查看噪声图
Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0])
计算训练损失
import torch.nn.functional as F

noise_pred = model(noisy_image, timesteps).sample
loss = F.mse_loss(noise_pred, noise)
第5步:训练模型
创建优化器和学习率调度器
from diffusers.optimization import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=config.lr_warmup_steps,
    num_training_steps=(len(train_dataloader) * config.num_epochs),
)
创建评估函数
from diffusers import DDPMPipeline
from diffusers.utils import make_image_grid
import os

def evaluate(config, epoch, pipeline):
    # Sample some images from random noise (this is the backward diffusion process).
    # The default pipeline output type is `List[PIL.Image]`
    images = pipeline(
        batch_size=config.eval_batch_size,
        generator=torch.Generator(device='cpu').manual_seed(config.seed),  # Use a separate torch generator to avoid rewinding the random state of the main training loop
    ).images

    # Make a grid out of the images
    image_grid = make_image_grid(images, rows=4, cols=4)

    # Save the images
    test_dir = os.path.join(config.output_dir, "samples")
    os.makedirs(test_dir, exist_ok=True)
    image_grid.save(f"{test_dir}/{epoch:04d}.png")
from accelerate import Accelerator
from huggingface_hub import create_repo, upload_folder
from tqdm.auto import tqdm
from pathlib import Path
import os

def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
    # Initialize accelerator and tensorboard logging
    accelerator = Accelerator(
        mixed_precision=config.mixed_precision,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        log_with="tensorboard",
        project_dir=os.path.join(config.output_dir, "logs"),
    )
    if accelerator.is_main_process:
        if config.output_dir is not None:
            os.makedirs(config.output_dir, exist_ok=True)
        if config.push_to_hub:
            repo_id = create_repo(
                repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True
            ).repo_id
        accelerator.init_trackers("train_example")

    # Prepare everything
    # There is no specific order to remember, you just need to unpack the
    # objects in the same order you gave them to the prepare method.
    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )

    global_step = 0

    # Now you train the model
    for epoch in range(config.num_epochs):
        progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process)
        progress_bar.set_description(f"Epoch {epoch}")

        for step, batch in enumerate(train_dataloader):
            clean_images = batch["images"]
            # Sample noise to add to the images
            noise = torch.randn(clean_images.shape, device=clean_images.device)
            bs = clean_images.shape[0]

            # Sample a random timestep for each image
            timesteps = torch.randint(
                0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device,
                dtype=torch.int64
            )

            # Add noise to the clean images according to the noise magnitude at each timestep
            # (this is the forward diffusion process)
            noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

            with accelerator.accumulate(model):
                # Predict the noise residual
                noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
                loss = F.mse_loss(noise_pred, noise)
                accelerator.backward(loss)

                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

            progress_bar.update(1)
            logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step}
            progress_bar.set_postfix(**logs)
            accelerator.log(logs, step=global_step)
            global_step += 1

        # After each epoch you optionally sample some demo images with evaluate() and save the model
        if accelerator.is_main_process:
            pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

            if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
                evaluate(config, epoch, pipeline)

            if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
                if config.push_to_hub:
                    upload_folder(
                        repo_id=repo_id,
                        folder_path=config.output_dir,
                        commit_message=f"Epoch {epoch}",
                        ignore_patterns=["step_*", "epoch_*"],
                    )
                else:
                    pipeline.save_pretrained(config.output_dir)
启动训练
from accelerate import notebook_launcher

args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)

notebook_launcher(train_loop, args, num_processes=1)
第6步:验证效果
import glob

sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png"))
Image.open(sample_images[-1])
训练 50 轮耗时太久,测试时我只训练了 5 轮,下面是训练 5 轮的效果。
友情提示
见原文:【Hugging Face】Hugging Face Diffusers的使用方式
本文同步自微信公众号 "程序员小溪" ,这里只是同步,想看及时消息请移步我的公众号,不定时更新我的学习经验。