Most of this guide is geared toward training LoRA models on SD3.5 Large. However, I also provide configurations that will allow you to perform full fine-tuning (with the help of DeepSpeed). This is still fairly experimental, but I can at least help you get the training up and running. The environment setup and training script should remain agnostic of which model (SD3.5 M/L) is being trained.
Once again, out of the tools available, I’ve chosen to go with the SimpleTuner toolkit from bghira, as it gave me the best results. As such, I won’t be covering tools from kohya-ss (sd-scripts), Nerogar (OneTrainer), or Hugging Face (diffusers).
Just for reference, SimpleTuner uses the diffusers library as a backend, and that’s how I was able to fine-tune SD3.5 Large using a recent commit of SimpleTuner and a custom version of diffusers.
If you need help automatically pre-cropping your images, this is a lightweight, barebones script I wrote to do it (a minimal sketch of the same idea follows the list below). It will find the best crop depending on:
Is there a human face in the image? If so, we’ll do the cropping oriented around that region of the image.
If no human face is detected, we’ll do the cropping using a saliency map, which detects the most interesting region of the image. A best crop is then extracted, centered on that region.
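For illustration only, here is a minimal sketch of that face-first, saliency-fallback cropping logic. It is not the author's script: it assumes opencv-contrib-python is installed (the saliency module lives in the contrib package), and the function name and crop_size parameter are made up for the example.

```python
import cv2
import numpy as np

def best_crop(image_path, crop_size=1024):
    """Return a square crop centered on a face if one is found,
    otherwise on the most salient region of the image."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]

    # 1) Try to find a human face (Haar cascade ships with OpenCV).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    if len(faces) > 0:
        # Center the crop on the largest detected face.
        x, y, fw, fh = max(faces, key=lambda f: f[2] * f[3])
        cx, cy = x + fw // 2, y + fh // 2
    else:
        # 2) Fall back to a saliency map (requires opencv-contrib-python).
        saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
        _, sal_map = saliency.computeSaliency(img)
        cy, cx = np.unravel_index(np.argmax(sal_map), sal_map.shape)

    # Clamp the crop window so it stays inside the image bounds.
    half = crop_size // 2
    x0 = int(np.clip(cx - half, 0, max(w - crop_size, 0)))
    y0 = int(np.clip(cy - half, 0, max(h - crop_size, 0)))
    return img[y0:y0 + crop_size, x0:x0 + crop_size]
```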
I’ll be showing results from several fine-tuned LoRA models trained on datasets of varying size to demonstrate that the settings I chose generalize well enough to be a good starting point for fine-tuning LoRA.
repeats duplicates your images (and optionally rotates them, changes the hue/saturation, etc.) along with their captions, which helps generalize the style into the model and prevents overfitting. While SimpleTuner supports caption dropout (randomly dropping captions a specified percentage of the time), it doesn’t support shuffling tokens (tokens are roughly the words in the caption) at the moment. However, you can simulate the behavior of kohya’s sd-scripts, where tokens are shuffled while the first n tokens stay fixed at the beginning. Doing so helps keep the model from fixating on extraneous tokens.
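To make that kohya-style behavior concrete, here is a small illustrative helper (my naming, not part of SimpleTuner or sd-scripts) that shuffles comma-separated caption tokens while pinning the first n tokens in place; you would apply something like this to your caption files before training.

```python
import random

def shuffle_caption(caption: str, keep_tokens: int = 1, seed=None) -> str:
    """Shuffle comma-separated caption tokens, keeping the first
    `keep_tokens` tokens fixed at the start (kohya-style shuffling)."""
    tokens = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tokens[:keep_tokens], tokens[keep_tokens:]
    random.Random(seed).shuffle(tail)
    return ", ".join(head + tail)

# Example:
# shuffle_caption("fantasy art, a castle on a cliff, dramatic lighting, oil painting")
# -> "fantasy art, oil painting, a castle on a cliff, dramatic lighting" (tail order varies)
```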
There are 476 images in the fantasy art dataset; on top of that, the 5 repeats from multidatabackend.json multiply the effective dataset size. I chose a train_batch_size of 6 for two reasons:
This value would let me see the progress bar update every second or two.
It’s large enough that each iteration takes in 6 samples, which encourages better generalization during the training process.
If I wanted 30 epochs or so, the final calculation would be this:
$$\text{Max training steps} = \left(\frac{476 \times 5}{6}\right) \times 30$$
This equals 11,900 steps, more or less.
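As a quick sanity check, here is the same arithmetic in Python, using the numbers from the fantasy art run above:

```python
dataset_size = 476      # images in the fantasy art dataset
repeats = 5             # "repeats" from multidatabackend.json
train_batch_size = 6
epochs = 30

steps_per_epoch = dataset_size * repeats / train_batch_size   # ~396.7
max_train_steps = round(steps_per_epoch * epochs)
print(max_train_steps)  # 11900
```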
Personally, I received very satisfactory results using a higher LoRA rank and alpha. You can watch the more recent videos on my YouTube channel for a more precise, heuristic breakdown of how image fidelity increases as you raise the LoRA rank (in my opinion).
Anyway, if you don’t have the VRAM, storage capacity, or time to go that high, you can choose a lower value such as 256 or 128.
With a batch size of 6 and a LoRA rank/alpha of 768, training consumes about 32 GB of VRAM.
Understandably, this is out of range for consumer 24 GB GPUs. As such, I tried to decrease the memory cost by using a batch size of 1 and a LoRA rank/alpha of 128.
To be safe, you might have to decrease the LoRA rank/alpha even further, to 64. If so, you’ll consume around 18.83 GB of VRAM during training.
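For reference, the lower-VRAM setup boils down to a handful of overrides in the SimpleTuner config. The snippet below is only an illustration of those values; the option names (train_batch_size, lora_rank, lora_alpha) are assumptions on my part and should be verified against the config of the SimpleTuner version you’re running.

```python
# Illustrative low-VRAM overrides for the SimpleTuner config.
# Option names are assumptions -- double-check them against SimpleTuner's config examples.
low_vram_overrides = {
    "train_batch_size": 1,   # single sample per iteration
    "lora_rank": 64,         # dropped from 768 to fit roughly 19 GB of VRAM
    "lora_alpha": 64,        # kept equal to the rank, as elsewhere in this guide
}
```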
These are the figures I received from my fantasy art LoRA training. Loss is decreasing and hasn’t converged yet. However, as anyone with some experience fine-tuning diffusion models knows, minimizing loss has almost nothing to do with maximizing aesthetics. I also noticed that near the peaks of the loss curve, pixelation or degradation in validation images may occur when using a high learning rate. This makes sense, as training reaches a learning rate that the model weights aren’t comfortable with.
Once you’ve found the LoRA checkpoint that gives you the best aesthetic results, you can further improve it with APG scaling, which stands for adaptive projected guidance.
Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process.
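To give a sense of the mechanism, here is a bare-bones sketch of APG's core projection step as I understand it from the paper: the CFG update is split into components parallel and orthogonal to the conditional prediction, and the parallel (oversaturating) part is down-weighted. This is my simplified illustration, not the paper's reference implementation; it omits the norm-rescaling and negative-momentum terms.

```python
import torch

def apg_update(pred_cond: torch.Tensor, pred_uncond: torch.Tensor,
               guidance_scale: float, eta: float = 0.0) -> torch.Tensor:
    """Simplified adaptive projected guidance (APG) step.

    Splits the CFG difference into components parallel and orthogonal to the
    conditional prediction, keeping only a fraction `eta` of the parallel
    component (the part associated with oversaturation)."""
    diff = (pred_cond - pred_uncond).flatten(1)
    v = pred_cond.flatten(1)

    # Per-sample projection of the CFG difference onto the conditional prediction.
    scale = (diff * v).sum(dim=1, keepdim=True) / (v * v).sum(dim=1, keepdim=True).clamp_min(1e-12)
    parallel = scale * v
    orthogonal = diff - parallel

    update = orthogonal + eta * parallel
    return pred_cond + (guidance_scale - 1.0) * update.view_as(pred_cond)
```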
A commonly believed heuristic that we verified once again during the construction of the SD3.5 family of models is that later/higher layers (i.e. 30-37) impact tertiary details more heavily. Conversely, earlier layers (i.e. 12-24) influence the overall composition/primary form more.
In preliminary testing, we observed that freezing the last few layers of the architecture significantly improved model training when using a photorealistic dataset, preventing the detail degradation that a small dataset can introduce.
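As a rough sketch of what "freezing the last few layers" can look like in practice (assuming the diffusers SD3Transformer2DModel, whose transformer_blocks list has 38 blocks, indices 0-37, for SD3.5 Large), one might do something like this before handing the model to the trainer:

```python
import torch
from diffusers import SD3Transformer2DModel

# Load only the MMDiT transformer of SD3.5 Large.
transformer = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Freeze the last few joint transformer blocks (e.g. blocks 30-37),
# the layers most associated with tertiary detail.
num_frozen = 8
for block in transformer.transformer_blocks[-num_frozen:]:
    for param in block.parameters():
        param.requires_grad_(False)
```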
To dampen any possible degradation of anatomy, training only the attention layers and not the adaptive linear layers could help. For reference, below is one of the transformer blocks.