Metadata
- Author: huggingface.co
- Full Title: Extending Transformer Layers as Painters to DiT’s
- URL: https://huggingface.co/blog/NagaSaiAbhinay/transformer-layers-as-painters-dit
Highlights
- The motivation for this experiment comes from “Transformer Layers as Painters” [1] by Sakana AI and Emergence AI, which suggests the existence of a common representation space among the layers of an LLM because of the residual connections.
- Transformer layers, or MM-DiT layers as they are referred to here, have two streams that handle text embeddings and image embeddings separately, while also sharing a joint attention mechanism.
- Single layers, or joint layers, process the encoder embeddings and image embeddings together, a.k.a. the single flow blocks in the Flux architecture listed above.
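To make the distinction concrete, here is a minimal PyTorch sketch of the two block types, assuming a simplified layout (no normalization, timestep modulation, or rotary embeddings): the dual-stream block keeps separate residual streams and MLPs for the text and image tokens but attends over them jointly, while the single-stream block operates on a single sequence. The class and argument names are illustrative, not the actual diffusers implementation.

```python
import torch
import torch.nn as nn

class DualStreamBlockSketch(nn.Module):
    """Illustrative MM-DiT-style block: joint attention, separate per-stream MLPs."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, hidden_states, encoder_hidden_states):
        # Joint attention over the concatenated text + image tokens.
        joint = torch.cat([encoder_hidden_states, hidden_states], dim=1)
        attn_out, _ = self.joint_attn(joint, joint, joint)
        txt_len = encoder_hidden_states.shape[1]
        txt_attn, img_attn = attn_out[:, :txt_len], attn_out[:, txt_len:]
        # Residual connections and MLPs stay separate per stream.
        encoder_hidden_states = encoder_hidden_states + txt_attn
        encoder_hidden_states = encoder_hidden_states + self.txt_mlp(encoder_hidden_states)
        hidden_states = hidden_states + img_attn
        hidden_states = hidden_states + self.img_mlp(hidden_states)
        return hidden_states, encoder_hidden_states

class SingleStreamBlockSketch(nn.Module):
    """Illustrative single-stream block: one sequence, one residual stream."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, hidden_states):
        attn_out, _ = self.attn(hidden_states, hidden_states, hidden_states)
        hidden_states = hidden_states + attn_out
        return hidden_states + self.mlp(hidden_states)
```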
- The layers are grouped, based on cosine similarity, into first layers, middle layers, and last layers, as in the paper.
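As a rough sketch of how such a grouping can be computed, the snippet below records each block’s output during a single forward pass and builds a layer-by-layer cosine-similarity matrix. The helper names and the choice of which tuple element to keep are assumptions, not the post’s code.

```python
import torch
import torch.nn.functional as F

def collect_block_outputs(blocks, run_forward):
    """Record each block's output via forward hooks. `run_forward` should execute one
    denoising step so every block fires exactly once, in order. Dual-stream blocks in
    diffusers return a tuple; which element is the image stream depends on the version,
    so adjust the index below if needed."""
    outputs = []
    hooks = [
        blk.register_forward_hook(
            lambda mod, inp, out: outputs.append(
                (out[0] if isinstance(out, tuple) else out).detach().float().cpu()
            )
        )
        for blk in blocks
    ]
    run_forward()
    for h in hooks:
        h.remove()
    return outputs

def layerwise_cosine_similarity(per_layer_hidden_states):
    """Return an [L, L] matrix of mean cosine similarities between layer outputs."""
    flat = [h.reshape(-1, h.shape[-1]) for h in per_layer_hidden_states]
    num_layers = len(flat)
    sim = torch.zeros(num_layers, num_layers)
    for i in range(num_layers):
        for j in range(num_layers):
            sim[i, j] = F.cosine_similarity(flat[i], flat[j], dim=-1).mean()
    return sim
```

Contiguous blocks of mutually high similarity in this matrix are what get treated as the middle group, with the dissimilar layers at either end forming the first and last groups.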
- The following layer execution strategies are used for the experiment:
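The strategy list itself is not captured in this highlight, but the strategies discussed in the highlights below (skipping layers, repeating a middle layer, reversing the middle group, and running it in parallel with averaged outputs) can be sketched as simple manipulations of an ordered block list. The function names and the ModuleList-based approach are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

def skip_layers(blocks, skip_indices):
    """Skip: drop the selected blocks entirely."""
    drop = set(skip_indices)
    return nn.ModuleList(b for i, b in enumerate(blocks) if i not in drop)

def middle_repeat(blocks, first, last):
    """Middle repeat: replace the middle group [first, last) with copies of its center block."""
    center = blocks[(first + last) // 2]
    middle = [copy.deepcopy(center) for _ in range(last - first)]
    return nn.ModuleList(list(blocks)[:first] + middle + list(blocks)[last:])

def reverse_middle(blocks, first, last):
    """Reverse: run the middle group back to front."""
    ordered = list(blocks)
    return nn.ModuleList(ordered[:first] + ordered[first:last][::-1] + ordered[last:])

class ParallelAveraged(nn.Module):
    """Parallel + average: run several blocks on the same input and average their outputs.
    Written for single-stream blocks that return a plain tensor."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, hidden_states, *args, **kwargs):
        outs = [blk(hidden_states, *args, **kwargs) for blk in self.blocks]
        return torch.stack(outs).mean(dim=0)
```

For the parallel variant, the middle slice would be replaced by a single wrapper, e.g. `nn.ModuleList(ordered[:first] + [ParallelAveraged(ordered[first:last])] + ordered[last:])`.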
- Flux shows the most prominent grouping of layers (based on activation cosine similarity), indicating the possibility of a common representation space, followed by AuraFlow. All three models, however, show grouping that indicates a common representation space.
- The layers before and after a group of layers seem to act as ‘translation’ layers, converting the model representation from one space to another. This is evidenced by the fact that removing preceding layers is catastrophic.
- Skipping some layers within a group degrades image quality the least compared to the other strategies. This is in line with the paper’s findings.
- Repeating the same layer from a group is the worst strategy (apart from removing the so-called ‘translation’ layers, which don’t belong to the group anyway).
- Repeatedly executing the layers in parallel and averaging their outputs is not catastrophic for layers responsible for prompt adherence, but it is catastrophic for layers that deal with aesthetic quality. The same holds for reversing the middle layers.
- The Flux architecture has two different transformer blocks. The one referred to here as the transformer block/layer is an MM-DiT block with two streams, one for the encoder hidden states and one for the hidden states. The single transformer block/layer is single-stream and acts on the hidden states alone. See the architecture [3].
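For reference, both block lists are exposed as attributes on the Flux transformer in diffusers, so the skipping experiments can be reproduced by swapping in a trimmed ModuleList. The attribute names below match the diffusers implementation at the time of writing, and the skipped indices are arbitrary examples:

```python
import torch
import torch.nn as nn
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)

mmdit_blocks = pipe.transformer.transformer_blocks          # dual-stream MM-DiT blocks
single_blocks = pipe.transformer.single_transformer_blocks  # single-stream blocks
print(len(mmdit_blocks), len(single_blocks))                # 19 and 38 for FLUX.1-dev

# Illustrative skip: drop a few middle dual-stream blocks (indices chosen arbitrarily).
keep = [block for i, block in enumerate(mmdit_blocks) if i not in range(8, 12)]
pipe.transformer.transformer_blocks = nn.ModuleList(keep)
```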
- Skipping the first MM-DiT block is not catastrophic, but it shows the role the block plays in converting (translating) the prompt into the representation space, while the last layer converts (translates) it into the space of the single layers.
- Skipping MM-DiT layers from the middle group affects the finer details of the image while retaining the broad concepts of the prompt (the pink glasses are present but on the dog, the robot is no longer made of felt, etc.).
- Skipping single layers affects the visual quality. There are two distinct middle-layer groupings in the Flux single layers: the first seems responsible for building the structural layout and broad details, whereas the following group deals with the finer details.
- Skipping the single layers preceding the middle group affects the aesthetics and results in visual hallucinations (multiple instances of the same subject, e.g. multiple parrots) and missing details (the bridge), which can indicate incorrect ‘translation’ of the prompt into details.
- Repeating the same layer multiple times is catastrophic. The paper theorizes that this is because it pushes the data out of the distribution the model was trained to handle.
- Reversing the MM-DiT layers retains some concepts from the prompt, but the details are completely lost. Reversing the single layers is catastrophic.
- Executing the middle layers in parallel and averaging their outputs is catastrophic.
- Based on this distinction of layers and the roles they seemingly play, a natural question is how applying LoRA to specific layers would affect training and image generation during inference. Edit: see https://x.com/__TheBen/status/1829554120270987740. It actually makes a difference! You don’t need to train LoRAs on all layers.
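As a hedged sketch of what layer-targeted LoRA training could look like with peft’s `LoraConfig`: the block slice and the `attn.to_q/to_k/to_v/to_out.0` module names below are assumptions based on the diffusers Flux layout, not the setup used in the linked thread.

```python
from peft import LoraConfig

# Hypothetical targeting: adapt only a slice of the dual-stream (MM-DiT) blocks and leave
# the single-stream blocks untouched. Adjust module names for other models or versions.
target_blocks = range(2, 10)  # arbitrary example slice of the 19 MM-DiT blocks
target_modules = [
    f"transformer_blocks.{i}.attn.{proj}"
    for i in target_blocks
    for proj in ("to_q", "to_k", "to_v", "to_out.0")
]

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=target_modules,
)

# `pipe` as loaded in the earlier Flux snippet; diffusers' PEFT integration attaches
# trainable LoRA weights only to the listed modules, so training touches just those layers.
pipe.transformer.add_adapter(lora_config)
```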