If generating unbelievably good images from plain text wasn’t already enough, the model can now generate higher-resolution images and can do really fancy stuff like use depth maps to produce novel images.
The U-Net in Stable Diffusion takes two inputs: encoded text (plain text processed into a format the network can understand) and a noisy array of numbers. Over many iterations, it gradually removes the noise from that array until what remains is an array containing image information. The output of the U-Net is then passed to another network, called a decoder, which creates the pictures we all see as output.
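
To make that data flow concrete, here is a toy sketch of the loop in plain Python. It is not the real model: `encode_text`, `toy_unet`, and `toy_decoder` are hypothetical stand-ins invented for illustration, and the update rule (subtract half of the predicted noise each step) is an assumption chosen for simplicity rather than the sampler Stable Diffusion actually uses. It only shows the shape of the process: text goes in, a noisy array is refined over many iterations, and a decoder turns the result into pixel values.

```python
import math
import random


def encode_text(prompt, dim=8):
    """Hypothetical stand-in for the text encoder: prompt -> list of numbers."""
    gen = random.Random(prompt)  # same prompt always gives the same encoding
    return [gen.gauss(0, 1) for _ in range(dim)]


def toy_unet(latent, text_embedding):
    """Hypothetical stand-in for the U-Net: estimates the noise in `latent`.

    Here the "noise" is simply the gap between the current array and a
    target derived from the text. A real U-Net learns this from data.
    """
    dim = len(text_embedding)
    return [x - text_embedding[i % dim] for i, x in enumerate(latent)]


def toy_decoder(latent, scale=2):
    """Hypothetical stand-in for the decoder: array -> pixel values in (0, 1).

    Each value is repeated `scale` times (a crude upsampling) and squashed
    with a sigmoid.
    """
    return [1 / (1 + math.exp(-x)) for x in latent for _ in range(scale)]


def generate(prompt, steps=20, latent_size=16, seed=0):
    gen = random.Random(seed)
    text_embedding = encode_text(prompt)
    # Start from pure noise.
    latent = [gen.gauss(0, 1) for _ in range(latent_size)]
    # Over many iterations, remove a bit of the predicted noise each time.
    for _ in range(steps):
        predicted_noise = toy_unet(latent, text_embedding)
        latent = [x - 0.5 * n for x, n in zip(latent, predicted_noise)]
    # The decoder turns the refined array into pixel values.
    return toy_decoder(latent)


pixels = generate("a photo of a cat")
print(len(pixels))
```

In the real model the array being refined is much larger and the three components are trained neural networks, but the order of operations matches the description above: encode the text, denoise iteratively with the U-Net, then decode.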