Stable Diffusion: From Description to Visualization
Namaste, welcome to another short but informative post. Here, we will have a brief introduction to Stable Diffusion. We will not go into the maths or the detailed architecture, so it will be useful and enjoyable even if you have zero knowledge of Deep Learning. You will also get practical examples of Stable Diffusion at the end. Let’s start.
Introduction:
Stable Diffusion is based on the concept of “Super-Resolution”. In super-resolution, we train a deep learning model that can denoise a noisy input image and generate a high-resolution image at the output. The model hallucinates the visual details that were most likely present in the input, based on the distribution of its training data.
What happens if we simply run such a model on pure noise? The model then begins to “denoise the noise” and hallucinates brand-new visual content. By repeatedly applying this process, a small patch of noise can be turned into an artificial image of higher and higher resolution. [1] This is the core idea of latent diffusion.
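To make the idea concrete, here is a toy sketch of that loop (purely illustrative; the fake “denoiser” below stands in for a trained model, which would instead predict the noise to remove at each step):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def fake_denoise(latent, target, strength=0.1):
    # Toy stand-in for a trained model: nudge the latent toward a target.
    # A real diffusion model instead predicts (and subtracts) noise.
    return latent + strength * (target - latent)

target = rng.normal(size=(64, 64))  # stands in for the image the model "hallucinates"
latent = rng.normal(size=(64, 64))  # start from pure noise
for step in range(50):              # repeatedly apply the denoiser
    latent = fake_denoise(latent, target)
```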
Stable Diffusion:
We will now move from latent diffusion to a text-to-image system (Stable Diffusion). In Stable Diffusion, the input is natural language text instead of a noisy image. One essential element is still missing: the ability to control the generated visual output using the input text. Therefore, we concatenate the noise patch with the vector that represents the input text, then train the model on an image-captioning dataset.
Figure 1 represents an overview of the Stable Diffusion model.
a) Text Encoder: It takes the user’s input text and converts it into vector form. Here, we pass “matar paneer” as the input.
b) Random Noise Generator: The RNG generates a noise patch of size N*N. We mix this noise with the vector representation of the input text.
c) Diffusion Model: It denoises the N*N latent image patch in a loop (loop count set to 50), as shown in Figure 1.
d) Decoder: It converts the final N*N latent image patch into a high-resolution output image of size M*M, where M > N.
Here, N = 64, and M = 512.
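If you are curious, KerasCV exposes these building blocks directly. A minimal sketch to peek at them (attribute names assume a recent keras_cv release; check your installed version):

```python
import keras_cv

# Build the pre-trained pipeline and inspect its main components
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
model.text_encoder.summary()     # a) text encoder
model.diffusion_model.summary()  # c) denoising diffusion model
model.decoder.summary()          # d) latent-to-image decoder
```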
Execution Details:
You are not required to train a Stable Diffusion model yourself. Many pre-trained models are available for direct use, and we will use one of them as an example. We will use the KerasCV library to run the Stable Diffusion demo.
Step 01: Go to https://keras.io/guides/keras_cv/generate_images_with_stable_diffusion/
Step 02: Click on the “view in colab” link.
Step 03: After clicking on the Colab link, you will land on the Google Colaboratory platform. Next, you just need to execute the following four cells one by one.
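For reference, the four cells look roughly like this (a sketch based on the linked KerasCV guide; the exact code in the current notebook may differ slightly):

```python
# Cell 1: install KerasCV (the notebook may pin specific versions)
!pip install keras-cv --upgrade --quiet

# Cell 2: imports
import keras_cv
import matplotlib.pyplot as plt

# Cell 3: build the pre-trained Stable Diffusion model
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# Cell 4: generate and display images from a text prompt
# (num_steps defaults to 50 denoising iterations)
images = model.text_to_image("<< input text >>", batch_size=3)

def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")

plot_images(images)
```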
Note: Don’t forget to replace << input text >> with an actual text prompt, for example: “A man sitting on the fish”. Try it and see the magic.
I have executed the above code with some interesting text inputs, and I am sharing the results here.
1. Input Text: “matar paneer”
2. Input Text: “Shri Ram Mandir”
3. Input Text: “monkey in the river”
4. A more complex one: try this in the 5th code cell (a sketch follows below).
Input Text: “cute magical flying elephant, fantasy art, golden color, high quality, highly detailed, elegant, sharp focus, concept art, sharp edge, digital painting, mystery, horror”
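In the notebook, such a long prompt is typically written as a multi-line Python string (which is where stray quotation marks can sneak in when it is copied as plain text). A sketch, assuming the model and plot_images from the cells above:

```python
# 5th cell: a longer, more detailed prompt
images = model.text_to_image(
    "cute magical flying elephant, fantasy art, golden color, "
    "high quality, highly detailed, elegant, sharp focus, "
    "concept art, sharp edge, digital painting, mystery, horror",
    batch_size=3,
)
plot_images(images)
```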
If you have reached this far, thanks for reading this short post. If you want to use a GUI-based tool, visit https://huggingface.co/spaces/stabilityai/stable-diffusion and try some interesting and complex input text. Thanks again…
[1] https://keras.io/guides/keras_cv/generate_images_with_stable_diffusion