Unit 6.1 - Generating images

From pixels to codes and back

Let us start with the simplest form of image generation we can think of. Imagine I have small images, say 32x32 pixels or so. I could use a neural network “encoder” to turn an image into a 512-number code. I have now compressed the image into a small code, which is useful for many things, for example to communicate with fewer bits.

Now imagine I received this code and I want to reconstruct the original image. I need a neural network “decoder” to go from code to pixels.

The encoder and decoder presented here together form an “auto-encoder”, which means that an input image or signal is converted into a smaller amount of data and then reconstructed.
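To make the encode/decode round trip concrete, here is a minimal sketch of an auto-encoder in NumPy. It is not a recommended architecture: the encoder and decoder are single linear layers, the images are random grayscale 32x32 arrays (so 1024 values each), and the names `w_enc`, `w_dec`, `encode` and `decode` are chosen for this illustration only. The 512-number code size follows the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 64 random grayscale 32x32 "images", flattened to 1024 values in [0, 1).
n_images, n_pixels, code_size = 64, 32 * 32, 512
images = rng.random((n_images, n_pixels))

# Linear encoder (pixels -> code) and decoder (code -> pixels), randomly initialized.
w_enc = rng.normal(0.0, 1.0 / np.sqrt(n_pixels), (n_pixels, code_size))
w_dec = rng.normal(0.0, 1.0 / np.sqrt(code_size), (code_size, n_pixels))

def encode(x):
    """Compress flattened images into 512-number codes."""
    return x @ w_enc

def decode(z):
    """Reconstruct flattened images from their codes."""
    return z @ w_dec

learning_rate = 0.1
losses = []
for step in range(50):
    codes = encode(images)
    reconstruction = decode(codes)
    error = reconstruction - images
    losses.append(float(np.mean(error ** 2)))  # mean squared reconstruction error

    # Manual back-propagation through the two linear layers.
    grad_reconstruction = 2.0 * error / error.size
    grad_w_dec = codes.T @ grad_reconstruction
    grad_codes = grad_reconstruction @ w_dec.T
    grad_w_enc = images.T @ grad_codes

    w_enc -= learning_rate * grad_w_enc
    w_dec -= learning_rate * grad_w_dec

print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The only training signal is the difference between the input image and its reconstruction, which is what makes this an auto-encoder: no labels or captions are needed.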

From text and images to concepts

Generating images using neural networks usually involves converting a text description of the desired image into actual pixels. Here we provide a summary of the most recent (as of 2024) techniques.

Since we want to use text to describe the desired output image, we need a way to connect images and their descriptions or captions. This is a form of data correlation or joint embedding, where we want the image and its caption to be encoded into the same neural code. The neural code is just a short, compressed version of the text or image.

It helps to think of these neural codes as “concepts” in your mind. Ideas or concepts encode multi-modal data in a compact form that is used for reasoning and intelligent behavior. In our human brain we also have concepts: for example, think of the word “cat”, which tags the multi-modal concept ‘cat’ in our brain as a collection of visual appearance, tactile feeling, motions, and all the characteristics we know of ‘cat’.

Here we show an example of joint modality encoding into a concept space. The text “a crab on black rocks” and an image of the same are encoded into the same 512-number vector.
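One way to check that a caption and an image landed on "the same" code is to compare their vectors with cosine similarity. The sketch below assumes no real encoders: the codes are synthetic stand-ins, where each text code is just its image code plus a little noise, to mimic what well-trained encoders would produce. The names and the noise level are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(vectors):
    """Scale each row to unit length so dot products become cosine similarities."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

# Assumption: synthetic codes standing in for the outputs of trained encoders.
n_pairs, code_size = 4, 512
image_codes = normalize(rng.normal(size=(n_pairs, code_size)))
noise = 0.3 * rng.normal(size=(n_pairs, code_size)) / np.sqrt(code_size)
text_codes = normalize(image_codes + noise)

# similarity[i, j] = cosine similarity between caption i and image j.
similarity = text_codes @ image_codes.T

# For each caption, pick the image whose code is closest in concept space.
best_match = similarity.argmax(axis=1)
print(best_match)
```

In a well-aligned concept space the largest entry of each row sits on the diagonal, meaning every caption is closest to its own image.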

One popular technique is CLIP. Another very easy way to do this is to add a trainable classifier to the text encoder, and train it to match the encoding of images. For example, one can use the encoded image as the “concept” space and add a 2-layer classifier to the text encoder (a CNN, for example). Training will make the classifier output the same concept as the image encoder.
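The second approach can be sketched as a small regression problem: a 2-layer head on top of text features is trained so that its output matches the image code of the paired image. Everything below is an assumption made for illustration: the text features and image codes are synthetic, the sizes are kept small (16 instead of 512) so it runs instantly, and the loss is a plain mean squared error rather than the contrastive loss used by CLIP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic text features and paired image codes (the "concept" targets).
n_pairs, text_dim, hidden_dim, concept_dim = 64, 32, 64, 16
text_features = rng.normal(size=(n_pairs, text_dim))
hidden_link = rng.normal(0.0, 1.0 / np.sqrt(text_dim), (text_dim, concept_dim))
image_codes = text_features @ hidden_link  # hypothetical text-image relationship

# Trainable 2-layer head: linear -> ReLU -> linear.
w1 = rng.normal(0.0, np.sqrt(2.0 / text_dim), (text_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(0.0, np.sqrt(1.0 / hidden_dim), (hidden_dim, concept_dim))
b2 = np.zeros(concept_dim)

def head(features):
    """Map text features to the concept space; also return the hidden layer."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    return hidden, hidden @ w2 + b2

learning_rate = 0.1
losses = []
for step in range(300):
    hidden, predicted = head(text_features)
    error = predicted - image_codes
    losses.append(float(np.mean(error ** 2)))

    # Manual back-propagation of the mean squared error.
    grad_predicted = 2.0 * error / error.size
    grad_w2 = hidden.T @ grad_predicted
    grad_b2 = grad_predicted.sum(axis=0)
    grad_hidden = (grad_predicted @ w2.T) * (hidden > 0)  # ReLU gate
    grad_w1 = text_features.T @ grad_hidden
    grad_b1 = grad_hidden.sum(axis=0)

    w1 -= learning_rate * grad_w1
    b1 -= learning_rate * grad_b1
    w2 -= learning_rate * grad_w2
    b2 -= learning_rate * grad_b2

print(f"text-to-concept error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Note that only the head is updated here; the image codes act as fixed targets, which matches the idea of using the encoded image as the concept space.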

After the text and image encoders have been trained, the way to generate pixels is to take the concept and project it into pixel space. The concept contains an alignment to both textual content and its relation to images.

The image encoder path is not required at generation time, as it is only used to train the concept space. In actual pixel generation we use the text encoder output, projected into the concept space, as the vehicle to drive image generation, as can be seen below.

From concepts to pixels?

How do you generate pixels from a concept? It requires the use of an image decoder, trained from an auto-encoder perspective. What does this mean? Recall we used an image encoder to encode images into the same concept space as their captions. The image decoder is trained to invert that mapping, going from the code back to pixels.

The easiest way to do this is to train an image decoder as a stack of upsampling (transposed) convolutions. These modules can upsample images by creating pixels between pixels.
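To see how a transposed convolution "creates pixels between pixels", here is a small single-channel sketch in NumPy. The function name `conv_transpose2d` and the fixed all-ones kernel are choices made for this illustration; in a real decoder the kernel values are learned and there are many channels.

```python
import numpy as np

def conv_transpose2d(image, kernel, stride):
    """Transposed convolution of a single-channel image, without padding."""
    height, width = image.shape
    k_h, k_w = kernel.shape
    out = np.zeros(((height - 1) * stride + k_h, (width - 1) * stride + k_w))
    for i in range(height):
        for j in range(width):
            # Each input pixel stamps a scaled copy of the kernel into the output.
            top, left = i * stride, j * stride
            out[top:top + k_h, left:left + k_w] += image[i, j] * kernel
    return out

small = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))  # assumption: fixed kernel; a decoder would learn it
large = conv_transpose2d(small, kernel, stride=2)
print(small.shape, "->", large.shape)
```

With a 2x2 kernel and stride 2, each input pixel becomes a 2x2 block, so a 4x4 image grows to 8x8. Stacking several such layers doubles the resolution each time.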

But there are many other techniques to generate pixels, including adversarial techniques (GANs), which are known to be difficult to train.

Clearly it is a difficult task for one neural network to create large images in one shot. That is why the techniques above have proven effective mainly for images with low pixel counts. Of course we would like many more pixels in our images, more than 512x512, ideally thousands of pixels per side. One-shot methods are just not as effective at providing this level of detail.

What is needed is a neural network that can evolve an image over multiple steps. As of 2024, one of the most popular techniques to create large images is diffusion. These models learn to remove noise from an image in successive steps, typically using a U-Net neural network architecture.
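The noise side of diffusion can be sketched in a few lines. The example below assumes the common formulation in which a clean image is mixed with Gaussian noise according to a schedule, and the network is trained to predict that noise; the linear schedule values, the 1000 steps and the function names are assumptions for this illustration. There is no U-Net here: an "oracle" that already knows the noise stands in for the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a linear noise schedule over 1000 steps.
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alpha_bars = np.cumprod(1.0 - betas)  # fraction of signal variance left at step t

def add_noise(clean, t, noise):
    """Jump directly to noise level t: mix the clean image with Gaussian noise."""
    return np.sqrt(alpha_bars[t]) * clean + np.sqrt(1.0 - alpha_bars[t]) * noise

def remove_noise(noisy, t, predicted_noise):
    """Estimate the clean image given a prediction of the noise that was added."""
    return (noisy - np.sqrt(1.0 - alpha_bars[t]) * predicted_noise) / np.sqrt(alpha_bars[t])

clean = rng.random((32, 32))            # stand-in for a training image
noise = rng.normal(size=clean.shape)    # this is what the network learns to predict
t = 500
noisy = add_noise(clean, t, noise)

# Oracle check: with a perfect noise prediction the clean image is recovered.
recovered = remove_noise(noisy, t, noise)
print(np.abs(recovered - clean).max())
```

A real model never sees the true noise at generation time. It starts from pure noise and applies its own noise prediction repeatedly, moving from high `t` to low `t`, which is the "successive steps" described above.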

Training of image generators

Training of image generators can be done by using the encoders for text and images and the decoder together.

Training these systems requires two steps:

first: train the CLIP encoders
second: train the image generator

The second step can be performed in this way:

encode a text prompt into a code
use the image decoder to produce a generated image from the code (code 1)
use the CLIP image encoder on the generated image and get its code (code 2)
back-propagate on the difference between code 1 and code 2 to update the image decoder
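The four steps above can be sketched as a training loop. This is a toy version under stated assumptions: the text codes are random vectors standing in for encoded prompts, the "CLIP image encoder" is a frozen random linear map rather than a real CLIP model, the decoder is a single linear layer, and the sizes are tiny. Only the decoder weights are updated.

```python
import numpy as np

rng = np.random.default_rng(0)

n_prompts, code_size, n_pixels = 32, 8, 64

# Step 1 (assumed already done): text prompts encoded into codes (code 1).
text_codes = rng.normal(size=(n_prompts, code_size))

# Frozen stand-in for the trained image encoder (pixels -> code).
w_image_encoder = rng.normal(0.0, 1.0 / np.sqrt(n_pixels), (n_pixels, code_size))

# Trainable image decoder (code -> pixels).
w_decoder = rng.normal(0.0, 0.01, (code_size, n_pixels))

learning_rate = 0.5
losses = []
for step in range(200):
    generated = text_codes @ w_decoder           # step 2: decode code 1 into pixels
    image_codes = generated @ w_image_encoder    # step 3: re-encode pixels (code 2)
    difference = image_codes - text_codes        # step 4: compare code 2 with code 1
    losses.append(float(np.mean(difference ** 2)))

    # Back-propagate through the frozen encoder into the decoder weights.
    grad_codes = 2.0 * difference / difference.size
    grad_generated = grad_codes @ w_image_encoder.T
    grad_w_decoder = text_codes.T @ grad_generated
    w_decoder -= learning_rate * grad_w_decoder

print(f"code mismatch: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The gradient flows through the image encoder but does not change it, so the concept space learned in the first step stays fixed while the decoder learns to produce images that encode back to the prompt's code.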

References

https://www.assemblyai.com/blog/how-dall-e-2-actually-works/

https://x.com/stanley_h_chan/status/1764827260115190075?s=20