Deciphering DALL-E

Rahul Prabhu
3 min read · Oct 17, 2022

OpenAI’s DALL-E image generator model has been out for a few months, and the results have been breathtaking. When fed a short text prompt, which can be as nonsensical as “Teddy bears working on new AI research underwater with 1990s technology”, it can generate completely new images containing plausible depictions of the nonsense in the prompt.

DALL-E has three basic steps: It feeds the text prompt into a text encoder, which represents the prompt as a point in a text-embedding space. Next, it maps this text encoding to a corresponding image encoding that carries the same semantic information. Finally, this image encoding is decoded to produce an image that adequately captures the semantic information of the encoding. (Semantic means the meaning, the interpretation of the information.)
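The three steps above can be sketched as a tiny pipeline. This is only an illustration: the helper functions below are toy stand-ins I made up for this post, not OpenAI’s actual networks or API.

```python
# A minimal sketch of DALL-E's three stages, with toy stand-ins for the
# real networks (every helper name here is illustrative, not OpenAI's API).

def encode_text(prompt):
    # Step 1: map the prompt into a text-embedding space.
    return [float(len(word)) for word in prompt.split()]  # toy "embedding"

def prior(text_embedding):
    # Step 2: map the text encoding to a corresponding image encoding.
    return [v * 0.5 for v in text_embedding]              # toy mapping

def decode(image_embedding):
    # Step 3: decode the image encoding into pixels.
    return [[v] * 2 for v in image_embedding]             # toy "image"

image = decode(prior(encode_text("teddy bears underwater")))
print(len(image))  # one "row" per token in this toy pipeline
```

The real model replaces each toy function with a large neural network, but the data flow (text → text encoding → image encoding → image) is the same.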

Encoding text

CLIP (Contrastive Language-Image Pre-training), a previous OpenAI model, connects textual and visual semantics. It learns how closely related an image caption is to a given image, rather than trying to predict a caption for an image. This kind of approach is called contrastive learning and is the reason why CLIP can make connections between text and visuals.

CLIP works in three steps: All images and their corresponding captions are passed through encoders, which map them into a multi-dimensional space. The cosine similarity of every image-text pair is calculated. Finally, the model trains to maximize the cosine similarity between correct encoded pairs and minimize the similarity between incorrect pairs.
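Those three steps can be shown on a toy batch. This is a simplified sketch of the contrastive objective, not CLIP’s actual implementation (the real model uses learned encoders, a learned temperature, and a symmetric loss over both images and captions).

```python
import math

def cosine(u, v):
    # dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(image_embs, text_embs):
    # Step 2: pairwise cosine-similarity matrix (rows = images, cols = captions).
    sims = [[cosine(im, tx) for tx in text_embs] for im in image_embs]
    # Step 3: cross-entropy toward the diagonal, where image i matches caption i.
    loss = 0.0
    for i, row in enumerate(sims):
        log_z = math.log(sum(math.exp(s) for s in row))
        loss += -(row[i] - log_z)
    return loss / len(sims)

# Toy embeddings: correctly paired captions give a lower loss than swapped ones.
images     = [[1.0, 0.0], [0.0, 1.0]]
texts_good = [[1.0, 0.0], [0.0, 1.0]]   # correct pairing
texts_bad  = [[0.0, 1.0], [1.0, 0.0]]   # swapped pairing
print(contrastive_loss(images, texts_good) < contrastive_loss(images, texts_bad))  # True
```

Training pushes the loss down, which pulls matching image-caption pairs together in the shared space and pushes mismatched pairs apart.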

CLIP creates a shared vector space in which both images and text are embedded. This kind of translatable dictionary allows it to pick the correct caption for an image from a list of candidate captions, rather than just identify a category to which the image belongs.

Cosine similarity is a method to, as the name suggests, measure similarity. Given two vectors in an inner product space, the cosine similarity is simply the cosine of the angle between them: their dot product divided by the product of their magnitudes. If the vector pair represents a text-image pair, the cosine similarity measures how alike the two are.
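The formula is short enough to write out directly. A minimal implementation:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [2, 0]))  # 1.0  (same direction: maximally similar)
print(cosine_similarity([1, 0], [0, 3]))  # 0.0  (orthogonal: unrelated)
```

Note that only the angle matters, not the magnitude: a vector and a scaled copy of it are perfectly similar.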

DALL-E uses the CLIP model to determine the relationship between the input text and the visual concept that it will output.

Mapping text encoding to image encoding

CLIP provides a representational space in which DALL-E can measure how closely a text and image pair are related, but our task is image generation. One of OpenAI’s previous models, GLIDE, is used to exploit this space and create images. GLIDE inverts the image-encoding process: it produces images from image encodings that CLIP marks as semantically similar to the text encoding generated from the input. The images GLIDE produces maintain the important salient features of the original images CLIP was trained on.

GLIDE uses a diffusion model, a concept borrowed from thermodynamics. It reverses a noising process: a Markov chain that gradually adds noise to images until the result is pure Gaussian noise. (A Markov chain describes a sequence of possible events, where each event’s probability depends only on the previous state.) Gaussian noise is statistical noise in an image whose probability density function follows the Gaussian distribution.
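The forward (noising) half of the chain is easy to demonstrate. The sketch below uses a fixed noise level per step for simplicity; real diffusion models use a schedule of noise levels, and this is a stand-in for that process, not GLIDE’s actual code.

```python
import random
import statistics

def forward_noise(x, steps=1000, beta=0.02):
    # Markov chain: each step depends only on the previous state and mixes
    # the signal with fresh Gaussian noise.
    #   x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps,   eps ~ N(0, 1)
    for _ in range(steps):
        x = [((1 - beta) ** 0.5) * xi + (beta ** 0.5) * random.gauss(0.0, 1.0)
             for xi in x]
    return x

random.seed(0)
image = [1.0] * 512                 # stand-in for flattened pixel values
noised = forward_noise(image)
# After many steps the original signal is gone: mean near 0, variance near 1,
# i.e. the image has become (approximately) pure Gaussian noise.
print(round(statistics.mean(noised), 2), round(statistics.pvariance(noised), 2))
```

The mixing weights are chosen so the variance settles at 1: after enough steps, every starting image ends up at the same noise distribution, which is exactly what the generator will later sample from.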

GLIDE randomly samples Gaussian noise and then de-noises it, step by step, to generate an image. To ensure that the image exhibits specific features, in this case the salient aspects of the text encoding, DALL-E uses vectors from the CLIP space to guide which pixels are modified at every step.
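The arithmetic of a single de-noising step can be shown by inverting the noising formula. In the real model a neural network *predicts* the noise (conditioned on the CLIP encoding); here, purely to illustrate the update, we hand the step the true noise, which is an assumption that makes the inversion exact.

```python
import random

def denoise_step(x, predicted_noise, beta=0.02):
    # One reverse step: undo  x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    # by subtracting the (predicted) noise and rescaling.
    return [(xi - (beta ** 0.5) * ni) / ((1 - beta) ** 0.5)
            for xi, ni in zip(x, predicted_noise)]

random.seed(1)
beta = 0.02
x0 = [0.3, -0.7, 1.2]                                  # "clean" values
eps = [random.gauss(0.0, 1.0) for _ in x0]             # noise that was added
xt = [((1 - beta) ** 0.5) * xi + (beta ** 0.5) * ei    # one forward step
      for xi, ei in zip(x0, eps)]
recovered = denoise_step(xt, eps, beta)
print([round(v, 6) for v in recovered])  # recovers x0
```

The hard part, and what GLIDE's network is trained to do, is producing a good noise prediction when the true noise is unknown; the guidance from the CLIP space steers those predictions toward the features the prompt asked for.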

DALL-E uses both the GLIDE and CLIP models to generate new images from new input texts.