CLIP-Gen Overview

Examples of text-to-image generations by CLIP-GEN


Training a general-domain text-to-image generator such as DALL-E, GauGAN, or CogView requires huge amounts of paired text-image data, which is difficult and expensive to collect. In this paper, the authors propose a self-supervised scheme named CLIP-GEN for general text-to-image generation that uses the language-image priors extracted by a pre-trained CLIP model.


Only a set of unlabeled images in the general domain is required to train a text-to-image generator. First, the embedding of the image in the unified language-vision embedding space is extracted with the CLIP image encoder.

Next, the image is converted into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN can be trained using unlabeled data).

Finally, an autoregressive transformer is trained to map an image's unified language-vision representation to its image tokens. Once training is complete, the transformer can generate coherent image tokens from the text embedding that the CLIP text encoder extracts from an input text.

Such a strategy enables the training of a strong and general text-to-image generator with large text-free image datasets such as ImageNet. CLIP-GEN significantly outperforms optimization-based text-to-image methods in terms of image quality while not compromising text-image matching.

CLIP-GEN image processing, training, and inference pipelines.

a) Pipeline for mapping a sentence to the corresponding image through the embedding space and the token space.

b) Training and testing pipeline.

During training, the pre-trained CLIP model embeds the image into a cross-modality embedding and the pre-trained image tokenizer encodes the image into discrete image tokens.

The autoregressive transformer learns to predict the image tokens conditioned on the cross-modality embedding. During inference, the CLIP model can take either an image or a sentence as input, and the transformer then predicts coherent image tokens that are semantically related to that input.


The model is made up of three components:

  • a pre-trained language-image matching model (CLIP),
  • an image tokenizer (VQ-GAN),
  • a conditional autoregressive transformer that takes the CLIP embedding of an image as its condition and generates the discrete image tokens of that same image.

The three phases of the CLIP-GEN framework.

CLIP loss

Contrastive Language–Image Pre-training (CLIP) has achieved great success in mapping language and image inputs to a common embedding space. Given an image I or a sentence T as the input (denoted as x), the CLIP model embeds it into this common representation space, so that an image and a sentence with matching content end up close together.
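Because images and sentences share one space, comparing them reduces to comparing vectors. The sketch below shows the usual way such embeddings are compared: each vector is scaled to unit length and the dot product (cosine similarity) is taken. The function names and the toy three-dimensional vectors are illustrative assumptions; real CLIP ViT-B/32 embeddings have 512 dimensions and come from the pre-trained encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy stand-ins for CLIP embeddings (hypothetical values, not real model output).
image_embedding = [0.9, 0.1, 0.3]
matching_text = [0.8, 0.2, 0.4]
unrelated_text = [-0.5, 0.9, -0.1]

sim_match = cosine_similarity(image_embedding, matching_text)
sim_other = cosine_similarity(image_embedding, unrelated_text)
print(sim_match > sim_other)
```

With real encoders, the same comparison is what lets a text embedding stand in for an image embedding at inference time.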

VQ-GAN as an Efficient Image Tokenizer

VQ-GAN is used to learn a perceptually rich codebook Z by optimizing all parameters of the encoder E, decoder G, and discriminator D. After the training finishes, the discriminator is removed.

The encoder and the codebook are used as the image tokenizer, and the decoder is used for reconstructing an image from its tokens.
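The tokenization step can be illustrated with a nearest-neighbor lookup: each feature vector produced by the encoder is replaced by the index of the closest codebook entry. This is a minimal sketch of vector quantization in plain Python, not the VQ-GAN implementation. The names `quantize` and `lookup` and the toy 4-entry, 2-dimensional codebook are assumptions for illustration; the codebook in the article's experiments has 16384 entries of dimension 256.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for feature in features:
        # Squared Euclidean distance from this feature to every codebook entry.
        distances = [
            sum((f - c) ** 2 for f, c in zip(feature, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def lookup(tokens, codebook):
    """Turn token indices back into codebook vectors (the decoder's input)."""
    return [codebook[t] for t in tokens]


# Toy codebook and encoder features (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.7, 0.9]]

tokens = quantize(features, codebook)
print(tokens)
```

In the full model the features come from the encoder E, and the vectors returned by `lookup` would be fed to the decoder G to reconstruct the image.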

The VQ-GAN model is optimized with an objective that combines a vector-quantization (VQ) loss, which includes the reconstruction term, with an adversarial loss.

VQ and Adversarial loss functions.
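For reference, the two terms named in the caption above can be written as follows, using the formulation from the VQ-GAN paper. The notation is an assumption carried over from that paper and may differ from the article's figure: x is the input image, z_q is the quantized code (the codebook entries nearest to E(x)), x̂ = G(z_q) is the reconstruction, sg[·] is the stop-gradient operator, and β is a commitment weight.

```latex
\[
\begin{aligned}
\mathcal{L}_{\mathrm{VQ}}(E, G, Z) &= \lVert x - \hat{x} \rVert^2
  + \lVert \mathrm{sg}[E(x)] - z_q \rVert_2^2
  + \beta \, \lVert \mathrm{sg}[z_q] - E(x) \rVert_2^2 \\
\mathcal{L}_{\mathrm{GAN}}(\{E, G, Z\}, D) &= \log D(x) + \log\bigl(1 - D(\hat{x})\bigr)
\end{aligned}
\]
```

The first term of the VQ loss is the reconstruction loss. The other two pull the codebook entries and the encoder outputs toward each other.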

Conditional Autoregressive Transformer

The conditional autoregressive transformer is designed to predict the tokens of an image from its CLIP embedding.

Given an input image, its embedding is obtained with the CLIP image encoder, and its image tokens are obtained with the tokenizer and flattened into a row-major ordered sequence. Since the CLIP model only extracts high-level semantic information from an image, the transformer restores the low-level image information autoregressively, predicting each token from the embedding and all previously generated tokens.
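Written out, this is the standard autoregressive factorization of the token sequence conditioned on the embedding. The symbols are assumptions chosen for this note: e is the CLIP embedding, t_1, ..., t_N are the row-major image tokens, and t_{<i} denotes all tokens before position i.

```latex
\[
p(t_1, \dots, t_N \mid e) = \prod_{i=1}^{N} p\bigl(t_i \mid t_{<i},\, e\bigr)
\]
```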

Once the complete set of tokens has been generated for the given embedding, the pre-trained VQ-GAN decoder reconstructs an image from them.
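The generation loop itself is short. The sketch below samples tokens one at a time, each conditioned on the embedding and on the tokens produced so far. It is a minimal illustration under stated assumptions: `toy_logits` is a hypothetical stand-in for the trained transformer, and the 4-token vocabulary, the embedding values, and the function names are made up for the example.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_tokens(next_token_logits, embedding, num_tokens, rng):
    """Sample tokens sequentially, conditioning on the embedding and prior tokens."""
    tokens = []
    for _ in range(num_tokens):
        probs = softmax(next_token_logits(embedding, tokens))
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def toy_logits(embedding, tokens):
    """Hypothetical model stand-in: favors the token equal to the position mod 4."""
    vocab_size = 4
    position = len(tokens)
    return [5.0 if i == position % vocab_size else 0.0 for i in range(vocab_size)]


tokens = generate_tokens(toy_logits, embedding=[0.1, 0.2], num_tokens=8,
                         rng=random.Random(0))
print(tokens)
```

In the real pipeline the resulting token sequence would be reshaped to a grid, looked up in the codebook, and passed to the VQ-GAN decoder.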

Training strategy

First Stage

A VQ-GAN model is first trained on the image dataset in a self-supervised manner. All components, i.e. the encoder, decoder, discriminator, and codebook, are optimized during training.

The training objective combines the VQ and adversarial losses described in the VQ-GAN section.
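In the VQ-GAN paper, this objective is a min-max problem over the model components. The form below is taken from that paper and is assumed, not confirmed, to match the authors' first stage. L_VQ and L_GAN denote the VQ and adversarial losses, and λ is the weight that paper places on the adversarial term (not to be confused with the second-stage weight of 0.2).

```latex
\[
Q^{*} = \arg\min_{E,\,G,\,Z} \; \max_{D} \;
\mathbb{E}_{x \sim p(x)} \Bigl[
  \mathcal{L}_{\mathrm{VQ}}(E, G, Z)
  + \lambda \, \mathcal{L}_{\mathrm{GAN}}(\{E, G, Z\}, D)
\Bigr]
\]
```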

Second Stage

The conditional autoregressive transformer is trained at this stage. Since the input-output data is paired (embedding ⇒ image), the objective is a sum of the embedding reconstruction loss and a loss that maximizes the likelihood of the corresponding image tokens.

The reconstruction loss implemented in CLIP

Likelihood maximization loss

The training objective is the weighted combination of the two losses above, where the weight λ is equal to 0.2 in the authors’ implementation.
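The weighted combination can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the source does not say which term λ multiplies, so the sketch assumes it scales the embedding reconstruction term, and it uses a squared L2 distance as an assumed form of that term. All function names are hypothetical.

```python
import math


def token_nll(probs, targets):
    """Mean negative log-likelihood of the target tokens.

    probs[i] is the predicted distribution over tokens at position i.
    """
    log_likelihood = sum(math.log(p[t]) for p, t in zip(probs, targets))
    return -log_likelihood / len(targets)


def embedding_distance(original, reconstructed):
    """Squared L2 distance between two embeddings (assumed form of the loss)."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed))


def second_stage_loss(nll, reconstruction, lam=0.2):
    """Weighted sum; placing lam on the reconstruction term is an assumption."""
    return nll + lam * reconstruction


# Toy values: two token positions over a 2-token vocabulary.
probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]

nll = token_nll(probs, targets)
reconstruction = embedding_distance([1.0, 0.0], [0.0, 0.0])
loss = second_stage_loss(nll, reconstruction)
print(loss)
```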

Training and evaluation

Two datasets were used for benchmarking — ImageNet and MS-COCO.

Layer counts, codebook sizes, and embedding dimensions of the two models.

For both datasets, a VQGAN with a codebook size of 16384 and a codebook embedding dimension of 256 was trained. GPT-2 was used as the architecture of the conditional transformer: the 24-layer GPT-2 medium for MS-COCO and the 48-layer GPT2-XL for ImageNet. The CLIP backbone used in the experiments was ViT-B/32.
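The two setups can be summarized as a small configuration sketch. The `ClipGenConfig` class and its field names are hypothetical and exist only to collect the numbers quoted above in one place; the derived property shows how many values the shared codebook holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipGenConfig:
    """Hypothetical container for the hyperparameters quoted in the text."""
    dataset: str
    transformer: str
    num_layers: int
    codebook_size: int = 16384
    codebook_dim: int = 256
    clip_backbone: str = "ViT-B/32"

    @property
    def codebook_parameters(self) -> int:
        """Number of scalar values stored in the VQGAN codebook."""
        return self.codebook_size * self.codebook_dim


CONFIGS = {
    "MS-COCO": ClipGenConfig("MS-COCO", "GPT-2 medium", 24),
    "ImageNet": ClipGenConfig("ImageNet", "GPT2-XL", 48),
}

for name, config in CONFIGS.items():
    print(name, config.transformer, config.num_layers, config.codebook_parameters)
```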

Evaluation metrics

To evaluate the quality of generated images, standard metrics such as Inception Score (IS), Fréchet Inception Distance (FID), and CapS were used.

IS is the exponential of the average KL divergence between the conditional class distribution p(y|x) of each generated image and the marginal class distribution p(y), both given by an image classifier.
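A minimal sketch of the Inception Score computation, assuming the classifier's per-image class probabilities are already available as a list of rows. The toy probabilities are hypothetical; in practice they come from a pre-trained Inception network applied to many generated images.

```python
import math


def inception_score(class_probs):
    """exp of the mean KL(p(y|x) || p(y)) over all images."""
    num_images = len(class_probs)
    num_classes = len(class_probs[0])

    # Marginal class distribution p(y): average of the per-image predictions.
    marginal = [
        sum(row[c] for row in class_probs) / num_images
        for c in range(num_classes)
    ]

    mean_kl = 0.0
    for row in class_probs:
        # Terms with zero probability contribute nothing to the KL divergence.
        kl = sum(p * math.log(p / marginal[c]) for c, p in enumerate(row) if p > 0)
        mean_kl += kl / num_images
    return math.exp(mean_kl)


# Uninformative predictions: every image looks equally like every class.
uniform = [[0.5, 0.5], [0.5, 0.5]]
# Confident and diverse predictions: each image clearly belongs to a different class.
confident = [[1.0, 0.0], [0.0, 1.0]]

print(inception_score(uniform), inception_score(confident))
```

The score is lowest (1) when predictions carry no information and grows toward the number of classes when each image is classified confidently and the classes are evenly covered.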

FID computes the Fréchet distance between the distributions of Inception features of synthetic images and real-world images.
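In the usual formulation, each set of Inception features is summarized by a Gaussian, and the distance has a closed form. The symbols below are assumptions chosen for this note: μ_r, Σ_r are the mean and covariance of the real-image features, and μ_g, Σ_g those of the generated-image features.

```latex
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Bigl( \Sigma_r + \Sigma_g - 2 \bigl( \Sigma_r \Sigma_g \bigr)^{1/2} \Bigr)
\]
```

Lower values are better: the distance is zero when the two feature distributions have the same mean and covariance.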

CapS measures the semantic similarities between the input text and the generated image.

CLIP-GEN achieves the best FID-0 and FID-1 scores, which the authors attribute to the perceptually rich results generated by VQGAN and to coherent image structures.

The CapS score is 4% lower than CogView's but significantly better than those of the other competing methods.

Results and examples

To examine the generalization ability of the method, the authors generate images from out-of-distribution language descriptions. Some of the descriptions (e.g. “a dog with a cigarette”, “a lemon with hair and face”) do not even have a corresponding real-world image. Although CLIP-GEN is trained on realistic images, it generates images that align surprisingly well with these out-of-distribution texts.

On the other hand, CogView, which is trained on large amounts of images with textual labels, fails to match those descriptive modifiers (e.g., “flying”, “with a cigarette”, “with a big beak”).

The authors also explore the generalization ability of the method in terms of stylized synthesis. They attempt to generate images under special style descriptions (e.g., “sketch”, “oil painting” or even “style of Edvard Munch”).

As shown below, the model can successfully synthesize stylized pictures even without seeing many stylized training samples, as no style augmentation is applied during training.





AI Researcher, ML Engineer @Infopulse. MSc Applied Maths at NaUKMA. Find me at

Dmytro Kuzmenko
