Since the release of CLIP by OpenAI, this multi-modal model has found many applications, including StyleCLIP. StyleCLIP combines a high-resolution image generator, StyleGAN, with the text-image connector CLIP. By measuring the cosine similarity between the CLIP embedding of a text prompt and the CLIP embedding of the StyleGAN-generated image, StyleCLIP makes it possible to conveniently manipulate an image with a text prompt.
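As a rough illustration of the text-image scoring that drives the edit, here is a minimal sketch using OpenAI's `clip` package; the file name and prompt are placeholders, not code from this repository:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode a generated (or real) image and a driving text prompt into CLIP space.
image = preprocess(Image.open("generated_face.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a person with purple hair"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the two embeddings; StyleCLIP optimizes the
# StyleGAN latent code against a loss derived from this score.
similarity = torch.nn.functional.cosine_similarity(image_features, text_features)
print(similarity.item())
```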
We further extend the benefits of StyleCLIP by plugging Multilingual-CLIP into the model. Multilingual-CLIP consists of two encoders: an image encoder and a fine-tuned text encoder capable of encoding any language. As a result, our version of StyleCLIP can manipulate an image not only with an English text prompt, but also with a prompt in any other language, for example Korean.
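For intuition, a multilingual text embedding can be obtained roughly as follows. This is a minimal sketch assuming the `multilingual-clip` package and the `M-CLIP/M-BERT-Base-ViT-B` checkpoint on Hugging Face; the exact loading code used in this repository may differ:

```python
from multilingual_clip import pt_multilingual_clip
import transformers

model_name = "M-CLIP/M-BERT-Base-ViT-B"  # assumed Hugging Face model id
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Korean prompt: "a person with purple hair"
prompts = ["보라색 머리카락을 가진 사람"]
text_embeddings = model.forward(prompts, tokenizer)  # projected into CLIP space
print(text_embeddings.shape)
```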
The accuracy of the image-encoding step has also improved. The official image encoder in StyleCLIP is Encoder4Editing (e4e), which is used during both training and testing. Empirically, however, we found that the e4e result can differ noticeably from the original input image. To overcome this issue, we encoded the mapper training datasets and the inference images with the ReStyle encoder. The ReStyle encoder, introduced in the paper "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" (ICCV 2021), iteratively self-corrects the inverted latent code, resulting in increased accuracy.
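Conceptually, the iterative refinement looks like the sketch below; the encoder and generator interfaces here are hypothetical stand-ins, not the actual ReStyle implementation:

```python
import torch

def restyle_invert(encoder, generator, image, avg_latent, n_iters=5):
    """Iteratively refine an inverted latent code (ReStyle-style sketch).

    encoder(x)   -> predicted residual latent, where x stacks the input image
                    and the current reconstruction along the channel dimension
    generator(w) -> image reconstructed from latent code w
    """
    latent = avg_latent.clone()
    recon = generator(latent)
    for _ in range(n_iters):
        # Predict a correction from (input image, current reconstruction) ...
        delta = encoder(torch.cat([image, recon], dim=1))
        # ... and apply it as a residual update to the latent code.
        latent = latent + delta
        recon = generator(latent)
    return latent, recon
```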
This repository contains:
- PyTorch training code for the Multilingual Latent Optimizer, Latent Mapper, and Global Direction
- PyTorch inference code for the Multilingual Latent Optimizer, Latent Mapper, and Global Direction
- Latent Mapper and Global Direction weights
- CelebA-HQ dataset latents (encoded via ReStyle)
- ReStyle encoder applied over pSp, pretrained on the FFHQ dataset
- M-BERT Base ViT-B transformer, available via Hugging Face
- CLIP
- StyleGAN2
The experiments were run under the following conditions:
- Python 3.7.12
- Torch 1.10.0+cu11
- Google Colab
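If you are unsure whether your Colab runtime matches these versions, a quick check like the following (plain Python/PyTorch calls, not part of this repository) can help:

```python
import sys
import torch

# Print the interpreter and PyTorch versions and confirm a GPU is visible.
print("Python:", sys.version.split()[0])
print("Torch :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```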
The code relies on Rosinality's PyTorch implementation of StyleGAN2. The facial recognition weights and the pretrained ReStyle encoder are to be downloaded here.
- --description is the driving text (it can be in any language).
- To control the manipulation effect, adjust the L2 lambda and ID lambda parameters.
Given a textual description, one can either edit a given image or generate a random image that best fits the description. Both operations can be done through the main.py script or the optimization_playground.ipynb notebook.
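For example, an edit driven by a Korean prompt might be launched as follows. The loss-weight flag names and values are assumptions based on the bullet above, so check the script's help for the real argument names:

```
# Hypothetical invocation of the latent optimizer.
# --l2_lambda / --id_lambda are assumed names for the L2 and ID loss weights,
# and 0.008 / 0.005 are only example values.
# Prompt translation: "a person with purple hair"
!python main.py --description "보라색 머리카락을 가진 사람" --l2_lambda 0.008 --id_lambda 0.005
```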
The code relies on Rosinality's PyTorch implementation of StyleGAN2.
- This repository trains the mapper with a dataset that was inverted by the e4e encoder instead of the ReStyle encoder.
- Inference on ReStyle-inverted images works just fine.
- The e4e-inverted dataset is located in the original StyleCLIP repository.
- To resume training, provide --checkpoint_path (see the example after the training command below).
- --description is the driving text (it can be in any language).
- To control the manipulation effect, adjust the L2 lambda and ID lambda parameters.
- Proper training takes up to 10 hours.
Example training command (the Korean prompt translates to "a person with purple hair"):
!python models/mapper/scripts/train.py --exp_dir exp_dir --no_fine_mapper --description "보라색 머리카락을 가진 사람" \
--latents_train_path data/celebA/train_faces.pt --latents_test_path data/celebA/test_faces.pt \
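To resume from a saved checkpoint, the same script can be pointed at the previous run; the checkpoint filename below is only a placeholder:

```
# Hypothetical resume command; replace the checkpoint path with your own file.
!python models/mapper/scripts/train.py --exp_dir exp_dir --checkpoint_path exp_dir/checkpoints/latest.pt \
--latents_train_path data/celebA/train_faces.pt --latents_test_path data/celebA/test_faces.pt
```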
- For inference, we provide several pretrained mappers (trained with Korean text prompts).
- Google Drive links for pretrained weights:
The code relies on the official TensorFlow implementation of StyleGAN2. The facial recognition weights and the pretrained ReStyle encoder are to be downloaded here.
Open the notebook in Colab and run all the cells.
In the last cell you can play with the image. beta corresponds to the disentanglement threshold, and alpha to the manipulation strength. After you set the desired parameters, run the last cell again to generate the image.
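Roughly speaking, the two knobs act on the edit direction as in the sketch below; the function and array names are illustrative and do not correspond to actual variables in the notebook:

```python
import numpy as np

def apply_global_direction(style_code, relevance, alpha, beta):
    """Sketch of how alpha and beta shape a global-direction edit.

    style_code : flattened style code of the image
    relevance  : per-channel relevance of the target text direction
    beta       : disentanglement threshold; channels below it are left untouched
    alpha      : manipulation strength along the remaining direction
    """
    direction = np.where(np.abs(relevance) >= beta, relevance, 0.0)
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction = direction / norm
    return style_code + alpha * direction
```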
The images below are from CelebA-HQ and were inverted into the latent space via the ReStyle encoder.
Compare results in other languages: English, Korean, Chinese, Spanish
Original