Since the release of CLIP by OpenAI, this multi-modal model has found many applications, including StyleCLIP. StyleCLIP combines a high-resolution image generator, StyleGAN, with the text-image connector CLIP. By measuring the cosine similarity between the CLIP embedding of a text prompt and the CLIP embedding of the StyleGAN-generated image, StyleCLIP makes it possible to conveniently manipulate an image with a text prompt.
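As a rough illustration of the text-image scoring that drives the edit, here is a minimal sketch using OpenAI's `clip` package; the file name and prompt are placeholders, not code from this repository:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode a generated (or real) image and a driving text prompt into CLIP space.
image = preprocess(Image.open("generated_face.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a person with purple hair"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the two embeddings; StyleCLIP optimizes the
# StyleGAN latent code against a loss derived from this score.
similarity = torch.nn.functional.cosine_similarity(image_features, text_features)
print(similarity.item())
```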
We further extend the benefits of StyleCLIP by plugging Multilingual-CLIP into the model. Multilingual-CLIP consists of two encoders: an image encoder and a fine-tuned text encoder capable of encoding any language. As a result, our version of StyleCLIP can manipulate an image not only with an English text prompt, but also with a prompt in any other language, for example Korean.
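For intuition, a multilingual text embedding can be obtained roughly as follows. This is a minimal sketch assuming the `multilingual-clip` package and the `M-CLIP/M-BERT-Base-ViT-B` checkpoint on Hugging Face; the exact loading code used in this repository may differ:

```python
from multilingual_clip import pt_multilingual_clip
import transformers

model_name = "M-CLIP/M-BERT-Base-ViT-B"  # assumed Hugging Face model id
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Korean prompt: "a person with purple hair"
prompts = ["보라색 머리카락을 가진 사람"]
text_embeddings = model.forward(prompts, tokenizer)  # projected into CLIP space
print(text_embeddings.shape)
```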
The accuracy of the image-encoding step has also improved. The official image encoder in StyleCLIP is Encoder4Editing (e4e), which is used during both training and testing. Empirically, however, we found that the e4e result can differ noticeably from the original input image. To overcome this issue, we encoded the mapper training datasets and the inference images with the ReStyle encoder. The ReStyle encoder, introduced in the paper "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" (ICCV 2021), iteratively self-corrects the inverted latent code, resulting in increased accuracy.
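Conceptually, the iterative refinement looks like the sketch below; the encoder and generator interfaces here are hypothetical stand-ins, not the actual ReStyle implementation:

```python
import torch

def restyle_invert(encoder, generator, image, avg_latent, n_iters=5):
    """Iteratively refine an inverted latent code (ReStyle-style sketch).

    encoder(x)   -> predicted residual latent, where x stacks the input image
                    and the current reconstruction along the channel dimension
    generator(w) -> image reconstructed from latent code w
    """
    latent = avg_latent.clone()
    recon = generator(latent)
    for _ in range(n_iters):
        # Predict a correction from (input image, current reconstruction) ...
        delta = encoder(torch.cat([image, recon], dim=1))
        # ... and apply it as a residual update to the latent code.
        latent = latent + delta
        recon = generator(latent)
    return latent, recon
```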
This repository contains:
- PyTorch training code for the Multilingual Latent Optimizer, Latent Mapper, and Global Direction
- PyTorch inference code for the Multilingual Latent Optimizer, Latent Mapper, and Global Direction
- Latent Mapper and Global Direction weights
- CelebA-HQ dataset latents (encoded via ReStyle)
- ReStyle encoder applied over pSp, pretrained on the FFHQ dataset
- M-BERT Base ViT-B transformer, available via Hugging Face
- CLIP
- StyleGAN2
The experiments were run under the following conditions:
- Python 3.7.12
- Torch 1.10.0+cu11
- Google Colab
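If you are unsure whether your Colab runtime matches these versions, a quick check like the following (plain Python/PyTorch calls, not part of this repository) can help:

```python
import sys
import torch

# Print the interpreter and PyTorch versions and confirm a GPU is visible.
print("Python:", sys.version.split()[0])
print("Torch :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```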
The code relies on Rosinality's PyTorch implementation of StyleGAN2. The facial recognition weights and the pretrained ReStyle encoder are to be downloaded here.
- --description is the driving text (it can be in any language).
- To control the manipulation effect, adjust the L2 lambda and ID lambda parameters.
Given a textual description, one can either edit a given image or generate a random image that best fits the description. Both operations can be done through the main.py script or the optimization_playground.ipynb notebook.
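For example, an edit driven by a Korean prompt might be launched as follows. The loss-weight flag names and values are assumptions based on the bullet above, so check the script's help for the real argument names:

```
# Hypothetical invocation of the latent optimizer.
# --l2_lambda / --id_lambda are assumed names for the L2 and ID loss weights,
# and 0.008 / 0.005 are only example values.
# Prompt translation: "a person with purple hair"
!python main.py --description "보라색 머리카락을 가진 사람" --l2_lambda 0.008 --id_lambda 0.005
```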
The code relies on Rosinality's PyTorch implementation of StyleGAN2.
- This repository trains the mapper with a dataset that was inverted by the e4e encoder instead of the ReStyle encoder.
- Inference on ReStyle-inverted images works just fine.
- The e4e-inverted dataset is located in the original StyleCLIP repository.
- To resume training, provide --checkpoint_path (see the example after the training command below).
- --description is the driving text (it can be in any language).
- To control the manipulation effect, adjust the L2 lambda and ID lambda parameters.
- Proper training takes up to 10 hours.
Example training command (the Korean prompt translates to "a person with purple hair"):
!python models/mapper/scripts/train.py --exp_dir exp_dir --no_fine_mapper --description "보라색 머리카락을 가진 사람" \
--latents_train_path data/celebA/train_faces.pt --latents_test_path data/celebA/test_faces.pt \
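To resume from a saved checkpoint, the same script can be pointed at the previous run; the checkpoint filename below is only a placeholder:

```
# Hypothetical resume command; replace the checkpoint path with your own file.
!python models/mapper/scripts/train.py --exp_dir exp_dir --checkpoint_path exp_dir/checkpoints/latest.pt \
--latents_train_path data/celebA/train_faces.pt --latents_test_path data/celebA/test_faces.pt
```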
- For inference, we provide several pretrained mappers (trained with Korean text prompts).
- Google Drive links for pretrained weights:
The code relies on the official TensorFlow implementation of StyleGAN2. The facial recognition weights and the pretrained ReStyle encoder are to be downloaded here.
Open the notebook in Colab and run all the cells.
In the last cell you can play with the image. beta corresponds to the disentanglement threshold, and alpha to the manipulation strength. After you set the desired parameters, run the last cell again to generate the image.
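Roughly speaking, the two knobs act on the edit direction as in the sketch below; the function and array names are illustrative and do not correspond to actual variables in the notebook:

```python
import numpy as np

def apply_global_direction(style_code, relevance, alpha, beta):
    """Sketch of how alpha and beta shape a global-direction edit.

    style_code : flattened style code of the image
    relevance  : per-channel relevance of the target text direction
    beta       : disentanglement threshold; channels below it are left untouched
    alpha      : manipulation strength along the remaining direction
    """
    direction = np.where(np.abs(relevance) >= beta, relevance, 0.0)
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction = direction / norm
    return style_code + alpha * direction
```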
The images below are from CelebA-HQ and were inverted into the latent space via the ReStyle encoder.
Compare results in other languages: English, Korean, Chinese, Spanish
Original