This project aims to generate galaxy images based on data provided by the Hubble Space Telescope (HST). To do so, we are implementing an unsupervised machine learning technique called a Variational Autoencoder (VAE). The trained VAE model allows us to decode a random latent variable into a new galaxy image.
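At the heart of a VAE is the reparameterization trick, which makes the sampling of the latent variable differentiable so the model can be trained end to end. A minimal NumPy sketch (the batch size and latent dimension below are illustrative, not the project's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    This 'reparameterization trick' lets gradients flow through the
    sampling step when training a VAE.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy example: a batch of 4 galaxies encoded into a 32-dimensional latent space.
mu = np.zeros((4, 32))       # encoder means (hypothetical values)
log_var = np.zeros((4, 32))  # encoder log-variances (hypothetical values)
z = reparameterize(mu, log_var)
print(z.shape)  # (4, 32)
```

During generation, the decoder is applied to such a `z` drawn from the prior instead of from the encoder's posterior.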
We used two datasets:
- 47 955 galaxies from Hubble's famous Deep Field image (128 $\times$ 128 pixels each)
- 81 499 galaxies and their associated redshifts from the Cosmic Survey (158 $\times$ 158 pixels each)
For each dataset, we developed a dedicated VAE model.
Based on Kihyuk Sohn's paper, we also implemented a conditional version on the second dataset, conditioned on the redshift of each galaxy. In the end, our conditional VAE is able to generate galaxy structures for a specific redshift. We can even interpolate the same galaxy structure across different redshifts.
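The redshift interpolation described above amounts to fixing the latent code and sweeping the conditioning value. In this sketch, `decode` is a hypothetical stand-in (a random linear map) for the trained CVAE decoder, which in the real model produces the 158 $\times$ 158 images:

```python
import numpy as np

rng = np.random.default_rng(42)
latent_dim, n_steps = 32, 5

# Hypothetical decoder stand-in: in the real model this is the trained
# CVAE decoder network mapping (z, redshift) -> a 158x158 image.
W = rng.standard_normal((latent_dim + 1, 158 * 158)) * 0.01

def decode(z, redshift):
    cond = np.concatenate([z, [redshift]])  # append the conditioning value
    return np.tanh(cond @ W).reshape(158, 158)

# Interpolation: keep the same galaxy structure (fixed z), sweep the redshift.
z = rng.standard_normal(latent_dim)
images = [decode(z, r) for r in np.linspace(0.1, 2.0, n_steps)]
print(len(images), images[0].shape)  # 5 (158, 158)
```

Because `z` is held fixed, only the conditioning changes between frames, so the sequence shows one galaxy structure evolving with redshift.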
In the folder `Models architecture`, you will find the details of the different models used.
First, two disentangled VAE models (one per dataset) with almost the same architecture (only a few changes to the convolutional layer arguments due to the different image sizes). The models can take either a single value or an array for the disentanglement hyperparameter.
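As a sketch of how such a hyperparameter typically enters the objective, here is a disentangled-VAE loss in NumPy, assuming the hyperparameter is the usual KL weight (often called $\beta$); the exact form in the project's code may differ:

```python
import numpy as np

def vae_loss(x, x_rec, mu, log_var, beta=1.0):
    """Disentangled-VAE loss: reconstruction + beta * KL(q(z|x) || N(0, I)).

    `beta` weights the KL term; beta=1 recovers the standard VAE.
    """
    rec = np.mean((x - x_rec) ** 2)                               # reconstruction term
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))    # KL divergence term
    return rec + beta * kl

# Sanity check: a perfect reconstruction with a standard-normal
# posterior (mu = 0, log_var = 0) gives zero loss for any beta.
x = np.zeros((4, 16))
mu, log_var = np.zeros((4, 8)), np.zeros((4, 8))
print(vae_loss(x, x, mu, log_var, beta=4.0))  # 0.0
```

Passing an array of values for the weight would then correspond to evaluating (or annealing) this loss over several settings at once.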
Then, three conditional VAE models (for the second dataset):
- `cvae`: a new input is created by concatenating the redshifts to the galaxy images as a second channel, which is fed to the CNN. The redshifts are then concatenated to the latent variable $z$ as a second channel. The final output is the reconstructed galaxy image.
- `cvae2`: the redshifts are concatenated to the output of the encoder's CNN and to the latent variable $z$ before decoding.
- `fancy_cvae`: similar to `cvae`, but the final output is a prediction of both the galaxy images and the redshifts.
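The `cvae2` conditioning scheme can be illustrated with array shapes alone; the feature and latent dimensions below are made up for the example:

```python
import numpy as np

batch, feat_dim, latent_dim = 8, 256, 32
rng = np.random.default_rng(1)

cnn_features = rng.standard_normal((batch, feat_dim))  # encoder CNN output (hypothetical)
redshift = rng.uniform(0.0, 2.0, size=(batch, 1))      # one redshift per galaxy

# cvae2-style conditioning: append the redshift to the CNN features
# before the dense layers that predict (mu, log_var)...
enc_in = np.concatenate([cnn_features, redshift], axis=1)

# ...and again to the sampled latent variable z before decoding.
z = rng.standard_normal((batch, latent_dim))
dec_in = np.concatenate([z, redshift], axis=1)

print(enc_in.shape, dec_in.shape)  # (8, 257) (8, 33)
```

Conditioning on flat vectors this way avoids building a full extra image channel for the redshift, which is one plausible reason for the shorter training time noted below.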
The performance of these architectures is very similar. The only notable difference is the training time, which is about 20 sec/epoch shorter for `cvae2` than for the others, so I would recommend using the `cvae2` architecture for conditioned galaxy image generation.
In the folder `notebooks`, you will find all the code related to each model's training and the evaluation of its performance:
- Loss
- Image reconstruction
- Image generation
- Latent space visualization
Currently, we generate images by feeding random samples from a Gaussian distribution to the decoder.
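A minimal sketch of that sampling step, assuming a standard-normal prior over the latent space (the latent dimension and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
latent_dim, n_images = 32, 16

# Draw latent codes from the standard-normal prior; each row would then be
# passed through the trained decoder to produce one generated galaxy image.
z_samples = rng.standard_normal((n_images, latent_dim))
print(z_samples.shape)  # (16, 32)
```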
A next step would be training our model on a bigger dataset (314 000 galaxy images instead of 81 500).
We could likely improve the results by fine-tuning the model's hyperparameters or with machine learning techniques not yet implemented (e.g. learning rate decay).