This article was researched by Santiago (Santi) Gomez, a DataCebo intern. Santi is a rising sophomore at BYU and an aspiring entrepreneur who spent his summer learning and experimenting with CTGAN.
Demystifying the CTGAN Loss Function
The open source SDV library offers many machine learning models for creating synthetic data. One of the most popular is a GAN-based model called CTGAN (Conditional Tabular GAN) [1]. CTGAN uses deep learning to generate high-fidelity synthetic data, but it comes at the cost of greater complexity, as it is not always easy to understand how neural networks work.
In this article, we will look under the hood of our CTGAN model through a set of explorations, and answer the most frequently asked questions about it. We'll particularly focus on the loss function of CTGAN, which indicates how well it's able to learn from the real data.
What are GANs?
GANs, or generative adversarial networks, are algorithms that use two neural networks competing against each other (thus the “adversarial”) in order to generate synthetic data [2]. The two neural networks are known as the generator and the discriminator.
The discriminator's goal is to be able to tell apart real and synthetic data. Meanwhile, the generator's goal is to create high quality synthetic data that fools the discriminator. The overall setup is shown in the diagram below.
CTGAN is a type of GAN that's used for generating synthetic tabular data. You can run it from the open source SDV library.
Edit (Apr 14, 2023): This code snippet is out of date! In 2023, we introduced the new SDV 1.0 library with an improved API and workflows. To run CTGAN, please check out the new CTGAN API docs and CTGAN demo.
from sdv.demo import load_tabular_demo
from sdv.tabular import CTGAN

data = load_tabular_demo('student_placements')

# use CTGAN to learn from the real data
model = CTGAN(verbose=True)
model.fit(data)

# create 1000 rows of synthetic data
synthetic_data = model.sample(num_rows=1000)
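For reference, here is roughly the same workflow in the SDV 1.0 API. This is a sketch rather than a verified recipe: the demo dataset name and the verbose parameter are assumptions, so check the CTGAN API docs linked above for the exact details.

# rough SDV 1.0 equivalent of the snippet above; dataset name is an assumption
from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='student_placements',
)

# use CTGAN to learn from the real data
synthesizer = CTGANSynthesizer(metadata, verbose=True)
synthesizer.fit(real_data)

# create 1000 rows of synthetic data
synthetic_data = synthesizer.sample(num_rows=1000)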
The CTGAN Loss Function
How do these GANs improve over time? The key to training is a pair of loss functions: formulas that tell each network how to improve after each training iteration (or epoch). The discriminator and generator each have their own loss values. Epoch after epoch, these networks learn by trying to minimize their loss functions.
Loss functions are an area of active research, and many approaches have been proposed for different uses. In CTGAN, we have formulated custom loss functions for the purposes of creating synthetic data.
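In simplified form, with expectations over batches implied and the discriminator's gradient penalty [1] omitted, the two losses can be written as:

Generator loss: L_G = H - D(x')
Discriminator loss: L_D = D(x') - D(x)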
Here, x represents the real data and x' represents the synthetic data. Accordingly, D(x) is the discriminator's output given the real data and D(x') is for the synthetic. Finally, H is the cross entropy score and is always positive.
The discriminator is learning to produce low values if the data is synthetic and high values if it is real. The range of these values depends on the linear transformations and dimensions of the input data [1].
Here is an example of generator and discriminator loss values from CTGAN, varying over 1000 epochs.
FAQs
In this section, we'll answer some frequently asked questions about CTGAN and its loss values. The graphs in this section were produced by running the CTGAN model on datasets and graphing the loss values.
Q1: My generator loss is negative. Is that ok?
Yes. The generator and discriminator are both trying to minimize their loss values, so a negative loss is actually a sign that the networks are working well! Remember that the generator loss consists of two components:
the cross entropy (H), which can never be negative, and
-D(x'), which is negative if the discriminator assigns a high score to synthetic data, i.e. the synthetic data fools the discriminator.
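As a hypothetical example: if H = 0.3 and the discriminator scores the synthetic data at D(x') = 1.5, the generator loss is 0.3 - 1.5 = -1.2. The more convincingly the synthetic data fools the discriminator, the larger D(x') gets and the more negative the generator loss becomes.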
Because we are trying to minimize the loss function, the generator loss will generally tend to become negative over time. In our experiments, it was common for the loss values to be negative, especially the generator loss.
Q2: What should my loss functions look like?
When CTGAN performs well, both the generator and discriminator losses eventually stabilize once they have run through enough epochs.
In many of our experiments, the generator loss started positive and eventually stabilized at a negative value while the discriminator loss oscillated around 0. This is shown below for a variety of different datasets.
At first, this might seem counterintuitive: why does the discriminator loss oscillate around 0 if the discriminator is supposed to improve over time? Remember that the discriminator and the generator are adversaries. As the generator improves, it gets harder and harder for the discriminator to tell synthetic data apart from real data. So even when the discriminator loss oscillates around 0, both networks may still be improving.
Many users have reported a similar pattern while modeling their datasets. This likely means that CTGAN is training correctly. Other patterns may be possible in the path to optimization, but be cautious if the loss values are not stabilizing, as this indicates that CTGAN is not able to effectively learn patterns in the real data. An example is shown below.
In this example, the loss values are not only failing to stabilize but they are actually getting noisier over time. If you see this pattern, you may need to update the parameters of CTGAN or your data itself might not be suitable for CTGAN.
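If you want to inspect the loss curves for your own model, the sketch below is one way to do it. It assumes a recent version of the underlying ctgan package, which records the per-epoch losses in a loss_values DataFrame after fitting; older versions only print them when verbose=True. The 'gender' column is a hypothetical example.

import matplotlib.pyplot as plt
from ctgan import CTGAN

# fit the underlying CTGAN model directly; discrete_columns lists the
# categorical columns in your table ('gender' is hypothetical)
model = CTGAN(epochs=1000, verbose=True)
model.fit(data, discrete_columns=['gender'])

# recent ctgan versions store per-epoch losses in a DataFrame
losses = model.loss_values
plt.plot(losses['Epoch'], losses['Generator Loss'], label='Generator')
plt.plot(losses['Epoch'], losses['Discriminator Loss'], label='Discriminator')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()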
Q3: How many epochs will result in the best model?
The short answer is that, since "best" is defined by the metrics used to evaluate the synthetic data, there is no one-size-fits-all stopping criterion.
Nevertheless, if your goal is to optimize for column distributions and correlation quality, our experiments suggest that a good time to stop is when the generator loss stabilizes at a negative value. We used the following metrics to measure quality:
CorrelationSimilarity, which measures how well the synthetic data captures correlations between numerical columns (e.g. a strong correlation between the "age" and "height" columns).
KSComplement, which uses the Kolmogorov–Smirnov test [3] to quantify the distance between numerical distributions.
TVComplement, which computes the Total Variation Distance [4] to quantify the distance between categorical distributions.
CategoricalCoverage, which measures whether the synthetic data covers all the categories from the real data.
All metrics are implemented in our SDMetrics library. They return a score between 0 and 1, where 0 means worst quality and 1 means best.
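Here is a minimal sketch of computing these scores with SDMetrics. The column names below are hypothetical, and recent SDMetrics releases name the coverage metric CategoryCoverage.

from sdmetrics.single_column import CategoryCoverage, KSComplement, TVComplement
from sdmetrics.column_pairs import CorrelationSimilarity

# distance between numerical distributions (1.0 = identical)
ks = KSComplement.compute(real_data=data['salary'], synthetic_data=synthetic_data['salary'])

# distance between categorical distributions
tv = TVComplement.compute(real_data=data['gender'], synthetic_data=synthetic_data['gender'])

# whether the synthetic data covers all categories from the real data
coverage = CategoryCoverage.compute(real_data=data['gender'], synthetic_data=synthetic_data['gender'])

# correlation similarity between a pair of numerical columns
corr = CorrelationSimilarity.compute(
    real_data=data[['age', 'salary']],
    synthetic_data=synthetic_data[['age', 'salary']],
)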
The graph below shows a side-by-side comparison of the loss values against the 4 quality metrics.
As shown in the experiments above, our metrics tend to stop improving right when the generator begins to stabilize at a negative value.
While our experiment looked at 4 particular quality metrics, the SDMetrics library provides a variety of other metrics suited for other goals and use cases. If you are actively exploring this space, we'd love to hear more about your findings!
Note that the current process of computing metrics per epoch is manual and requires CTGAN to start from the beginning. We have an open feature request for tracking metrics during training, which would make this process more efficient. To help us prioritize, please add your use cases to the feature request.
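In the meantime, a simple (if slow) way to approximate per-epoch tracking is to retrain from scratch at increasing epoch budgets and score each run. The sketch below uses a single metric and an arbitrary epoch grid; 'salary' is a hypothetical numerical column.

from sdv.tabular import CTGAN
from sdmetrics.single_column import KSComplement

scores = {}
for epochs in [50, 100, 300, 1000]:
    # retraining from the beginning each time, per the note above
    model = CTGAN(epochs=epochs)
    model.fit(data)
    synthetic = model.sample(num_rows=len(data))
    scores[epochs] = KSComplement.compute(
        real_data=data['salary'],
        synthetic_data=synthetic['salary'],
    )

print(scores)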
Q4: Does the number of epochs I should train for depend on the size of my dataset?
Not necessarily. From our explorations, there is no apparent correlation between the size of a dataset and the optimal number of epochs needed to train CTGAN. One reason might be dataset complexity: a small dataset may contain patterns that are hard for CTGAN to learn, requiring more epochs of training than a larger dataset.
Resources
Want to dig deeper? Our Colab Notebook includes the code for creating a CTGAN model, capturing the loss values and evaluating the synthetic data. Use it on the SDV demo data or your own custom datasets!
For this exploration, we used 6 datasets:
Libras, NBA_v1, RacketSports and KRK_v1 from the SDV demos
kaggle-bank from Kaggle: [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez, "Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology."
I found a bug / have a feature request. What can I do?
Please file an issue on GitHub, paying special attention to noting the version of SDV you are using. Code snippets and stack traces will help us debug. Any other info about your use case (what are you trying to achieve with synthetic data?) will help us prioritize new feature requests.
You can also ask questions and connect with the SDV community by joining our Slack!
References
[1] L. Xu, M. Skoularidou, A. Cuesta-Infante and K. Veeramachaneni. "Modeling Tabular Data using Conditional GAN." https://arxiv.org/pdf/1907.00503.pdf
[2] Generative adversarial network. https://en.wikipedia.org/wiki/Generative_adversarial_network
[3] Kolmogorov–Smirnov test. https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
[4] Total variation distance of probability measures. https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures