Synthetic Tabular Data Generation Library
This Python library specializes in the generation of synthetic tabular data. It offers a diverse range of statistical, Machine Learning (ML), and Deep Learning (DL) methods that capture the patterns of real datasets and replicate them in a synthetic context. Its applications include pre-processing of tabular datasets, data balancing, resampling, and more.
- 🔩 Pre-process your data.
- 🕜 State-of-the-art models.
- ♻️ Easy to use and customize.
The `gentab` library is available via pip. We recommend using a virtual environment to avoid conflicts with other software on your machine.
```bash
pip install gentab
```
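For example, to install `gentab` inside a fresh virtual environment:

```bash
# Create and activate an isolated environment
# (on Windows use .venv\Scripts\activate)
python -m venv .venv
source .venv/bin/activate

# Install the library
pip install gentab
```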
Below is the list of generators currently available in the library.
| Model | Example | Paper |
|---|---|---|
| SMOTE | link | |
| ADASYN | link | |
| Gaussian Copula | link | |
| TVAE | link | |
| CTGAN | link | |
| CTAB-GAN | link | |
| CTAB-GAN+ | link | |
| ForestDiffusion | link | |
| GReaT | link | |
| Tabula | link | |
| Copula GAN | link link | |
| AutoDiffusion | link | |
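The example below loads the PlayNet configuration, reduces and merges its classes, generates synthetic samples with the AutoDiffusion generator, evaluates them with an MLP evaluator, and saves the generated dataset to disk.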
```python
from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console

# Load the dataset described by the PlayNet configuration
config = Config("configs/playnet.json")
dataset = Dataset(config)

# Reduce the number of samples in the listed classes
dataset.reduce_size({
    "left_attack": 0.97,
    "right_attack": 0.97,
    "right_transition": 0.9,
    "left_transition": 0.9,
    "time_out": 0.8,
    "left_penal": 0.5,
    "right_penal": 0.5,
})

# Merge the left/right variants into single classes
dataset.merge_classes({
    "attack": ["left_attack", "right_attack"],
    "transition": ["left_transition", "right_transition"],
    "penalty": ["left_penal", "right_penal"],
})

# Reduce the dataset's memory footprint
dataset.reduce_mem()
console.print(dataset.class_counts(), dataset.row_count())

# Generate synthetic samples with AutoDiffusion
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())

# Evaluate the synthetic data with an MLP evaluator and save it to disk
evaluator = MLP(generator)
evaluator.evaluate()
dataset.save_to_disk(generator)
```
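Generators can also be tuned. The following example tunes AutoDiffusion on the Adult dataset with LightGBM as the evaluation model, running up to 10 trials within an 8-hour timeout, and saves the tuned result to disk.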
```python
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

# Load the Adult dataset and merge the duplicate label spellings
config = Config("configs/adult.json")
dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."],
    ">50K": [">50K."],
})
dataset.reduce_mem()

# Tune AutoDiffusion with LightGBM as the evaluation model:
# up to 10 trials within an 8-hour timeout
generator = AutoDiffusion(dataset)
evaluator = LightGBM(generator)
trials = 10
timeout = 60 * 60 * 8  # 8 hours
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout)
tuner.tune()

# Save the tuned result to disk
tuner.save_to_disk()
```
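Generated and tuned datasets that were previously saved to disk can be loaded back without regenerating them. The example below reloads them and repeats the evaluation, including the baseline evaluation.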
```python
from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset

# Load the Adult dataset and merge the duplicate label spellings
config = Config("configs/adult.json")
dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."],
    ">50K": [">50K."],
})
dataset.reduce_mem()

# Load a previously generated (but not tuned) dataset from disk...
generator = AutoDiffusion(dataset)
generator.load_from_disk()

# ...and do work with it, e.g. evaluate it and the baseline
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()

# Load a previously tuned and saved dataset...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()

# ...and repeat the evaluation with the tuned data
evaluator.evaluate()
evaluator.evaluate_baseline()
```