We support several common data schemas. Below are a few examples with their corresponding configuration files; start from the one that is the "nearest match" to your data.
Note: across all examples, `iteration` is set to a small number to ensure a quick end-to-end (E2E) test. To generate high-quality synthetic data, we recommend increasing `iteration` according to your experience and available computational resources.
We support four different fields:
- Bit field (encoded as bit strings), e.g.,
  `{ "column": "srcip", "type": "integer", "encoding": "bit", "n_bits": 32 }`

  An optional property of this field is `truncate`, a boolean value that defaults to `False`. If `truncate` is set to `true`, large integers are truncated and only the most significant `n_bits` bits are considered.
- Word2Vec field (encoded as Word2Vec vectors), e.g.,
  `{ "column": "srcport", "type": "integer", "encoding": "word2vec_port" }`
- Categorical field (encoded as one-hot encoding), e.g.,
  `{ "column": "type", "type": "string", "encoding": "categorical" }`
- Continuous field, e.g.,
  `{ "column": "pkt", "type": "float", "normalization": "ZERO_ONE", "log1p_norm": true }`
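
To make two of these encodings concrete, here is a minimal Python sketch of what a bit field with `truncate` and a continuous field with `ZERO_ONE` and `log1p_norm` could compute. It is an illustration only, not the library's implementation. The function names `encode_bits` and `normalize_zero_one` are hypothetical. Raising an error when a value does not fit and `truncate` is off, and applying `log1p` before min-max scaling, are both assumptions.

```python
import math


def encode_bits(value: int, n_bits: int, truncate: bool = False) -> list[int]:
    """Encode a non-negative integer as a fixed-width list of bits, MSB first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    extra = value.bit_length() - n_bits
    if extra > 0:
        if not truncate:
            # Assumption: the behavior without `truncate` is not described
            # in this section, so this sketch simply refuses the value.
            raise ValueError(f"{value} does not fit in {n_bits} bits")
        # Assumed reading of `truncate`: drop the low-order bits and keep
        # only the most significant n_bits bits.
        value >>= extra
    return [(value >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def normalize_zero_one(values: list[float], log1p_norm: bool = False) -> list[float]:
    """Optionally apply log(1 + x), then min-max scale the column into [0, 1]."""
    if log1p_norm:
        # Assumption: the log transform is applied before scaling.
        values = [math.log1p(v) for v in values]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range to scale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `encode_bits(5, 4)` gives `[0, 1, 0, 1]`, and `encode_bits(0b110101, 4, truncate=True)` keeps the top four bits, `[1, 1, 0, 1]`.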
Single-event schema contains one timeseries per row.
Timestamp (optional) | Metadata 1 | Metadata 2 | ... | Timeseries 1 | Timeseries 2 | ... |
---|---|---|---|---|---|---|
t1 | | | | | | |
t2 | | | | | | |
... | | | | | | |

- PCAP

  Timestamp | Srcip | Dstip | Srcport | Dstport | Proto | Pkt_size | ... |
  ---|---|---|---|---|---|---|---|
  t1 | | | | | | | |
  t2 | | | | | | | |
  ... | | | | | | | |

- NetFlow (configuration_file)

Multi-event data schema contains multiple timeseries per row.
Metadata 1 | Metadata 2 | ... | {Timestamp (optional), Timeseries 1, Timeseries 2, ...} | {Timestamp (optional), Timeseries 1, Timeseries 2, ...} | ... |
---|---|---|---|---|---|

- Wikipedia dataset (configuration_file)

  Domain | Access type | Agent | {Date 1, page view} | {Date 2, page view} | ... |
  ---|---|---|---|---|---|
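
One way to picture the difference between the two layouts is to group single-event rows that share the same metadata into one multi-event row. The Python sketch below does that for rows held as dictionaries. The helper `to_multi_event` and its parameters are hypothetical and serve only to illustrate the row shapes; the library is not assumed to expose or require such a conversion.

```python
from collections import defaultdict


def to_multi_event(rows, metadata_cols, timeseries_cols, timestamp_col=None):
    """Group single-event rows (dicts) by their metadata values.

    Returns a list of (metadata, events) pairs. Each event is a dict holding
    the optional timestamp followed by the timeseries columns.
    """
    groups = defaultdict(list)
    for row in rows:
        # Rows with identical metadata values belong to the same output row.
        key = tuple(row[col] for col in metadata_cols)
        event = {col: row[col] for col in timeseries_cols}
        if timestamp_col is not None:
            event = {timestamp_col: row[timestamp_col], **event}
        groups[key].append(event)

    result = []
    for key, events in groups.items():
        if timestamp_col is not None:
            # Keep the events of each group in time order.
            events.sort(key=lambda e: e[timestamp_col])
        result.append((dict(zip(metadata_cols, key)), events))
    return result
```

With PCAP-like rows, `metadata_cols` might be `["srcip", "dstip"]` and `timeseries_cols` might be `["pkt_size"]`; which columns count as metadata depends on your own configuration.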