docs: draft docs for advanced data preprocessing md
Signed-off-by: Will Johnson <[email protected]>
willmj committed Dec 20, 2024
1 parent 615ed74 commit 2cc74c1
Showing 1 changed file with 53 additions and 14 deletions.
67 changes: 53 additions & 14 deletions docs/advanced-data-preprocessing.md
@@ -1,34 +1,73 @@
# Advanced Data Processing
Our library also supports a powerful data processing backend that users can use to perform custom data preprocessing, including:
1. Support for multiple datasets
1. Creating custom data processing pipelines for the datasets
1. Combining multiple datasets into one, even if they have different formats
1. Mixing datasets as required and, if needed, sampling each with different weights

These features are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the SFT trainer. We explain the data config in detail next.

## Data Config

The data config is a YAML configuration file which users can provide to `sft_trainer.py`. In this file they can include a variety of datasets and configurations. The data config is passed to `sft_trainer` through the `--data_config` flag.

### What is the data config schema
The data config schema is designed to define datasets and their processing strategies in a structured way. It consists of the following top-level keys:
- `datapreprocessor`: Defines global data processing parameters, such as the type (`default`), sampling stopping strategy (`all_exhausted` or `first_exhausted`), and sampling seed for reproducibility.
- `datasets`: A list of dataset configurations, each describing the dataset name, paths, optional builders, sampling ratios, and data handlers.
At the top level, the data config looks like this:
```yaml
datapreprocessor:
  ...
datasets:
  ...
```
### How the user can write data configs
Users can create a data config file in YAML format. The file should follow the schema outlined above with the following parameters:
`datapreprocessor`:
- `type` (optional): Type of data preprocessor, `default` is currently the only supported type.
- `sampling_stopping_strategy` (optional): Stopping strategy, either `all_exhausted` or `first_exhausted`, defaults to `all_exhausted`.
- `sampling_seed` (optional): An int for reproducibility, defaults to 42.

`datasets`:
- `name`: A unique identifier for the dataset.
- `data_paths`: A list of file paths or directories containing the dataset.
- `builder` (optional): Specifies a Hugging Face dataset builder, if applicable.
- `sampling` (optional): A float representing the sampling ratio (0.0 to 1.0).
- `data_handlers` (optional): A list of data handler configurations.

For examples, see [predefined_data_configs](../tests/artifacts/predefined_data_configs/).

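As an illustration, below is a minimal data config sketch that follows the schema above; the dataset name, file path, sampling value, and handler arguments are placeholders rather than values the library prescribes:

```yaml
datapreprocessor:
  type: default
  sampling_stopping_strategy: all_exhausted
  sampling_seed: 42
datasets:
  - name: my_dataset                 # placeholder dataset identifier
    data_paths:
      - "/path/to/my_dataset.jsonl"  # placeholder path
    sampling: 0.5                    # optional sampling ratio
    data_handlers:                   # optional, see the data handlers section below
      - name: tokenize_and_apply_input_masking
        arguments: {}                # handler-specific arguments, elided here
```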
### What are data handlers
Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. Each data handler has:
- `name`: The handler's unique identifier.
- `arguments`: A dictionary of parameters specific to the handler.

#### Preexisting data handlers
This library currently supports four preexisting data handlers:
- `tokenize_and_apply_input_masking`: Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
- `apply_dataset_formatting`: Formats a dataset by appending an EOS token to a specified field.
- `apply_custom_data_formatting_template`: Applies a custom template (e.g., Alpaca style) to format dataset elements.
- `apply_tokenizer_chat_template`: Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.

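For instance, a handler entry under a dataset's `data_handlers` key pairs one of the names above with its arguments. The sketch below is only illustrative; the argument key and template placeholder syntax are assumptions, so consult the handler implementations for the exact parameters they accept:

```yaml
data_handlers:
  - name: apply_custom_data_formatting_template
    arguments:
      # Illustrative only: the exact argument keys depend on the handler's signature.
      template: "### Input: {{input}}\n\n### Response: {{output}}"
```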
#### Extra data handlers
Users can define custom data handlers by implementing their logic and specifying their names and arguments in the data config.

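A custom handler would be referenced in the config the same way; the handler name and argument below are hypothetical and stand in for whatever logic the user implements:

```yaml
data_handlers:
  - name: my_custom_handler    # hypothetical user-implemented handler
    arguments:
      my_argument: some_value  # whatever parameters the custom logic expects
```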
### How the user can pass the datasets
`data_paths` can be file and/or folder paths and can combine datasets with different formats as long as they have the same columns. Users can also use glob patterns to pass in all datasets whose paths match a specific pattern.

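A sketch of a `data_paths` list mixing a single file, a folder, and a glob pattern is shown below; the paths are placeholders, and all sources are assumed to share the same columns:

```yaml
datasets:
  - name: mixed_sources               # placeholder dataset identifier
    data_paths:
      - "/data/instructions.jsonl"    # a single file
      - "/data/parquet_shards/"       # a folder of files
      - "/data/extra/*_train.jsonl"   # a glob pattern matching several files
```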
### What kind of datasets can be passed
The library supports datasets of type JSON, JSONL, Parquet, and Arrow. To see an up to date version of supported dataset types, see the `get_loader_for_filepath` function in [utils.py](../tuning/utils/utils.py).
Within these formats, the library supports both pretokenized datasets and datasets that can be processed by data handlers, either preexisting or custom.

### How the user can perform sampling
- What does sampling mean?
  - Sampling allows users to specify a subset of the data to be processed. For example, a sampling ratio of 0.5 means 50% of the dataset will be used during training.
- How will it affect the datasets?
  - Sampling reduces the size of the dataset processed during training, which can speed up processing or focus on specific portions of the data.

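As a sketch, the config below samples two placeholder datasets at different ratios and sets the stopping strategy and seed described earlier; the ratios and paths are illustrative only:

```yaml
datapreprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted  # stop once the first sampled dataset is exhausted
  sampling_seed: 42                            # fixed seed for reproducible sampling
datasets:
  - name: dataset_a         # placeholder
    data_paths:
      - "/data/a.jsonl"
    sampling: 0.3           # use roughly 30% of dataset_a
  - name: dataset_b         # placeholder
    data_paths:
      - "/data/b.jsonl"
    sampling: 0.7           # use roughly 70% of dataset_b
```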
### How the user can create a data config for the existing use cases

### Corner cases which need attention
