docs: draft docs for advanced data preprocessing md
Signed-off-by: Will Johnson <[email protected]>

Showing 1 changed file with 53 additions and 14 deletions.

# Advanced Data Processing

Our library also supports a powerful data processing backend which users can use to perform custom data preprocessing, including:
1. Support for multiple datasets.
1. Creating custom data processing pipelines for the datasets.
1. Combining multiple datasets into one, even if they have different formats.
1. Mixing datasets as required, and sampling each with different weights if needed.

These things are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the SFT trainer.

## Data Config

The data config is a YAML configuration file which users can provide to `sft_trainer.py`. In this file they can include a variety of datasets and configurations. The data config is passed to `sft_trainer.py` through the `--data_config` flag.

### What is the data config schema
The data config schema is designed to define datasets and their processing strategies in a structured way. It consists of the following top-level keys:
- `datapreprocessor`: Defines global data processing parameters, such as the type (`default`), the sampling stopping strategy (`all_exhausted` or `first_exhausted`), and a sampling seed for reproducibility.
- `datasets`: A list of dataset configurations, each describing the dataset name, paths, optional builders, sampling ratios, and data handlers.
At the top level, the data config looks like this:
```yaml
datapreprocessor:
  ...
datasets:
  ...
```

### How the user can write data configs
Users can create a data config file in YAML format. The file should follow the schema outlined above, with the following parameters:

`datapreprocessor`:
- `type` (optional): Type of data preprocessor; `default` is currently the only supported type.
- `sampling_stopping_strategy` (optional): Stopping strategy for sampling, either `all_exhausted` or `first_exhausted`; defaults to `all_exhausted`.
- `sampling_seed` (optional): An int used for reproducibility; defaults to 42.
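
For instance, a `datapreprocessor` block using all three fields might look like this (a minimal sketch; the values shown are simply the documented defaults):
```yaml
datapreprocessor:
  type: default                               # currently the only supported type
  sampling_stopping_strategy: all_exhausted   # or first_exhausted
  sampling_seed: 42                           # int, for reproducibility
```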

`datasets`:
- `name`: A unique identifier for the dataset.
- `data_paths`: A list of file paths or directories containing the dataset.
- `builder` (optional): Specifies a Hugging Face dataset builder, if applicable.
- `sampling` (optional): A float representing the sampling ratio (0.0 to 1.0).
- `data_handlers` (optional): A list of data handler configurations.
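
Putting those fields together, a single entry under `datasets` might look like the following sketch (the name and path are placeholders, and `json` is a standard Hugging Face builder shown only for illustration):
```yaml
datasets:
  - name: my_dataset               # unique identifier (placeholder)
    data_paths:
      - /data/my_dataset.jsonl     # file or directory path (placeholder)
    builder: json                  # optional Hugging Face dataset builder
    sampling: 0.5                  # optional: use 50% of this dataset
```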

For examples, see [predefined_data_configs](../tests/artifacts/predefined_data_configs/).

### What are data handlers
Data handlers are customizable components within the data config that allow users to preprocess or manipulate individual datasets. Each data handler has:
- `name`: The handler's unique identifier.
- `arguments`: A dictionary of parameters specific to the handler.

#### Preexisting data handlers
This library currently supports four preexisting data handlers:
- `tokenize_and_apply_input_masking`: Tokenizes input text and applies masking to the labels for causal language modeling tasks; good for input/output datasets.
- `apply_dataset_formatting`: Formats a dataset by appending an EOS token to a specified field.
- `apply_custom_data_formatting_template`: Applies a custom template (e.g., Alpaca style) to format dataset elements.
- `apply_tokenizer_chat_template`: Uses a tokenizer's chat template to preprocess dataset elements; good for single/multi-turn chat templates.
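
As an illustration, attaching one of these handlers to a dataset entry might look like the sketch below. The handler name comes from the list above, but the argument names (`fn_kwargs`, `input_field_name`, `output_field_name`) are assumptions made for illustration; consult the predefined data configs linked earlier for the exact arguments each handler accepts.
```yaml
data_handlers:
  - name: tokenize_and_apply_input_masking
    arguments:
      fn_kwargs:                     # argument names here are illustrative
        input_field_name: input      # column holding the prompt
        output_field_name: output    # column holding the completion
```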

#### Extra data handlers
Users can define custom data handlers by implementing their logic and specifying their names and arguments in the data config.

### How the user can pass the datasets
`data_paths` can contain file and/or folder paths, and can combine datasets with different formats as long as they have the same columns. Users can also use globbing patterns to pass in all files whose names match a given pattern.
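
For example, a single dataset entry could mix an explicit file, a folder, and a globbing pattern (all paths below are placeholders):
```yaml
datasets:
  - name: combined_dataset             # placeholder name
    data_paths:
      - /data/train_part1.jsonl        # a single file
      - /data/more_training_data/      # a folder of files
      - /data/shards/train-*.parquet   # a globbing pattern
```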

### What kind of datasets can be passed
The library supports datasets of type JSON, JSONL, Parquet, and Arrow. To see an up-to-date list of supported dataset types, see the `get_loader_for_filepath` function in [utils.py](../tuning/utils/utils.py).
Among these data types, the library supports pretokenized datasets as well as datasets that need processing by data handlers, either preexisting or custom.
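
For instance, a pretokenized dataset could be passed without any data handlers; this is a hedged sketch in which the path is a placeholder and the omission of `data_handlers` follows from the dataset already being tokenized:
```yaml
datasets:
  - name: pretokenized_dataset     # placeholder name
    data_paths:
      - /data/tokenized.arrow      # placeholder; already contains tokenized columns
    # no data_handlers needed: the data is already tokenized
```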

### How the user can perform sampling
- What does sampling mean?
  - Sampling allows users to specify that only a subset of the data should be processed. For example, a sampling ratio of 0.5 means 50% of the dataset will be used during training.
- How will it affect the datasets?
  - Sampling reduces the size of the dataset processed during training, which can speed up processing or focus training on specific portions of the data.
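
As a sketch of how sampling fits into a config, the example below draws 30% of one dataset and 70% of another, with the stopping strategy and seed set globally (dataset names and paths are placeholders):
```yaml
datapreprocessor:
  sampling_stopping_strategy: all_exhausted   # or first_exhausted
  sampling_seed: 42
datasets:
  - name: dataset_a                # placeholder
    data_paths:
      - /data/dataset_a.jsonl
    sampling: 0.3                  # use 30% of dataset_a
  - name: dataset_b                # placeholder
    data_paths:
      - /data/dataset_b.jsonl
    sampling: 0.7                  # use 70% of dataset_b
```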

### How the user can create a data config for the existing use cases

### Corner cases which need attention