From d19b8ec4551e4a769869697438586396eef7b6c6 Mon Sep 17 00:00:00 2001
From: Dushyant Behl
Date: Wed, 18 Dec 2024 14:37:32 +0530
Subject: [PATCH] Update README.md with additional data formats and use cases.

Signed-off-by: Dushyant Behl
---
 README.md                           | 69 +++++++++++++++++++++++------
 docs/advanced-data-preprocessing.md |  1 +
 2 files changed, 56 insertions(+), 14 deletions(-)
 create mode 100644 docs/advanced-data-preprocessing.md

diff --git a/README.md b/README.md
index 6d18d72ab..72b7882cb 100644
--- a/README.md
+++ b/README.md
@@ -61,13 +61,28 @@ pip install fms-hf-tuning[aim]
 ```
 For more details on how to enable and use the trackers, please see [the experiment tracking section below](#experiment-tracking).
 
-## Data format
-We support the following data formats:
+## Data Support
 
-### 1. JSON formats with a single sequence and a specified response_template to use for masking on completion.
+Users can pass a single file, containing data in any of the [supported formats](#supported-data-formats), via the `--training_data_path` argument, along with the other arguments required for the various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below). Alternatively, they can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.
 
-#### 1.1 Pre-process the JSON/JSONL dataset
- Pre-process the JSON/JSONL dataset to contain a single sequence of each data instance containing input + Response. The trainer is configured to expect a response template as a string. For example, if one wants to prepare the `alpaca` format data to feed into this trainer, it is quite easy and can be done with the following code.
+Below, we list the data use cases supported via the `--training_data_path` argument. For details on our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).
+
+## Supported Data Formats
+We support the following data formats via the `--training_data_path` argument:
+
+Data Format | Tested Support
+------------|---------------
+JSON | ✅
+JSONL | ✅
+PARQUET | ✅
+ARROW | ✅
+
+## Use cases supported with `training_data_path` argument
+
+### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
+
+#### 1.1 Pre-process the dataset
+ Pre-process the dataset so that each data instance contains a single sequence combining the input and the response. The trainer is configured to expect a `response template` as a string. For example, preparing the `alpaca` format data to feed into this trainer can be done with the following code.
 
 ```python
 PROMPT_DICT = {
@@ -99,11 +114,10 @@ The `response template` corresponding to the above dataset and the `Llama` token
 
 The same approach can be applied to any dataset; more info can be found [here](https://huggingface.co/docs/trl/main/en/sft_trainer#format-your-input-prompts).
 
-Once the JSON is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
+Once the data is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
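+
+For example, a tuning invocation over such a pre-processed file could look like the following sketch. The file name, the field name, and the response template value here are illustrative, not prescriptive; the response template must match the marker your formatting function actually emits before the response.
+
+```
+# Sketch with illustrative values; "..." stands for the usual model and tuning arguments.
+python tuning/sft_trainer.py \
+  ... \
+  --training_data_path alpaca_formatted.json \
+  --dataset_text_field "output" \
+  --response_template "\n### Response:"
+```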
 
-#### 1.2 Format JSON/JSONL on the fly
- Pass a JSON/JSONL and a `data_formatter_template` to use the formatting function on the fly while tuning. The template should specify fields of JSON with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. 
- JSON fields can contain alpha-numeric characters, spaces and the following special symbols - "." , "_", "-".
+#### 1.2 Format the dataset on the fly
+ Pass a dataset and a `data_formatter_template` to use the formatting function on the fly while tuning. The template should specify fields of the dataset with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. Data fields can contain alpha-numeric characters, spaces, and the following special symbols: ".", "_", "-".
 
 Example: Train.json
 `[{ "input" : ,
 "output" : },
@@ -113,23 +127,50 @@
 ...
 ]`
 data_formatter_template: `### Input: {{input}} \n\n##Label: {{output}}`
 
-Formatting will happen on the fly while tuning. The keys in template should match fields in JSON file. The `response template` corresponding to the above template will need to be supplied. in this case, `response template` = `\n## Label:`.
+Formatting will happen on the fly while tuning. The keys in the template should match the fields in the dataset file. The `response template` corresponding to the above template will also need to be supplied; in this case, `response template` = `\n##Label:`.
 
 ##### In conclusion, if using the response_template and single sequence, either the `data_formatter_template` argument or `dataset_text_field` needs to be supplied to the trainer.
 
-### 2. JSON/JSONL with input and output fields (no response template)
+### 2. Dataset with input and output fields (no response template)
 
- Pass a JSON/JSONL containing fields "input" with source text and "output" with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of input to create single sequence, and input will be masked.
+ Pass a [supported dataset](#supported-data-formats) containing fields `"input"` with source text and `"output"` with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of the input to create a single sequence, and the input will be masked.
 
- The "input" and "output" field names are mandatory and cannot be changed.
+ The `"input"` and `"output"` field names are mandatory and cannot be changed.
 
-Example: Train.jsonl
+Example: for a JSONL dataset like `Train.jsonl`
 ```
 {"input": "### Input: Colorado is a state in USA ### Output:", "output": "USA : Location"}
 {"input": "### Input: Arizona is also a state in USA ### Output:", "output": "USA : Location"}
 ```
 
+### 3. Chat Style Single/Multi turn datasets
+
+ Pass a dataset containing single- or multi-turn chat interactions, supplied like:
+
+```
+{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
+```
+
+The chat template used to render this data will be `tokenizer.chat_template` from the model's default tokenizer config, or it can be overridden using the `--chat_template` argument.
+
+Users also need to pass `--response_template` and `--instruction_template`, which are pieces of text representing the start of the `assistant` and `human` responses inside the formatted chat template.
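+
+For illustration, a chat tuning invocation could look like the following sketch. The template strings below are placeholders, not defaults; use the markers that your model's chat template actually emits for the user and assistant turns.
+
+```
+# Sketch with placeholder template strings; "..." stands for the usual model and tuning arguments.
+python tuning/sft_trainer.py \
+  ... \
+  --training_data_path chat_data.jsonl \
+  --instruction_template "<|user|>" \
+  --response_template "<|assistant|>"
+```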
+
+The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform the masking, ensuring the model learns only on the `assistant` responses for both single and multi turn chat.
+
+### 4. Pre-tokenized datasets
+
+Users can also pass a pre-tokenized dataset (containing `input_ids` and `labels` columns) via the `--training_data_path` argument, e.g.
+
+```
+python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
+```
+
+For advanced data preprocessing support, please see [this document](./docs/advanced-data-preprocessing.md).
+
 ## Supported Models
 
 - For each tuning technique, we run testing on a single large model of each architecture type and claim support for the smaller models. For example, with QLoRA technique, we tested on granite-34b GPTBigCode and claim support for granite-20b-multilingual.
diff --git a/docs/advanced-data-preprocessing.md b/docs/advanced-data-preprocessing.md
new file mode 100644
index 000000000..a2c7c1dfa
--- /dev/null
+++ b/docs/advanced-data-preprocessing.md
@@ -0,0 +1 @@
+# Advanced Data Preprocessing
\ No newline at end of file