Update README.md with additional data formats and use cases.
Signed-off-by: Dushyant Behl <[email protected]>
dushyantbehl committed Dec 20, 2024
1 parent d7f06f5 commit d19b8ec
Showing 2 changed files with 56 additions and 14 deletions.
69 changes: 55 additions & 14 deletions README.md
```
pip install fms-hf-tuning[aim]
```
For more details on how to enable and use the trackers, please see [the experiment tracking section below](#experiment-tracking).

## Data Support

Users can pass in a single file via the `--training_data_path` argument containing data in any of the [supported formats](#supported-data-formats), alongside other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below), or can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.

Below we list the data use cases supported via the `--training_data_path` argument. For details of our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).

## Supported Data Formats
We support the following data formats via the `--training_data_path` argument:

Data Format | Tested Support
------------|---------------
JSON | ✅
JSONL | ✅
PARQUET | ✅
ARROW | ✅

## Use cases supported with `training_data_path` argument

### 1. Data formats with a single sequence and a specified `response_template` to use for masking on completion

#### 1.1 Pre-process the dataset
Pre-process the dataset so that each data instance contains a single sequence combining the input and the response. The trainer is configured to expect a `response template` as a string. For example, to prepare the `alpaca` format data to feed into this trainer, one can do so with the following code.

```python
PROMPT_DICT = {
    ...
```

The `response template` corresponding to the above dataset and the `Llama` tokenizer is `\n### Response:`.

The same way can be applied to any dataset, with more info can be found [here](https://huggingface.co/docs/trl/main/en/sft_trainer#format-your-input-prompts).

Once the data is converted using the formatting function, pass the `dataset_text_field` containing the single sequence to the trainer.
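For instance, a minimal sketch of such a conversion (the field names `instruction` and `output`, the file names, and the prompt text are illustrative and depend on your dataset):

```python
from datasets import load_dataset

# Illustrative formatting function: combine prompt and response into one sequence.
def format_fn(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:"
    return {"formatted": f"{prompt} {example['output']}"}

ds = load_dataset("json", data_files="train.json")["train"]
ds = ds.map(format_fn)
ds.to_json("formatted_train.json")
```

The trainer would then be pointed at the converted file with `--dataset_text_field formatted` and `--response_template "\n### Response:"` so that the loss is computed only on the completion.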

#### 1.2 Format the dataset on the fly
Pass a dataset and a `data_formatter_template` to apply the formatting function on the fly while tuning. The template should reference dataset fields with `{{field}}`. While tuning, the data will be converted to a single sequence using the template. Data fields can contain alphanumeric characters, spaces, and the special symbols ".", "_", and "-".

Example: Train.json

```
[{ "input" : <text>,
   "output" : <text> },
 ...
]
```

data_formatter_template: `### Input: {{input}} \n\n## Label: {{output}}`

Formatting will happen on the fly while tuning. The keys in the template should match fields in the dataset file. The `response template` corresponding to the above template will need to be supplied; in this case, `response template` = `\n## Label:`.

##### In conclusion, if using the `response_template` and single sequence, either the `data_formatter_template` argument or `dataset_text_field` needs to be supplied to the trainer.
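For example, a hypothetical invocation for the on-the-fly formatting case might look like this (other required flags elided):

```
python tuning/sft_trainer.py ... \
  --training_data_path Train.json \
  --data_formatter_template '### Input: {{input}} \n\n## Label: {{output}}' \
  --response_template '\n## Label:'
```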

### 2. Dataset with input and output fields (no response template)

Pass a [supported dataset](#supported-data-formats) containing the field `"input"` with source text and the field `"output"` with class labels. Pre-format the input as you see fit. The output field will simply be concatenated to the end of the input to create a single sequence, and the input will be masked.

The "input" and "output" field names are mandatory and cannot be changed.
The `"input"` and `"output"` field names are mandatory and cannot be changed.

Example: For a JSONL dataset like `Train.jsonl`:

```
{"input": "### Input: Colorado is a state in USA ### Output:", "output": "USA : Location"}
{"input": "### Input: Arizona is also a state in USA ### Output:", "output": "USA : Location"}
```
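Such a file can then be passed directly to the trainer; a sketch, with other flags elided:

```
python tuning/sft_trainer.py ... --training_data_path Train.jsonl
```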

### 3. Chat-style single/multi-turn datasets

Pass a dataset containing single- or multi-turn chat data. Each record could look like:

```
{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
```


The chat template used to render this data will be `tokenizer.chat_template` from the model's default tokenizer config, or it can be overridden using the `--chat_template <chat-template-string>` argument.

Users also need to pass `--response_template` and `--instruction_template`, which are pieces of text marking the start of the `assistant` and `human` responses inside the formatted chat template.

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of text, ensuring the model learns only on the `assistant` responses for both single- and multi-turn chat.
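A hypothetical invocation might look like the following; the `--instruction_template` and `--response_template` strings here are placeholders and must match the markers your chat template actually emits:

```
python tuning/sft_trainer.py ... \
  --training_data_path chat_data.jsonl \
  --instruction_template "<|user|>" \
  --response_template "<|assistant|>"
```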

### 4. Pre-tokenized datasets

Users can also pass a pre-tokenized dataset (containing `input_ids` and `labels` columns) via the `--training_data_path` argument, e.g.

```
python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
```
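For reference, a minimal sketch of producing such a file with Hugging Face `datasets` (the tokenizer, file names, and labeling scheme are illustrative; adjust label masking to your needs):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Maykeye/TinyLLama-v0")

def tokenize(example):
    input_ids = tokenizer(example["text"])["input_ids"]
    # For causal LM tuning, labels typically mirror input_ids;
    # positions to exclude from the loss would be set to -100.
    return {"input_ids": input_ids, "labels": list(input_ids)}

ds = Dataset.from_list([{"text": "### Input: Colorado is a state in USA ### Output: USA : Location"}])
ds = ds.map(tokenize, remove_columns=["text"])
ds.to_parquet("pretokenized.parquet")
```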

For advanced data preprocessing support, please see [this document](./docs/advanced-data-preprocessing.md).

## Supported Models

- For each tuning technique, we run testing on a single large model of each architecture type and claim support for the smaller models. For example, with the QLoRA technique, we tested on granite-34b GPTBigCode and claim support for granite-20b-multilingual.
1 change: 1 addition & 0 deletions docs/advanced-data-preprocessing.md
# Advanced Data Processing
