## Summary and Objective

The primary objective of the `DataPreProcessor` design for fms-hf-tuning is to provide a unified yet powerful interface for handling diverse data formats and configurations.
This interface should cater to various user expertise levels, enabling basic users to easily load and process data, while allowing advanced users to customize their data handling.

### Key Goals:
1. **Broad Data Format Support**: Allow datasets in formats such as Arrow, Parquet, and CSV.
1. **Compatibility with Multiple Datasets and Files**: Enable multiple files per dataset and interleaving or mixing of datasets.
1. **Support for Different Data Modalities**: Include images, audio, and text data, along with modality-specific preprocessing options.
1. **User-Focused Configurations**: Provide simple data loading for regular users, while enabling advanced configurations for expert users.
1. **Template-Based Preprocessing**: Support Jinja template rendering, where necessary, for template-dependent preprocessing requirements.

### Motivation

The main motivation for this ADR stems from the fact that fms-hf-tuning is being used by many teams for a diverse set of use cases, many of which are not currently supported by the library.

In our library, data preprocessing currently takes two primary arguments, `training_data_path` and `validation_data_path`, each of which accepts a single file location for a dataset. The library currently supports only JSON data, but can handle both pre-tokenised and non-tokenised data by performing `input` masking and custom data formatting.

The first motivation for a change is requests from users for multiple datasets, and even multiple data files per dataset. There are also teams training with Parquet and Arrow format data, so support for additional data formats is needed in the code.
A further requirement from teams is a way to interleave datasets at run time by specifying static weights to mix different datasets, which is also not yet supported by the code. Finally, other requirements are preprocessing support for multiple modalities of data (starting with images first) and support for advanced preprocessing like Jinja-based template rendering of the dataset before consumption.

All these requirements are new and are currently not supported by the library, which motivated us to propose a change in the design of data preprocessing in this library to incorporate these, and potentially any new changes, in one go.

### User Benefit

Users will benefit from the additional argument which allows them to pass a single `data_config` file specifying how to preprocess their dataset.
Our data config file will give users the ability to:
1. Pass multiple data files and multiple datasets.
1. Specify static weights in the configuration to interleave datasets.
1. Define preprocessing routines to apply to the data, and the order in which to apply them.

This will make the process of handling custom datasets, which might require rendering a Jinja template or processing image data, much easier.

We do not require users to learn the specification of the additional `data_config` file: the existing arguments for processing datasets, [`tuning.config.configs.DataArguments`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/398c2a8fe26d734344240555585d95e05299faa8/tuning/config/configs.py#L67), will not be deprecated in this version, and users can keep using the same data arguments for the use cases already served by the library.

## Decision


### Our considerations for the design here are

1. Allow advanced users to use the full power of the HuggingFace library as much as possible without recreating the same functionality.
1. Allow advanced users to specify custom data preprocessor pipeline in an easy way.
1. Ensure the single design can handle these and many more use cases without major changes.
1. Design for advanced users while simplifying for simple users.
An example `data_config` spec:

```
dataprocessor:
  type: default
datasets:
  - name: dataset1
    data_paths:
      - /data/stackoverflow-kubernetes_posts
      - /data/stackoverflow-openshift_posts
    data_handlers:
      - name: tokenize
        arguments:
          remove_columns: all
          batched: false
  - name: dataset2
    sampling:
      ratio: 0.4
    data_paths:
      - <path-to-dataset2-files>
    data_handlers:
      - name: render_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
      - name: tokenize
        arguments:
          remove_columns: all
          batched: false
```

To reiterate, our goal here is not to reimplement the functionality provided by HuggingFace but rather to offer a clean config-based interface through which advanced users can use features like iterable datasets or interleaved datasets and perform custom preprocessing, like applying Jinja templates, in an easy way.

In this spec, at the top level we have the `dataprocessor` config, which contains just one field, `type`, set to `default`. This is done to ensure any future top-level `dataprocessor` configs will go into this block. Users need not touch or provide this, as the `default` is selected automatically.

The second block is where users list multiple `datasets`, and each dataset contains information on how to process it. We allow arguments like `sampling`, so users can specify sampling ratios when [interleaving datasets](https://huggingface.co/docs/datasets/en/process#interleave) via HuggingFace APIs like [`interleave_datasets`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.interleave_datasets).
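As a rough sketch of the underlying HF call (illustrative only: the dataset contents and the 0.6/0.4 split below are made up, and this is not the library's actual wiring), the `sampling` ratios could translate into the `probabilities` argument of `interleave_datasets`:

```
from datasets import Dataset, interleave_datasets

ds1 = Dataset.from_list([{"text": f"kubernetes post {i}"} for i in range(100)])
ds2 = Dataset.from_list([{"text": f"openshift post {i}"} for i in range(100)])

# Static sampling weights from the data_config become probabilities;
# a seed keeps the mixing reproducible across runs.
mixed = interleave_datasets([ds1, ds2], probabilities=[0.6, 0.4], seed=42)
print(mixed[0])
```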

The most powerful feature of this block is `data_handlers`. Here we allow users to specify a list of routines to apply to the dataset at preprocessing time. A `data_handler` is a [`map`](https://huggingface.co/docs/datasets/en/process#map) operation performed on the dataset, to which a user can pass further arguments. We expose the full set of arguments of the HF [`Dataset.map`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.map) operation to the user as `kwargs` of a handler.

As an example, in `dataset2` the data handler requests that a `render_template` function be applied before tokenization; it processes the dataset and renders the Jinja template specified as `fn_kwargs.jinja_template`. The rest of the arguments, like `remove_columns` and `batched`, are just HF `map` API arguments.

```
  - name: dataset2
    sampling:
      ratio: 0.4
    data_paths:
      - <path-to-dataset2-files>
    data_handlers:
      - name: render_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            jinja_template: "{<jinja-template>}"
      - name: tokenize
        arguments:
          remove_columns: all
          batched: false
```
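For illustration, a handler such as `render_template` could be an ordinary function applied through `Dataset.map`; the sketch below is an assumption about how such a handler might look (the function body, the `jinja2` dependency, and the output column name are ours, not the library's actual implementation):

```
from datasets import Dataset
from jinja2 import Template

def render_template(element, jinja_template):
    # Render the configured Jinja template against a single dataset row;
    # fn_kwargs from the data_config arrive here as keyword arguments.
    return {"text": Template(jinja_template).render(**element)}

ds = Dataset.from_list(
    [{"question": "What is a pod?", "answer": "The smallest deployable unit."}]
)
ds = ds.map(
    render_template,
    batched=False,
    fn_kwargs={"jinja_template": "Q: {{ question }}\nA: {{ answer }}"},
)
print(ds[0]["text"])
```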

By allowing users to specify data handlers like this, we let them use the full HuggingFace API and, at the same time, specify preprocessing routines in a fixed order. The handlers list specifies a [`DAG`](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of operations to apply to the dataset, and the code will execute them in that order.
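A minimal sketch of that ordered execution, assuming a simple registry keyed by handler name (the registry, handler names, and functions here are hypothetical, not the actual fms-hf-tuning internals):

```
from typing import Any, Callable, Dict, List
from datasets import Dataset

def lowercase_text(element):
    return {"text": element["text"].lower()}

def add_length(element):
    return {"length": len(element["text"])}

# Hypothetical registry: handler names from the YAML resolve to callables.
HANDLER_REGISTRY: Dict[str, Callable[..., Any]] = {
    "lowercase_text": lowercase_text,
    "add_length": add_length,
}

def apply_handlers(dataset: Dataset, handler_configs: List[dict]) -> Dataset:
    # Handlers run strictly in config order; each one is a Dataset.map call
    # whose `arguments` block becomes the map kwargs.
    for cfg in handler_configs:
        fn = HANDLER_REGISTRY[cfg["name"]]
        dataset = dataset.map(fn, **cfg.get("arguments", {}))
    return dataset

ds = Dataset.from_list([{"text": "Hello World"}])
ds = apply_handlers(ds, [
    {"name": "lowercase_text", "arguments": {"batched": False}},
    {"name": "add_length"},
])
print(ds[0])  # {'text': 'hello world', 'length': 11}
```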

Furthermore, this design is flexible enough to extend to any upcoming use case, because any operation to be executed on the dataset can be broken down into function executions implemented as data handlers.

This makes our spec a complete solution for advanced users of the library, allowing them to specify the full set of preprocessing operations to apply to the dataset via a config file.

Finally, with this spec we do not want to break the functionality for the simple users of the library. A simple user who just wants to use the library with a single dataset, like today, can pass the same dataset via the `--training_data_path <file> --validation_data_path <file>` arguments.

In fact, we do not change the behaviour currently supported by any of the `tuning.config.configs.DataArguments` arguments, hence allowing the simple users of the library to continue using it as is.

### Alternatives Considered

### Letting users process their own data and pass file(s) directly to this library

A simple alternative to avoid all this is to have users process their own data. This is also in line with the fact that most workloads contain preprocessed data, which simple users consume as is for their tuning/training.

The reason to have this design is that many users coming to this library have an advanced set of use cases. As stated in the motivation, we see ever-increasing demand from researchers who want features like Jinja template rendering, image data processing, and mixing and merging of datasets. While this can be done at the user level, most users do not want to write code for all this preprocessing, but rather to use tools which implement these tasks.
Leaving all users to write their own preprocessing logic can also lead to code duplication across many teams, which is something we want to avoid.

### Passing the datasets we take to the HuggingFace SFTTrainer API and letting it handle them without preprocessing at our end

Another alternative is to take the dataset input to this library and pass it directly to the trainer (`SFTTrainer` in our case), letting it handle loading and preprocessing the dataset.

[`SFTTrainer`](https://huggingface.co/docs/trl/v0.12.1/en/sft_trainer#trl.SFTTrainer) supports specifying `train_dataset` and `eval_dataset`, and for both of these it accepts iterable datasets along with regular datasets, allowing us to pass a large dataset via streaming.

Please note that even in this case users will need to tell us that the dataset is large and should be loaded with `streaming=True`, because the argument which tells HF to load the dataset in iterable mode or standard mode is passed to [`load_dataset`](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/loading_methods#datasets.load_dataset):

```
from datasets import load_dataset
train_ds = load_dataset('imdb', split='train', streaming=True)
```

Additionally, `SFTTrainer` has support for a [data formatting function](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support). Users can pass a `formatting_func` directly to `SFTTrainer`, which formats the dataset for them:

```
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['question'])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
        output_texts.append(text)
    return output_texts

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
)

trainer.train()
```
Taken from [HuggingFace docs](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support)

As our library is a wrapper on top of HF, we cannot directly allow users to pass a custom formatting function. However, our `data_handler` design can support formatting the dataset in a way similar to a `formatting_func`: users specify just the name of the handler, and we apply the formatting on our end. The `data_handler` design we have is a superset of this feature; it is more flexible and can support many more use cases.
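For instance, a hypothetical `apply_custom_formatting_template` handler (the handler name and template field here are illustrative, not part of the current spec) declared in the `data_config` could play the same role as the `formatting_func` above:

```
data_handlers:
  - name: apply_custom_formatting_template
    arguments:
      batched: false
      fn_kwargs:
        formatting_template: "### Question: {{ question }}\n ### Answer: {{ answer }}"
```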

## Consequences

### Understanding the spec
With this design we have tried to keep our spec simple and as close to the HF library as possible, e.g. exposing the same `map` `kwargs` that HF has in our `data_handlers`.

Despite this, advanced users will need to understand the spec to be able to write it properly.

Advanced users will also need to educate themselves on the data handlers already present in the code. Since data handlers are selected by name, we need to ensure the documentation contains complete information on which data handlers exist and how to use them in the `data_config`.

### Sharing config files
We currently do not propose anything on how advanced users share the `data_config` files created by them with Intermediate and Simple users. This is left outside the scope of our library.

### Simple User Perspective

    * Data handling support for streaming data
1. State 3:
    * Identify and add any other required predefined data handlers.
    * Phase out the old implementation in support of the new one.
