From 2beb5f6bdc34805673b31eaf002e2e65dac09db9 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Wed, 11 Sep 2024 14:55:06 -0700 Subject: [PATCH 01/13] oom best practices draft Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 42 ++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 1facfb736..1689d0ad4 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -16,4 +16,44 @@ Here are the methods in increasing order of compute required for them. #. `Language model labelling `_ - Language models can be used to label text as high quality or low quality. NeMo Curator allows you to connect to arbitrary LLM inference endpoints which you can use to label your data. One example of such an endpoint would be Nemotron-4 340B Instruct on `build.nvidia.com `_. Due to their size, these models can require a lot of compute and are usually infeasible to run across an entire pretraining dataset. We recommend using these large models on very little amounts of data. Fine-tuning datasets can make good use of them. -#. `Reward model labelling `_ - Unlike the previous methods, reward models label the quality of conversations between a user and an assistant instead of labelling the quality of a document. In addition, models (like `Nemotron-4 340B Reward `_) may output multiple scores covering different categories. Like LLM labelling, NeMo Curator can connect to arbitrary reward models hosted as an external service. Due to these differences and their large size, we recommend using reward models when filtering fine-tuning data. In particular, synthetic data filtering is a good use of them. \ No newline at end of file +#. `Reward model labelling `_ - Unlike the previous methods, reward models label the quality of conversations between a user and an assistant instead of labelling the quality of a document. In addition, models (like `Nemotron-4 340B Reward `_) may output multiple scores covering different categories. Like LLM labelling, NeMo Curator can connect to arbitrary reward models hosted as an external service. Due to these differences and their large size, we recommend using reward models when filtering fine-tuning data. In particular, synthetic data filtering is a good use of them. + +------------------------------------------- +Handling Out-of-Memory (OOM) Errors +------------------------------------------- +NeMo Curator is designed to be scalable with large amounts of text data, but OOM errors occur when the available GPU memory is insufficient for a given task. +To help avoid these issues and ensure efficient processing, here are some strategies for managing memory usage and mitigating OOM challenges. + +Add More GPUs +~~~~~~~~~~~~~ +If possible, scale your system by adding more GPUs. +This provides additional VRAM (Video Random Access Memory), which is crucial for holding datasets and intermediate computations. +Thus, adding more GPUs allows you to distribute the workload, reducing the memory load on each GPU. + +Utilize RMM Options +~~~~~~~~~~~~~~~~~~~ +`RAPIDS Memory Manager (RMM) `_ is a package that enables you to allocate device memory in a highly configurable way. +Here are some features which can help optimize memory usage: + +* Enable asynchronous memory allocation: Use the ``--rmm-async`` flag to allow RMM to handle memory allocation more efficiently, by allocating and deallocating GPU memory asynchronously. 
+* Set a memory release threshold: For example, ``--rmm-release-threshold 50GB`` can help prevent holding onto excess memory, releasing unused memory when a certain limit is reached. Please keep in mind that using this flag may degrade performance slightly as RMM is busy releasing the unused memory. + +Estimate Total VRAM Requirements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Approximating how much VRAM is needed to process your dataset helps avoid running into OOMs. +When doing this, there are a couple factors to keep in mind: + +* GPU architecture, such as the NVIDIA A100 or the NVIDIA H100 GPU. +* Data quantification, such as by size (e.g., in GB or TB) or by number of tokens (usually billions of tokens). + +With these in mind, here are some general rules of thumb you can use to estimate your memory requirements: + +* TODO: Suggest approximate total VRAM needed to process N TB of data, per step in NeMo Curator pipeline + +Fuzzy Deduplication Guidelines +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. +Here are some suggestions for managing memory use during fuzzy deduplication: + +* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Reducing the number of buckets can help decrease the number of data points loaded into memory for comparison. However, a smaller bucket count can reduce the accuracy of deduplication, so it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``buckets_per_shuffle`` parameter when initializing your ``FuzzyDuplicatesConfig``. +* Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. From 7c361bead66b938439433b7d64ac161b3c91a4b7 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Wed, 11 Sep 2024 15:26:17 -0700 Subject: [PATCH 02/13] remove vram section Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 1689d0ad4..8f9dd792a 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -38,18 +38,6 @@ Here are some features which can help optimize memory usage: * Enable asynchronous memory allocation: Use the ``--rmm-async`` flag to allow RMM to handle memory allocation more efficiently, by allocating and deallocating GPU memory asynchronously. * Set a memory release threshold: For example, ``--rmm-release-threshold 50GB`` can help prevent holding onto excess memory, releasing unused memory when a certain limit is reached. Please keep in mind that using this flag may degrade performance slightly as RMM is busy releasing the unused memory. -Estimate Total VRAM Requirements -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Approximating how much VRAM is needed to process your dataset helps avoid running into OOMs. -When doing this, there are a couple factors to keep in mind: - -* GPU architecture, such as the NVIDIA A100 or the NVIDIA H100 GPU. -* Data quantification, such as by size (e.g., in GB or TB) or by number of tokens (usually billions of tokens). 
- -With these in mind, here are some general rules of thumb you can use to estimate your memory requirements: - -* TODO: Suggest approximate total VRAM needed to process N TB of data, per step in NeMo Curator pipeline - Fuzzy Deduplication Guidelines ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. From f94bed113a77beb1da49e64e009cc3ac225dca3f Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Wed, 18 Sep 2024 15:48:09 -0700 Subject: [PATCH 03/13] address ryan's review Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 8f9dd792a..568677e4e 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -19,29 +19,37 @@ Here are the methods in increasing order of compute required for them. #. `Reward model labelling `_ - Unlike the previous methods, reward models label the quality of conversations between a user and an assistant instead of labelling the quality of a document. In addition, models (like `Nemotron-4 340B Reward `_) may output multiple scores covering different categories. Like LLM labelling, NeMo Curator can connect to arbitrary reward models hosted as an external service. Due to these differences and their large size, we recommend using reward models when filtering fine-tuning data. In particular, synthetic data filtering is a good use of them. ------------------------------------------- -Handling Out-of-Memory (OOM) Errors +Handling GPU Out-of-Memory (OOM) Errors ------------------------------------------- NeMo Curator is designed to be scalable with large amounts of text data, but OOM errors occur when the available GPU memory is insufficient for a given task. To help avoid these issues and ensure efficient processing, here are some strategies for managing memory usage and mitigating OOM challenges. -Add More GPUs -~~~~~~~~~~~~~ -If possible, scale your system by adding more GPUs. -This provides additional VRAM (Video Random Access Memory), which is crucial for holding datasets and intermediate computations. -Thus, adding more GPUs allows you to distribute the workload, reducing the memory load on each GPU. - Utilize RMM Options ~~~~~~~~~~~~~~~~~~~ `RAPIDS Memory Manager (RMM) `_ is a package that enables you to allocate device memory in a highly configurable way. +The NeMo Curator team has found several of its features to be especially useful for fuzzy deduplication, notably the connected components step. Here are some features which can help optimize memory usage: * Enable asynchronous memory allocation: Use the ``--rmm-async`` flag to allow RMM to handle memory allocation more efficiently, by allocating and deallocating GPU memory asynchronously. * Set a memory release threshold: For example, ``--rmm-release-threshold 50GB`` can help prevent holding onto excess memory, releasing unused memory when a certain limit is reached. Please keep in mind that using this flag may degrade performance slightly as RMM is busy releasing the unused memory. +You can set these flags while calling your Python script directly, for example: + +.. 
code-block:: bash + + python your_script.py --rmm-async --rmm-release-threshold 50GB + Fuzzy Deduplication Guidelines ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. Here are some suggestions for managing memory use during fuzzy deduplication: -* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Reducing the number of buckets can help decrease the number of data points loaded into memory for comparison. However, a smaller bucket count can reduce the accuracy of deduplication, so it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``buckets_per_shuffle`` parameter when initializing your ``FuzzyDuplicatesConfig``. +* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. This leads to more accurate results as it reduces the number of false negatives. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. +* Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. * Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. + +Add More GPUs +~~~~~~~~~~~~~ +If possible, scale your system by adding more GPUs. +This provides additional VRAM (Video Random Access Memory), which is crucial for holding datasets and intermediate computations. +Thus, adding more GPUs allows you to distribute the workload, reducing the memory load on each GPU. From 8002a3e93c9d372ccdb27e73cda2ae5e7dc62c0b Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Tue, 24 Sep 2024 14:38:08 -0700 Subject: [PATCH 04/13] add other's suggestions Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 30 +++++++++++++++++++++++++++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 568677e4e..546dc01e9 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -33,11 +33,19 @@ Here are some features which can help optimize memory usage: * Enable asynchronous memory allocation: Use the ``--rmm-async`` flag to allow RMM to handle memory allocation more efficiently, by allocating and deallocating GPU memory asynchronously. * Set a memory release threshold: For example, ``--rmm-release-threshold 50GB`` can help prevent holding onto excess memory, releasing unused memory when a certain limit is reached. 
Please keep in mind that using this flag may degrade performance slightly as RMM is busy releasing the unused memory. -You can set these flags while calling your Python script directly, for example: +You can set these flags while initializing your own Dask client, for example: -.. code-block:: bash +.. code-block:: python - python your_script.py --rmm-async --rmm-release-threshold 50GB + from dask_cuda import LocalCUDACluster + from dask.distributed import Client + + cluster = LocalCUDACluster( + rmm_async=True, + rmm_release_threshold="50GB", + ) + + client = Client(cluster) Fuzzy Deduplication Guidelines ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -45,11 +53,27 @@ Fuzzy deduplication is one of the most computationally expensive algorithms with Here are some suggestions for managing memory use during fuzzy deduplication: * Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. This leads to more accurate results as it reduces the number of false negatives. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. + * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. * Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. * Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. + * When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. + * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. + +Using the ``get_client`` Function +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +For both GPU and CPU operations, we provide the ``get_client`` to initialize your Dask client with a ``LocalCUDACluster`` or ``LocalCluster``, respectively. 
While the NeMo Curator team has established default values for the parameters of the ``get_client`` function that are suitable for most scenarios, it is useful to understand these parameters and become familiar with them to ensure optimal performance and adherence to best practices when working with Dask configurations and setups.
+
+Please refer to the `distributed_utils.py `_ script for detailed docstrings, especially for the ``get_client`` function, and the ``start_dask_gpu_local_cluster`` and ``start_dask_cpu_local_cluster`` functions which are called by ``get_client``.

 Add More GPUs
 ~~~~~~~~~~~~~
 If possible, scale your system by adding more GPUs.
 This provides additional VRAM (Video Random Access Memory), which is crucial for holding datasets and intermediate computations.
 Thus, adding more GPUs allows you to distribute the workload, reducing the memory load on each GPU.
+
+.. TODO: Share rough dataset sizes and how many GPUs we've been able to run this on internally; that can give a sense of the requirements.
+
+Report GPU Memory and Utilization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When debugging your GPU memory errors, it can be useful to capture and understand your GPU usage per step in the NeMo Curator pipeline. For this, using RMM's `Memory statistics and profiling `_ can be helpful. You may also refer to `this article `_, for a general tutorial including how to monitor GPUs with a dashboard.

From 79163c070b26f0d9cab163b3aeabadb749622395 Mon Sep 17 00:00:00 2001
From: Sarah Yurick
Date: Wed, 25 Sep 2024 13:32:29 -0700
Subject: [PATCH 05/13] fix indents

Signed-off-by: Sarah Yurick
---
 docs/user-guide/bestpractices.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst
index 546dc01e9..c1f81c561 100644
--- a/docs/user-guide/bestpractices.rst
+++ b/docs/user-guide/bestpractices.rst
@@ -53,11 +53,11 @@ Fuzzy deduplication is one of the most computationally expensive algorithms with
 Here are some suggestions for managing memory use during fuzzy deduplication:

 * Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. This leads to more accurate results as it reduces the number of false negatives. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``.
-  * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for.
+ * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. * Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. * Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. - * When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. - * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. + * When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. + * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. Using the ``get_client`` Function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From eef02c0b1dd4bf4e8318bdede69105052c54f6aa Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 13:30:35 -0700 Subject: [PATCH 06/13] revise dask dashboard paragraph Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index c1f81c561..437db11a2 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -76,4 +76,6 @@ Thus, adding more GPUs allows you to distribute the workload, reducing the memor Report GPU Memory and Utilization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When debugging your GPU memory errors, it can be useful to capture and understand your GPU usage per step in the NeMo Curator pipeline. For this, using RMM's `Memory stastics and profiling `_ can be helpful. You may also refer to `this article `_, for a general tutorial including how to monitor GPUs with a dashboard. +When debugging your GPU memory errors, it can be useful to capture and understand your GPU usage per step in the NeMo Curator pipeline. +The `Dask dashboard `_ is a good starting point to view GPU utilization and memory at a high level. +You may also refer to `this article `_, for a more in-depth tutorial including how to monitor GPUs with a dashboard. 
From 8815aa9573d53539915276a67828eca485bf2b6b Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 14:23:42 -0700 Subject: [PATCH 07/13] add ryan's comments Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 437db11a2..0eb284248 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -52,7 +52,7 @@ Fuzzy Deduplication Guidelines Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. Here are some suggestions for managing memory use during fuzzy deduplication: -* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. This leads to more accurate results as it reduces the number of false negatives. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. +* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. * Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. * Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. 
@@ -61,9 +61,11 @@ Here are some suggestions for managing memory use during fuzzy deduplication: Using the ``get_client`` Function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -For both GPU and CPU operations, we provide the ``get_client`` to initialize your Dask client with a ``LocalCUDACluster`` or ``LocalCluster``, respectively. While the NeMo Curator team has established default values for the parameters of the ``get_client`` function that are suitable for most scenarios, it is useful to understand these parameters and become familiar with them to ensure optimal performance and adherence to best practices when working with Dask configurations and setups. +For both GPU and CPU operations, we provide the ``get_client`` to initialize your Dask client with a ``LocalCUDACluster`` or ``LocalCluster``, respectively. +While the NeMo Curator team has established default values for the parameters of the ``get_client`` function that are suitable for most scenarios, it is useful to understand these parameters and become familiar with them to ensure optimal performance and adherence to best practices when working with Dask configurations and setups. -Please refer to the `distributed_utils.py `_ script for detailed docstrings, especially for the ``get_client`` function, and the ``start_dask_gpu_local_cluster`` and ``start_dask_cpu_local_cluster`` functions which are called by ``get_client``. +Please refer to the API documentation `Dask Cluster Functions `_ for more details about the ``get_client`` function parameters. +You may also refer to the `distributed_utils.py `_ script for the actual function implementations, including the ``start_dask_gpu_local_cluster`` and ``start_dask_cpu_local_cluster`` functions which are called by ``get_client``. Add More GPUs ~~~~~~~~~~~~~ From 479ef70f919893cce1da92d8f87e0b25622b0952 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 14:43:05 -0700 Subject: [PATCH 08/13] dash bullet points Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 0eb284248..d98d01ae1 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -52,12 +52,12 @@ Fuzzy Deduplication Guidelines Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. Here are some suggestions for managing memory use during fuzzy deduplication: -* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. - * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. 
Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. -* Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. -* Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. - * When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. - * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. +- Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. + - The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. +- Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. +- Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. 
When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. + - When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. + - For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. Using the ``get_client`` Function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From d08d45b5f03905932fe8211c1a5d3f32b4f43be4 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 14:44:58 -0700 Subject: [PATCH 09/13] spacing? Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index d98d01ae1..6cbda1047 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -53,10 +53,15 @@ Fuzzy deduplication is one of the most computationally expensive algorithms with Here are some suggestions for managing memory use during fuzzy deduplication: - Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. + - The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. + - Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. + - Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. + - When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. 
+ - For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. Using the ``get_client`` Function From ef0dfd0413c39684bbb978ee911fdc2ae9aba2ea Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 14:54:41 -0700 Subject: [PATCH 10/13] edit bullet pts Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 6cbda1047..e86db8efc 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -52,17 +52,15 @@ Fuzzy Deduplication Guidelines Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. Here are some suggestions for managing memory use during fuzzy deduplication: -- Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. +* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. - - The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. + * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. 
On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. -- Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. +* Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. +* Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. -- Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. - - - When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. - - - For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. + * When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. + * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. Using the ``get_client`` Function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From f7eeca07b98eedfeec4f5e67bc7f2b0c23a4f194 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 15:01:46 -0700 Subject: [PATCH 11/13] hyphens Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index e86db8efc..3bf13c702 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -52,12 +52,12 @@ Fuzzy Deduplication Guidelines Fuzzy deduplication is one of the most computationally expensive algorithms within the NeMo Curator pipeline. Here are some suggestions for managing memory use during fuzzy deduplication: -* Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. 
However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. +- Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. -* Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. -* Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. +- Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. +- Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. * When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. 
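+
+For illustration only, here is a rough sketch of how the parameters above fit together (the paths and values are placeholders to experiment with, and the exact fields accepted by ``FuzzyDuplicatesConfig`` and ``DocumentDataset.read_json`` may differ slightly between NeMo Curator versions):
+
+.. code-block:: python
+
+    from nemo_curator import FuzzyDuplicatesConfig
+    from nemo_curator.datasets import DocumentDataset
+
+    # Read the data in smaller chunks; tune files_per_partition so that
+    # each partition stays at roughly 2GB or less.
+    dataset = DocumentDataset.read_json(
+        "/path/to/jsonl/",
+        backend="cudf",
+        files_per_partition=20,
+    )
+
+    # Experiment with the bucketing parameters discussed above to balance
+    # memory usage, runtime, and deduplication accuracy.
+    config = FuzzyDuplicatesConfig(
+        cache_dir="/path/to/dedup_cache",
+        id_field="id",
+        text_field="text",
+        num_buckets=20,
+        hashes_per_bucket=13,
+        buckets_per_shuffle=1,
+    )
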
From a8c87bf9ee4cd9ae8d6512636179eeafccf042ee Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 15:02:47 -0700 Subject: [PATCH 12/13] hyphens Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 3bf13c702..0682425b3 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -54,13 +54,13 @@ Here are some suggestions for managing memory use during fuzzy deduplication: - Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. - * The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. + - The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. - Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. - Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. - * When reading your data, we suggest aiming to create partitions no larger than 2GB. 
For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. - * For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. + - When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. + - For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. Using the ``get_client`` Function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From 4ac39393ad439fa4d255fb48ac55a2371595b20d Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 30 Sep 2024 15:04:36 -0700 Subject: [PATCH 13/13] 2 spaces Signed-off-by: Sarah Yurick --- docs/user-guide/bestpractices.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user-guide/bestpractices.rst b/docs/user-guide/bestpractices.rst index 0682425b3..533777b24 100644 --- a/docs/user-guide/bestpractices.rst +++ b/docs/user-guide/bestpractices.rst @@ -54,13 +54,13 @@ Here are some suggestions for managing memory use during fuzzy deduplication: - Reduce bucket counts: During deduplication, the data is grouped into buckets to compare and identify near-duplicate documents. Increasing the number of buckets increases the probability that two documents within a given Jaccard similarity score are marked as duplicates. However, increasing the number of buckets also increases the memory requirements from the increased number of hashes it needs to store. Thus, it is important to find an optimal balance between memory usage and deduplication accuracy. You can experiment with this by using the ``num_buckets`` parameter when initializing your ``FuzzyDuplicatesConfig``. - - The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. + - The user may also need to change the ``hashes_per_bucket`` parameter to match the same Jaccard threshold being aimed for. Think of it like this: with a high ``num_buckets`` and low ``hashes_per_bucket``, the hashes of a string will be spread out across many buckets, which reduces the chances of dissimilar strings being hashed into the same bucket, but increases the chances of similar strings being hashed into different buckets. On the other hand, with a low ``num_buckets`` and high ``hashes_per_bucket``, the hashes will be more densely packed into a smaller number of buckets, which not only increases the likelihood of similar strings sharing buckets, but also increases the chances of dissimilar strings being hashed into the same bucket. 
- Reduce buckets per shuffle: Because duplicates are still considered bucket by bucket, reducing the ``buckets_per_shuffle`` parameter in the ``FuzzyDuplicatesConfig`` does not affect accuracy. Instead, reducing the buckets per shuffle helps lower the amount of data being transferred between GPUs. However, using a lower ``buckets_per_shuffle`` will increase the time it takes to process the data. - Adjust files per partition: Processing large datasets in smaller chunks can help reduce the memory load. When reading data with ``DocumentDataset.read_json`` or ``DocumentDataset.read_parquet``, start with a smaller ``files_per_partition`` value and increase as needed. - - When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. - - For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. + - When reading your data, we suggest aiming to create partitions no larger than 2GB. For example, if you know each file is ~100MB, then setting ``files_per_partition=20`` would result in partitions that are about 2GB in size. + - For other suggestions on best practices regarding reading data with Dask, please refer to `Dask cuDF Best Practices `_. Using the ``get_client`` Function ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~