Commit
Fix typos and make some texts more explicit (#62)
* fix typos
1 parent
138abaf
commit 901766c
Showing 31 changed files with 171 additions and 150 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,8 @@ | ||
(sec-dask-ml-preprocessing)= | ||
# Data Preprocessing | ||
|
||
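As we mentioned in {numref}`sec-data-science-lifecycle`, the focus of data science work is understanding and processing data. Dask can scale many single-machine tasks out to a cluster and can be combined with libraries from the Python community, such as data visualization libraries, to carry out exploratory data analysis. | ||
As noted in {numref}`sec-data-science-lifecycle`, the core of data science work lies in understanding and processing data. Dask can scale many single-machine tasks to run on a cluster, and it can be combined with data visualization and other libraries from the Python community to carry out exploratory data analysis. | ||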
|
||
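Distributed data preprocessing relies mostly on the capabilities of Dask DataFrame and Dask Array, so we do not repeat them here. | ||
Distributed data preprocessing relies mostly on the functionality of Dask DataFrame and Dask Array, which is not covered again here. | ||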
|
||
For feature engineering, Dask-ML implements many of the `sklearn.preprocessing` APIs, such as [`MinMaxScaler`](https://ml.dask.org/modules/generated/dask_ml.preprocessing.MinMaxScaler.html). For Dask, what differs slightly is one-hot encoding: at the time of writing, Dask uses [`DummyEncoder`](https://ml.dask.org/modules/generated/dask_ml.preprocessing.DummyEncoder.html) to one-hot encode categorical features, and `DummyEncoder` is the Dask replacement for scikit-learn's `OneHotEncoder`. We will in {numref}`sec-dask-ml-hyperparameter` will show a case with categorical features. | ||
For feature engineering, Dask-ML implements many of the `sklearn.preprocessing` APIs, such as [`MinMaxScaler`](https://ml.dask.org/modules/generated/dask_ml.preprocessing.MinMaxScaler.html). One place where Dask differs slightly is its implementation of one-hot encoding. As of this writing, Dask uses [`DummyEncoder`](https://ml.dask.org/modules/generated/dask_ml.preprocessing.DummyEncoder.html) to one-hot encode categorical features, and `DummyEncoder` is the Dask replacement for scikit-learn's `OneHotEncoder`. We will show an example involving categorical features in {numref}`sec-dask-ml-hyperparameter`. |
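
To make the two transformations named in this hunk concrete, here is a minimal, dependency-free sketch of what min-max scaling and dummy (one-hot) encoding compute on a single column. It is not the Dask-ML implementation: the helper names `min_max_scale` and `dummy_encode` are hypothetical, and the handling of a constant column is this sketch's own choice. In Dask-ML the same ideas are applied to partitioned Dask collections through `MinMaxScaler` and `DummyEncoder`.

```python
def min_max_scale(values):
    """Linearly rescale a column of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: this sketch maps every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dummy_encode(values):
    """One-hot encode a column: one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Illustrative data only.
scaled = min_max_scale([10.0, 15.0, 20.0])
categories, encoded = dummy_encode(["red", "green", "red"])

print(scaled)      # [0.0, 0.5, 1.0]
print(categories)  # ['green', 'red']
print(encoded)     # [[0, 1], [1, 0], [0, 1]]
```

Both operations need a global statistic first (the column minimum and maximum, or the set of categories), which is the part a distributed library has to compute across partitions before transforming each one.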
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,12 @@ | ||
# MPI and Large Models | ||
|
||
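This chapter mainly explains parallelization methods for large models. A large model is one whose neural network has a large number of parameters, so training and inference must be done in parallel. Large-model parallelism has the following characteristics: | ||
This chapter mainly explains parallelization methods for large models. By large models we mean neural networks with an enormous number of parameters, which must be trained and served for inference in parallel. Large-model parallelism has the following characteristics: | ||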
|
||
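* Computation runs on accelerators such as GPUs; | ||
* Accelerators are very expensive, so their utilization should be kept as high as possible; | ||
* The models have many parameters, so in both training and inference large amounts of data may need to be transferred between accelerators, which places high demands on both bandwidth and latency. | ||
* Computation runs on accelerators such as GPUs, hardware designed specifically to improve computational efficiency. | ||
* Accelerators are very costly, so every effort should be made to raise their utilization and secure a return on the investment. | ||
* Because the models have an enormous number of parameters, large amounts of data may need to be transferred between accelerators during training or inference, which requires high bandwidth and low latency to stay efficient. | ||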
|
||
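This chapter mainly explains the concepts and principles; for concrete implementations, refer to other papers and open-source libraries. | ||
This chapter explains the concepts and principles in detail; for concrete implementation details, refer to other academic papers and open-source libraries. | ||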
|
||
```{tableofcontents} | ||
``` |
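
The bandwidth point in the list above can be made concrete with a back-of-the-envelope calculation. The sketch below estimates how long it takes to move one full copy of a model's parameters over a single link. All numbers (7 billion parameters, 2 bytes per parameter, a 100 GB/s link) are hypothetical values chosen for illustration, the function name `transfer_seconds` is made up for this example, and latency and protocol overhead are ignored.

```python
def transfer_seconds(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Idealized time to send one full copy of the parameters over a link."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / bandwidth_bytes_per_s


# Hypothetical figures, for illustration only.
num_params = 7_000_000_000      # assumed parameter count
bytes_per_param = 2             # assumed 16-bit storage per parameter
link_bandwidth = 100 * 10**9    # assumed 100 GB/s link, in bytes per second

payload_gb = num_params * bytes_per_param / 10**9
seconds = transfer_seconds(num_params, bytes_per_param, link_bandwidth)

print(payload_gb)  # 14.0
print(seconds)     # 0.14
```

Under these assumptions a single exchange of the full parameter set costs a noticeable fraction of a second, and it recurs every time the data must be exchanged, which is why the interconnect matters as much as the accelerators themselves.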