# Add Hopsworks platform #5

Open · wants to merge 2 commits into `master`
README.md: 26 changes (22 additions, 4 deletions)
@@ -55,17 +55,30 @@ by active learning (by developers of Spacy), text and image
(Feature extraction can be computationally expensive and nearly impossible to scale, hence re-using features across different models and teams is key to high-performance ML teams.)
* [FEAST](https://github.com/gojek/feast) (Google cloud, Open Source)
* [Michelangelo Palette](https://eng.uber.com/michelangelo/) (Uber)
* [Hopsworks](https://github.com/logicalclocks/hopsworks) (Hopsworks, Open Source)
* Suggestion: At training time, copy data into a local or networked **filesystem** (NFS). <sup>[1](#fsdl)</sup>

### 1.4. File Formats
* Different file formats may be used at different stages of the ML pipeline, as shown in the figure below:
<p align="center">
<img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/file-formats.png" title="" width="95%" height="95%">
</p>

* Feature engineering may be performed on legacy tabular file formats such as .csv, or on modern columnar formats such as .parquet and .orc. Nested file formats such as .json and .avro are used less often.

* Models are typically trained on data read from files, and different frameworks have different native file formats: TensorFlow favors .tfrecords, PyTorch favors .npy, and Scikit-Learn favors .csv files. Uber released Petastorm, a library and columnar (Parquet-based) data format with native readers for TensorFlow/Keras and PyTorch.

* File formats for model serving include .pb (TensorFlow), .onnx (framework-independent), .pkl (pickled Python objects, used by Scikit-Learn), and legacy formats such as .pmml.
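
A minimal sketch of how formats change along the pipeline (the file names and columns below are hypothetical, and `to_parquet` assumes `pyarrow` or `fastparquet` is installed):

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Feature engineering: ingest a legacy .csv, persist reusable features as columnar .parquet
df = pd.read_csv("raw_events.csv")              # hypothetical input file
df["ctr"] = df["clicks"] / df["impressions"]    # hypothetical engineered feature
df.to_parquet("features.parquet")

# Training: Scikit-Learn consumes the tabular features directly
model = LogisticRegression().fit(df[["ctr"]], df["label"])

# Serving: persist the model as a pickled Python object (.pkl)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```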

### 1.5. Data Versioning
* It's a "MUST" for deployed ML models:
**Deployed ML models are part code, part data**. <sup>[1](#fsdl)</sup> No data versioning means no model versioning.
* Data versioning platforms:
* [DVC](https://dvc.org/): Open source version control system for ML projects
* [Pachyderm](https://www.pachyderm.com/): version control for data
* [Dolt](https://www.liquidata.co/): version control for SQL databases
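
As a hedged sketch of the idea with DVC's Python API (the repo URL, file path, and tag below are placeholders), pinning training data to an exact revision makes the model version imply the data version:

```python
import dvc.api

# Open one exact, versioned revision of the training data;
# `rev` can be any Git tag or commit in the (placeholder) repo.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2",
) as f:
    train_csv = f.read()
```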

### 1.6. Data Processing
* Training data for production models may come from different sources, including *stored data in databases and object stores*, *log processing*, and *outputs of other classifiers*.
* There are dependencies between tasks: each needs to be kicked off after its dependencies are finished. For example, training on new log data requires a preprocessing step before training.
* Makefiles are not scalable; "workflow managers" become essential in this regard (see the sketch below).
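
For instance, a workflow manager such as Apache Airflow can express the "preprocess before train" dependency declaratively. This is an illustrative sketch with placeholder task bodies, not a production DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def preprocess_logs():
    ...  # placeholder: parse new log data into training features

def train_model():
    ...  # placeholder: train on the preprocessed features

with DAG(
    dag_id="training_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_logs)
    train = PythonOperator(task_id="train", python_callable=train_model)
    preprocess >> train  # train is kicked off only after preprocessing finishes
```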
@@ -133,6 +146,7 @@ by active learning (by developers of Spacy), text and image
* [Comet](https://www.comet.ml/): lets you track code, experiments, and results on ML projects
* [Weights & Biases](https://www.wandb.com/): Record and visualize every detail of your research with easy collaboration
* [MLFlow Tracking](https://www.mlflow.org/docs/latest/tracking.html#tracking): for logging parameters, code versions, metrics, and output files as well as visualization of the results.
* [Hopsworks Experiments](https://github.com/logicalclocks/hopsworks): for logging hyperparameters, results, notebooks, datasets/features used for training, and any output files/images.
* Automatic experiment tracking with one line of code in Python
* Side-by-side comparison of experiments
* Hyperparameter tuning
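
A minimal sketch of the MLflow Tracking API listed above (the parameter, metric values, and artifact path are made up):

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)        # hyperparameters
    for epoch in range(3):
        mlflow.log_metric("val_loss", 1.0 / (epoch + 1), step=epoch)
    mlflow.log_artifact("model.pkl")               # output file (assumed to exist)
```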
@@ -144,6 +158,7 @@ by active learning (by developers of Spacy), text and image
* Random search
* Bayesian optimization
* HyperBand
* Asynchronous Successive Halving
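
To make the simplest of these concrete, here is a framework-free random-search sketch over a hypothetical objective (HyperBand and successive-halving methods additionally stop poorly performing trials early):

```python
import random

def objective(lr, batch_size):
    # Placeholder for a real validation score returned after training a model.
    return -(lr - 0.01) ** 2 - (batch_size - 64) ** 2 / 1e4

trials = [
    {"lr": 10 ** random.uniform(-4, -1), "batch_size": random.choice([16, 32, 64, 128])}
    for _ in range(20)
]
best = max(trials, key=lambda cfg: objective(**cfg))
print("best config:", best)
```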

* Platforms:
* [Katib](https://github.com/kubeflow/katib): Kubernetes-native system for hyperparameter tuning and neural architecture search, inspired by [Google Vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf); supports multiple ML/DL frameworks (e.g. TensorFlow, MXNet, and PyTorch).
@@ -152,14 +167,16 @@ by active learning (by developers of Spacy), text and image
* [Ray-Tune](https://github.com/ray-project/ray/tree/master/python/ray/tune): A scalable research platform for distributed model selection (with a focus on deep learning and deep reinforcement learning)
* [Sweeps](https://docs.wandb.com/library/sweeps) from [Weights & Biases](https://www.wandb.com/): Parameters are not explicitly specified by the developer; instead, they are approximated and learned by a machine learning model.
* [Keras Tuner](https://github.com/keras-team/keras-tuner): A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0.

* [Maggy](https://github.com/logicalclocks/maggy): An asynchronous parallel hyperparameter tuning framework for TensorFlow/Keras, built on PySpark.
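
As an illustrative sketch of HyperBand via Keras Tuner from the list above (the architecture and search bounds are invented for the example):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Float("lr", 1e-4, 1e-2, sampling="log")),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.Hyperband(build_model, objective="val_accuracy", max_epochs=10)
# tuner.search(x_train, y_train, validation_split=0.2)  # data not shown here
```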

### 2.6. Distributed Training
* Data parallelism: use it when iteration time is too long (supported by both TensorFlow and PyTorch)
* Model parallelism: use it when the model does not fit on a single GPU
* Other solutions:
* Ray
* Horovod

* [TensorFlow CollectiveAllReduce on PySpark](https://www.logicalclocks.com/blog/goodbye-horovod-hello-collectiveallreduce)
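
A minimal data-parallelism sketch using TensorFlow's `MirroredStrategy` (model and data are placeholders): each GPU holds a model replica, processes a slice of every batch, and gradients are all-reduced before each update.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()       # one replica per local GPU
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                            # variables created here are mirrored
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset)  # each step averages gradients across all replicas
```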

## 3. Troubleshooting [TBD]

## 4. Testing and Deployment
@@ -231,6 +248,7 @@ Machine Learning production software requires a more diverse set of test suites
* Catching service and data regressions
* Cloud providers' solutions are decent
* [Kiali](https://kiali.io/): an observability console for Istio with service mesh configuration capabilities. It answers these questions: How are the microservices connected? How are they performing?
* [Hopsworks](https://github.com/logicalclocks/hopsworks): online models have their predictions logged to a Kafka topic, and a Spark Streaming or Flink application monitors the model for concept drift, data drift, model drift, and other anomalies (see the sketch below).
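
As a hedged illustration of this logging-to-Kafka pattern (the topic, broker address, message schema, and the naive mean-shift check are all hypothetical, not Hopsworks' actual implementation):

```python
import json
from collections import deque

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer("model-predictions", bootstrap_servers="localhost:9092")
window = deque(maxlen=1000)     # sliding window of recent prediction scores
BASELINE_MEAN = 0.42            # hypothetical mean score observed at training time

for msg in consumer:
    window.append(json.loads(msg.value)["score"])
    mean = sum(window) / len(window)
    if len(window) == window.maxlen and abs(mean - BASELINE_MEAN) > 0.1:
        print(f"possible drift: live mean {mean:.2f} vs. baseline {BASELINE_MEAN}")
```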

#### Are we done?
<p align="center">
Binary file added images/file-formats.png