diff --git a/README.md b/README.md
index b530cc8..b79a89a 100644
--- a/README.md
+++ b/README.md
@@ -55,9 +55,22 @@ by active learning (by developers of Spacy), text and image
 (Feature extraction could be computationally expensive and nearly impossible to scale, hence re-using features by different models and teams is a key to high performance ML teams).
  * [FEAST](https://github.com/gojek/feast) (Google cloud, Open Source)
  * [Michelangelo Palette](https://eng.uber.com/michelangelo/) (Uber)
+ * [Hopsworks](https://github.com/logicalclocks/hopsworks) (Hopsworks, Open Source)
 * Suggestion: At training time, copy data into a local or networked **filesystem** (NFS). [1](#fsdl)
-### 1.4. Data Versioning
+### 1.4. File Formats
+* Different file formats may be used along different parts of the ML Pipeline, as shown in the figure below:
+
+ +
+
+* Feature engineering may be performed on legacy tabular file formats like .csv, or modern columnar formats like .parquet, .orc. Nested file formats like .json, .avro are less often used.
+
+* Models are typically trained with data in files, and different frameworks have different native file formats. TensorFlow favors .tfrecords, PyTorch favors .npy, Scikit-Learn favors .csv files. Uber released .petastorm as a columnar file format with native readers for TensorFlow/Keras and PyTorch.
+
+* File formats for model serving include: .pb (TensorFlow), .onnx (framework independent), .pkl (Scikit-Learn - pickled python objects), and legacy formats such as .pmml.
+
+### 1.5. Data Versioning
 * It's a "MUST" for deployed ML models: **Deployed ML models are part code, part data**. [1](#fsdl) No data versioning means no model versioning.
 * Data versioning platforms:
@@ -65,7 +78,7 @@ by active learning (by developers of Spacy), text and image
  * [Pachyderm](https://www.pachyderm.com/): version control for data
  * [Dolt](https://www.liquidata.co/): versioning for SQL database
-### 1.5. Data Processing
+### 1.6. Data Processing
 * Training data for production models may come from different sources, including *Stored data in db and object stores*, *log processing*, and *outputs of other classifiers*.
 * There are dependencies between tasks, each needs to be kicked off after its dependencies are finished. For example, training on new log data, requires a preprocessing step before training.
 * Makefiles are not scalable. "Workflow manager"s become pretty essential in this regard.
@@ -133,6 +146,7 @@ by active learning (by developers of Spacy), text and image
  * [Comet](https://www.comet.ml/): lets you track code, experiments, and results on ML projects
  * [Weights & Biases](https://www.wandb.com/): Record and visualize every detail of your research with easy collaboration
  * [MLFlow Tracking](https://www.mlflow.org/docs/latest/tracking.html#tracking): for logging parameters, code versions, metrics, and output files as well as visualization of the results.
+ * [Hopsworks Experiments](https://github.com/logicalclocks/hopsworks): for logging hyperparameters, results, notebooks, datasets/features used for training, and any output files/images.
 * Automatic experiment tracking with one line of code in python
 * Side by side comparison of experiments
 * Hyper parameter tuning
@@ -144,6 +158,7 @@ by active learning (by developers of Spacy), text and image
  * Random search
  * Bayesian optimization
  * HyperBand
+ * Asynchronous Successive Halving
 * Platforms:
  * [Katib](https://github.com/kubeflow/katib): Kubernete's Native System for Hyperparameter Tuning and Neural Architecture Search, inspired by [Google vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf) and supports multiple ML/DL frameworks (e.g. TensorFlow, MXNet, and PyTorch).
@@ -152,14 +167,16 @@ by active learning (by developers of Spacy), text and image
  * [Ray-Tune](https://github.com/ray-project/ray/tree/master/python/ray/tune): A scalable research platform for distributed model selection (with a focus on deep learning and deep reinforcement learning)
  * [Sweeps](https://docs.wandb.com/library/sweeps) from [Weights & Biases](https://www.wandb.com/): Parameters are not explicitly specified by a developer. Instead they are approximated and learned by a machine learning model.
  * [Keras Tuner](https://github.com/keras-team/keras-tuner): A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0.
-
+ * [Maggy](https://github.com/logicalclocks/maggy): An asynchronous parallel hyperparameter tuning framework for TensorFlow/Keras, built on PySpark.
+
 ### 2.6. Distributed Training
 * Data parallelism: Use it when iteration time is too long (both tensorflow and PyTorch support)
 * Model parallelism: when model does not fit on a single GPU
 * Other solutions:
  * Ray
  * Horovod
-
+ * [TensorFlow CollectiveAllReduce on PySpark](https://www.logicalclocks.com/blog/goodbye-horovod-hello-collectiveallreduce)
+
 ## 3. Troubleshooting [TBD]
 ## 4. Testing and Deployment
@@ -231,6 +248,7 @@ Machine Learning production software requires a more diverse set of test suites
 * Catching service and data regressions
 * Cloud providers solutions are decent
 * [Kiali](https://kiali.io/):an observability console for Istio with service mesh configuration capabilities. It answers these questions: How are the microservices connected? How are they performing?
+* [Hopsworks](https://github.com/logicalclocks/hopsworks): online models have their predictions logged to a Kafka topic, and a Spark Streaming or Flink application monitors the model for concept drift, data drift, model drift, and other anomalies.
 #### Are we done?
diff --git a/images/file-formats.png b/images/file-formats.png
new file mode 100644
index 0000000..d9277f7
Binary files /dev/null and b/images/file-formats.png differ