This repo houses a script used to turn any Paperspace Ubuntu 20.04 based Linux instance into a fully functional machine learning environment for interactive development. Users are free to modify and run against their own Paperspace instance, be it Linux headless or Linux desktop. The only requirement is a working Nvidia GPU driver pre-installed. A pre-built template exists in the Paperspace eco-system as a Public Template based on a base Ubuntu 20.04 image.
We assume a generic advanced data science user who probably wants GPU access, but not any particular specialized subfield of data science such as computer vision or natural language processing. Such users can build upon this base to create their own stack, or we can create other VMs for subfields, similar to what can be done with Gradient containers.
Category | Software | Version | Install Method | Why / Notes |
---|---|---|---|---|
GPU | NVidia Driver | 515.65.01 | pre-installed | Enable Nvidia GPUs. Latest version as of VM creation date |
CUDA | 11.7.1 | Install script | Nvidia A100 GPUs require CUDA 11+ to work, so 10.x is not suitable | |
CUDA toolkit | 11.7.1 | Apt | Needed for nvcc command for cuDNN |
|
cuDNN | 8.5.0.*-1+cuda11.7 | Ubuntu repo | Nvidia GPU deep learning library | |
Infra | Docker Engine CE | 20.10.8, build 3967b7d | pre-installed | Docker Engine community edition |
NVidia Docker | 2.6.0-1 | pre-installed | Enable NVidia GPU in Docker containers | |
Python | Python | 3.9.14 | Apt | Most widely used programming language for data science |
pip3 | 22.2.2 | Apt | Enable easy installation of 1000s of other data science, etc., packages. | |
NumPy | 1.23.2 | pip3 | Handle arrays, matrices, etc., in Python | |
SciPy | 1.9.1 | pip3 | Fundamental algorithms for scientific computing in Python | |
Pandas | 1.4.4 | pip3 | De facto standard for data science data exploration/preparation in Python | |
Cloudpickle | 2.1.0 | pip3 | Makes it possible to serialize Python constructs not supported by the default pickle module | |
Matplotlib | 3.5.3 | pip3 | Widely used plotting library in Python for data science, e.g., scikit-learn plotting requires it | |
Ipython | 8.5.0 | pip3 | Provides a rich architecture for interactive computing | |
IPykernel | 6.15.2 | pip3 | Provides the IPython kernel for Jupyter. | |
IPywidgets | 8.0.2 | pip3 | Interactive HTML widgets for Jupyter notebooks and the IPython kernel | |
Cython | 0.29.32 | pip3 | Enables writing C extensions for Python | |
tqdm | 4.64.1 | pip3 | Fast, extensible progress meter | |
gdown | 4.5.1 | pip3 | Google drive direct download of big files | |
Pillow | 9.2.0 | pip3 | Python imaging library | |
seaborn | 0.12.0 | pip3 | Python visualization library based on matplotlib | |
SQLAlchemy | 1.4.40 | pip3 | Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL | |
spaCy | 3.4.1 | pip3 | library for advanced Natural Language Processing in Python and Cython | |
nltk | 3.7 | pip3 | Natural Language Toolkit (NLTK) is a Python package for natural language processing | |
boto3 | 1.24.66 | pip3 | Amazon Web Services (AWS) Software Development Kit (SDK) for Python | |
tabulate | 0.8.10 | pip3 | Pretty-print tabular data in Python | |
future | 0.18.2 | pip3 | The missing compatibility layer between Python 2 and Python 3 | |
gradient | 2.0.6 | pip3 | CLI and Python SDK for Paperspace Core and Gradient | |
jsonify | 0.5 | pip3 | Provides the ability to take a .csv file as input and outputs a file with the same data in .json format | |
opencv-python | 4.6.0.66 | pip3 | Includes several hundreds of computer vision algorithms | |
pyyaml | 5.4.1 | pip3 | YAML parser and emitter for Python | |
JupyterLab | 3.4.6 | pip3 | De facto standard for data science using Jupyter notebooks | |
wandb | 0.13.4 | pip3 | CLI and library to interact with the Weights & Biases API (model tracking) | |
Machine Learning | Scikit-learn | 1.1.2 | pip3 | Widely used ML library for data science, generally for smaller data or models |
Scikit-image | 0.19.3 | pip3 | Collection of algorithms for image processing | |
TensorFlow | 2.9.2 | pip3 | Most widely used deep learning library, alongside PyTorch | |
PyTorch | 1.12.1 | pip3 | Most widely used deep learning library, alongside TensorFlow | |
Jax | 0.3.17 | pip3 | Popular deep learning library brought to you by Google | |
Transformers | 4.21.3 | pip3 | Popular deep learning library for NLP brought to you by HuggingFace | |
Datasets | 2.4.0 | pip3 | A supporting library for NLP use cases and the Transformers library brought to you by HuggingFace | |
XGBoost | 1.6.2 | pip3 | An optimized distributed gradient boosting library | |
Sentence Transformers | 2.2.2 | pip3 | A ML framework for sentence, paragraph and image embeddings |
Information about license types:
Apache 2.0: https://opensource.org/licenses/Apache-2.0
MIT: https://opensource.org/licenses/MIT
New BSD: https://opensource.org/licenses/BSD-3-Clause
PSF = Python Software Foundation: https://en.wikipedia.org/wiki/Python_Software_Foundation_License
HPND = Historical Permission Notice and Disclaimer: https://opensource.org/licenses/HPND
ISC: https://opensource.org/licenses/ISC
Open source software can be used for commercial purposes: https://opensource.org/docs/osd#fields-of-endeavor.
Other software considered but not included.
The potential data science stack is far larger than any one person will use so we don't attempt to cover everything here.
Some generic categories of software not included:
- Non-data-science software
- Commercial software
- Software not licensed to be used on an available VM template
- Software only used in particular specialized data science subfields (although we assume our users probably want a GPU)
Category | Software | Why Not |
---|---|---|
Apache | Kafka, Parquet | |
Classifiers | libsvm | H2O contains SVM and GBM, save on installs |
Collections | ELKI, GNU Octave, Weka, Mahout | |
Connectors | Academic Torrents | |
Dashboarding | panel, dash, voila, streamlit | |
Databases | MySQL, Hive, PostgreSQL, Prometheus, Neo4j, MongoDB, Cassandra, Redis | No particular infra to connect to databases |
Deep Learning | Caffe, Caffe2, Theano, PaddlePaddle, Chainer, MXNet | PyTorch and TensorFlow are dominant, rest niche |
Deployment | Dash, TFServing, R Shiny, Flask | Use Gradient Deployments |
Distributed. | Horovod, OpenMPI | Use Gradient distributed |
Feature store | Feast | |
IDEs | PyCharm, Spyder, RStudio | |
Interpretability | LIME/SHAP, Fairlearn, AI Fairness 360, InterpretML | |
Languages | R, SQL, Julia, C++, JavaScript, Python2, Scala | Python is dominant for data science |
Monitoring | Grafana | |
NLP | GenSim | |
Notebooks | Jupyter, Zeppelin | JupyterLab includes Jupyter notebook |
Orchestrators | Kubernetes | Use Gradient cluster |
Partners | fast.ai | Could add if we want partner functionality |
Pipelines | AirFlow, MLFlow, Intake, Kubeflow | |
Python libraries | statsmodels, pymc3, geopandas, Geopy, LIBSVM | Too many to attempt to cover |
PyTorch extensions | Lightning | |
R packages | ggplot, tidyverse | Could add R if customer demand |
Recommenders | TFRS, scikit-surprise | |
Scalable | Dask, Numba, Spark 1 or 2, Koalas, Hadoop | |
TensorFlow | TF 1.15, Recommenders, TensorBoard, TensorRT | Could add TensorFlow 1.x if customer demand. Requires separate tensorflow-gpu for GPU support. |
Viz | Bokeh, Plotly, Holoviz (Datashader), Google FACETS, Excalidraw, GraphViz, ggplot2, d3.js |