diff --git a/colabs/wandb-artifacts/Artifact_fundamentals.ipynb b/colabs/wandb-artifacts/Artifact_fundamentals.ipynb new file mode 100644 index 00000000..da426a18 --- /dev/null +++ b/colabs/wandb-artifacts/Artifact_fundamentals.ipynb @@ -0,0 +1,429 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Weights\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Use [Weights & Biases](https://wandb.com) for machine learning experiment tracking, dataset and model versioning and management, collaboration and more.\n", + "\n", + "
\n", + "\n", + "\"Weights\n", + "\n", + "
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Use W&B Artifacts to track and version data as the inputs and outputs of your W&B Runs. In addition to logging hyperparameters, metadata, and metrics to a run, you can use an artifact to log the dataset used to train the model as input and the resulting model checkpoints as outputs.\n", + "\n", + "![Artifact Simple Diagram](https://docs.wandb.ai/assets/images/artifacts_landing_page2-05443aa39ae53cede7b08908688b334a.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set Up" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In order to use Weights & Biases, you will need the `wandb` package installed. You can install it as follows within Colab." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install wandb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once it is installed, the next step is to import it into your script or notebook with `import wandb`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import wandb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We also need to authenticate to the Weights & Biases server. There are various ways of doing this, including for [remote or non-interactice workflows](https://docs.wandb.ai/guides/track/environment-variables), but given this is running interactively, we can use `wandb.login()`.\n", + "\n", + "If we are not already authenticated, a link will appear which you can use to do so." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "wandb.login()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create a Dataset\n", + "Let's create some datasets that we can work with in this example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import numpy as np\n", + "import csv\n", + "\n", + "directory = \"dataset\"\n", + "os.makedirs(directory, exist_ok=True)\n", + "file1, file2 = os.path.join(directory, \"file1.csv\"), os.path.join(directory, \"file2.csv\")\n", + "\n", + "def generate_dummy_data(num_samples):\n", + " data = [\n", + " np.random.normal(50, 10, num_samples),\n", + " np.random.randint(1, 100, num_samples),\n", + " np.random.choice(['A', 'B', 'C', 'D'], num_samples),\n", + " np.random.uniform(0.0, 1.0, num_samples)\n", + " ]\n", + " return zip(*data)\n", + "\n", + "def save_to_csv(file, data):\n", + " with open(file, 'w', newline='') as f:\n", + " writer = csv.writer(f)\n", + " writer.writerow(['feature1', 'feature2', 'feature3', 'feature4'])\n", + " writer.writerows(data)\n", + "\n", + "num_samples = 100\n", + "save_to_csv(file1, generate_dummy_data(num_samples))\n", + "save_to_csv(file2, generate_dummy_data(num_samples))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create An Artifact" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The general workflow for creating an Artifact is:\n", + "\n", + "\n", + "1. Intialize a run.\n", + "2. Create an Artifact.\n", + "3. Add a any files or directories to the new Artifact that you want to track and version.\n", + "4. Log the artifact in the W&B platform.\n", + "\n", + "The most straightforward way of accomplishing this is the second line of code in the example below, which will log, track and version a new dataset (i.e. do points 2, 3, and 4 above in one step)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run = wandb.init(project=\"artifact-basics\")\n", + "run.log_artifact(artifact_or_path=f\"{directory}/file1.csv\", name=\"my_first_artifact\", type=\"dataset\")\n", + "run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above example we first initalize a run using [`wandb.init()`](https://docs.wandb.ai/ref/python/init) the `artifact-basics` project. If this project doesn't exist, it will be created. If it alreadt exists, a new W&B Run will be added to it.\n", + "\n", + "\n", + "In the second line we actually log the Artifact with [`run.log_artifact()`](https://docs.wandb.ai/ref/python/public-api/run#log_artifact). In this example, we use three common arguments to the function.\n", + "1. With `artifact_or_path` we specifiy the path to where the data we want to version exists. Any file or directory can be added here.\n", + "2. with `name` we give the artifact a name within Weights & Biases that we will use to access it.\n", + "3. With `type` we give the artifact a higher level grouping. For example, we may have multiple artifacts of type data, and multiple artifacts of type model.\n", + "\n", + "\n", + "See the [Artifacts Reference](https://docs.wandb.ai/ref/python/artifact) guide for more information and other commonly used arguments, including how to store additional metadata.\n", + "\n", + "Each time the above `log_artifact` is executed, wandb will create a new version of the Artifact within Weights & Biases if the underlying data has changed.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An alternative approach that offers more control (at the expense of more lines of code) can be seen below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run = wandb.init(project=\"artifact-basics\")\n", + "\n", + "artifact = wandb.Artifact(\"my_first_artifact\", type=\"dataset\")\n", + "# the below will add two individual files to the artifact.\n", + "artifact.add_file(local_path=f\"{directory}/file1.csv\")\n", + "artifact.add_file(local_path=f\"{directory}/file2.csv\")\n", + "# or the below if you wanted to add the entire directory contents.\n", + "artifact.add_dir(local_path=f\"{directory}\")\n", + "# explictly log the artifact to Weights & Biases.\n", + "run.log_artifact(artifact)\n", + "\n", + "wandb.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above example, lines 3-5 will create a new Artifact within your Weights & Biases project. With the resulting artifact object, you can call the [`artifact.add_file`](https://docs.wandb.ai/ref/python/artifact#add_file) or [`artifact.add_dir`](https://docs.wandb.ai/ref/python/artifact#add_dir) functions in order to add as many files and directories to the Artifact as you want. Once added, the artifact must then be explictly logged to Weights & Biases." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use an Artifact" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you want to use a specific version of an Artifact in a downstream task, you can specify the specific version you would like to use via either `v0`, `v1`, `v2` and so on, or via specific aliases you may have added. The `latest` alias always refers to the most recent version of the Artifact logged.\n", + "\n", + "The proceeding code snippet specifies that the W&B Run will use an artifact called `my_first_artifact` with the alias `latest`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run = wandb.init(project=\"artifact-basics\")\n", + "artifact = run.use_artifact(artifact_or_name=\"my_first_artifact:latest\") # this creates a reference within Weights & Biases that this artifact was used by this run.\n", + "path = artifact.download() # this downloads the artifact from Weights & Biases to your local system where the code is executing.\n", + "print(f\"Data directory located at {path}\")\n", + "run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For more information on ways to customize your Artifact download, including via the command line, see the [Download and Usage guide](https://docs.wandb.ai/guides/artifacts/download-and-use-an-artifact)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create a new Artifact version" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's say we want to modify our dataset while also tracking and versioning these changes. In the below example we will subsample our dataset and save it as a new file. We will use the [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) library to read our CSV file.\n", + "\n", + "In the second block of code we will log it to Weights & Biases under the same Artifact name (*my_first_artifact*) so that Weights & Biases knows that this is a new version of an existing artifact." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas\n", + "df = pandas.read_csv(f\"{directory}/file1.csv\")\n", + "# subsample to 50% of the original size\n", + "df_subsampled = df.sample(frac=0.5, random_state=1)\n", + "# save the subsampled dataframe to a new file.\n", + "df_subsampled.to_csv(f\"{directory}/file1.csv\", index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we have a new subsampled version of our dataset locally, we can log the new version to Weights & Biases." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run = wandb.init(project=\"artifact-basics\")\n", + "run.log_artifact(artifact_or_path=f\"{directory}/file1.csv\", name=\"my_first_artifact\", type=\"dataset\", aliases =[\"subsampled\"])\n", + "run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now the sampled dataset will be logged to the `my_first_artifact` Artifact as a new version.\n", + "\n", + "The Artifact has also been given a custom `alias`, which is a unique label for this Artifact version. While the `alias` is currently `subsampled`, the default aliases is `vN`, where `N` is the number of versions the Artifact has. This increments automatically. You can always access specific versions of an Artifact by using an alias." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Update Artifact version metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can update the `description`, `metadata`, and `alias` of an artifact on the W&B platform during or outside a W&B Run.\n", + "\n", + "\n", + "This example changes the `description` of the `my_first_artifact` artifact inside a run:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run = wandb.init(project=\"artifact-basics\")\n", + "artifact = run.use_artifact(artifact_or_name=\"my_first_artifact:subsampled\")\n", + "artifact.description = \"This is an edited description.\"\n", + "artifact.metadata = {\"source\": \"local disk\", \"internal data owner\": \"platform team\"}\n", + "artifact.save() # persists changes to an Artifact's properties\n", + "run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use the Artifact within your pipelines\n", + "Once the artifact is tracked and versioned within Weights & Biases it's now easy to integrate it into your ML workflows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run = wandb.init(project=\"artifact-basics\")\n", + "artifact = run.use_artifact(artifact_or_name=\"my_first_artifact:latest\")\n", + "# the below is left as an exercise to the reader\n", + "# train model\n", + "# log model as artifact\n", + "run.finish()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Navigate the Artifacts UI" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also manage your Artifacts via the W&B platform. This can give you insight into your model's performance or dataset versioning. To navigate to the relevant information, click this [link](https://wandb.ai/wandb/artifact-basics/overview), then click on the **Artifacts** tab.\n", + "\n", + "Navigating to the **Lineage** section in the tab will show the dependency graph formed by calling `run.use_artifact()` when an Artifact is an input to a run, and `run.log_artifact()` when an Artifact is output to a run. This helps visualize the relationship between different model versions and other objects like datasets and jobs in your project. Click [this](https://wandb.ai/wandb/artifact-basics/artifacts/dataset/my_first_artifact/v0/lineage) link to navigate to the project's lineage page." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Naturally, as you integrate W&B Artifacts into your workflow, lineage graphs such as [this interactive example](https://wandb.ai/wandb-smle/artifact_workflow/artifacts/model/quant_model/v16/lineage) will be built up over time, giving you reproducibility, governance, and auditability.\n", + "\n", + "![Artifact Lineage Example](https://docs.wandb.ai/assets/images/lineage2a-e3fe54c8916c90499aaf3e1e289062bb.gif)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Next steps\n", + "1. [Artifacts Python reference documentation](https://docs.wandb.ai/ref/python/artifact): Deep dive into artifact parameters and advanced methods.\n", + "2. [Lineage](https://docs.wandb.ai/guides/artifacts/explore-and-traverse-an-artifact-graph): View lineage graphs, which are automatically built when using W&B artifact system, providing an auditable visual overview of the relationships between specific artifact versions, datasets models and runs.\n", + "3. [Model Registry](https://docs.wandb.ai/guides/model_registry): Learn how to centralize your best artifact versions in a shared registry.\n", + "4. [Artifact Automations](https://docs.wandb.ai/guides/artifacts/project-scoped-automations): Automatically run specific Weights & Biases jobs based on changes to your artifacts, such as automatically training a new model each time a new version of the training data is logged.\n", + "5. [Reference Artifacts](https://docs.wandb.ai/guides/artifacts/track-external-files#download-a-reference-artifact): Track files saved outside the W&B server, like Amazon S3 buckets, GCS buckets, Azure blobs, and more." + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "include_colab_link": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}