Adds new artifacts colab. #526

Merged
merged 27 commits into from
Jul 25, 2024

Conversation

katjacksonWB
Contributor

Adds a new Artifacts colab to replace the old, outdated one linked on the Artifacts landing page.

github-actions bot commented May 15, 2024

Thanks for contributing to wandb/examples!
We appreciate your efforts in opening a PR for the examples repository. Our goal is to ensure a smooth and enjoyable experience for you 😎.

Guidelines

The examples repo is regularly tested against the ever-evolving ML stack. To facilitate our work, please adhere to the following guidelines:

  • Notebook naming: You can use a combination of snake_case and CamelCase for your notebook name. Avoid using spaces (replace them with _) and special characters (&%$?). For example:
Cool_Keras_integration_example_with_weights_and_biases.ipynb 

is acceptable, but

Cool Keras Example with W&B.ipynb

is not. Avoid spaces and the & character. To refer to W&B, you can use: weights_and_biases or just wandb (it's our library, after all!)

  • Managing dependencies within the notebook: You may need to set up dependencies to ensure that your code works. Please avoid the following practices:

    • Docker-related activities. If Docker installation is required, consider adding a full example with the corresponding Dockerfile to the wandb/examples/examples folder (where non-Colab examples reside).
    • Mixing pip install with other code in the same cell. Use pip install as the primary method to install packages, but keep each pip cell to the pip call only. We automatically filter these cells, and any other statements in them might break the automatic testing of the notebooks. For example,
    pip install -qU wandb transformers gpt4
    

    is acceptable, but

    pip install -qU wandb
    import wandb

    is not.

    • Installing packages from a GitHub branch. Although it's acceptable 😎 to directly obtain the latest bleeding-edge libraries from GitHub, did you know that you can install them like this:
    !pip install -q git+https://github.com/huggingface/transformers

    You don't need to clone, then cd into the repo and install it in editable mode.

    • Avoid referencing specific Colab directories. Google Colab has a /content directory where everything resides. Avoid explicitly referencing this directory because we test our notebooks with pure Jupyter (without Colab). Instead, use relative paths to make the notebook reproducible.
  • The Jupyter notebook file .ipynb is nothing more than a JSON file with primarily two types of cells: markdown and code. There is also a bunch of other metadata specific to Google Colab. We have a set of tools to ensure proper notebook formatting. These tools can be found at wandb/nb_helpers.
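As a rough illustration of the guidelines above, here is a minimal sketch that treats the notebook as plain JSON and flags the three issues described: an unsafe file name, a pip cell that also runs other code, and a hard-coded Colab /content path. The function names and the exact rules (for example, allowing only letters, digits and underscores in the name) are assumptions made for this sketch; it is not the filter used by the repository's tests or by wandb/nb_helpers.

```python
import json
import re

# Assumption for this sketch: only letters, digits and underscores are "safe".
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+\.ipynb$")


def is_valid_notebook_name(filename: str) -> bool:
    """True when the name has no spaces or special characters."""
    return VALID_NAME.match(filename) is not None


def load_notebook(path: str) -> dict:
    """A .ipynb file is JSON, so the standard json module can read it."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def code_cell_lines(notebook: dict):
    """Yield (cell index, non-empty stripped lines) for each code cell."""
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        text = source if isinstance(source, str) else "".join(source)
        yield index, [line.strip() for line in text.splitlines() if line.strip()]


def is_pip_line(line: str) -> bool:
    """Matches `pip install`, `!pip install` and `%pip install`."""
    return line.lstrip("!%").startswith("pip install")


def mixed_pip_cells(notebook: dict) -> list:
    """Indices of code cells that mix pip install with any other line."""
    mixed = []
    for index, lines in code_cell_lines(notebook):
        pip_flags = [is_pip_line(line) for line in lines]
        if any(pip_flags) and not all(pip_flags):
            mixed.append(index)
    return mixed


def colab_path_cells(notebook: dict) -> list:
    """Indices of code cells that reference the Colab /content directory."""
    return [
        index
        for index, lines in code_cell_lines(notebook)
        if any("/content/" in line for line in lines)
    ]
```

Note that this sketch also counts a comment line inside a pip cell as "other code", which is stricter than the guideline requires.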

Before merging, wait for a maintainer to clean and format the notebooks you're adding. You can tag @tcapelle.

Before marking the PR as ready for review, please run your notebook one more time: restart the Colab and run all cells. We will provide you with links to open the Colabs below.

The following colabs were changed:
- colabs/wandb-artifacts/Artifact_fundamentals.ipynb

@noaleetz
Contributor

Hey @katjacksonWB - some feedback from reviewing the whole thing:

  • I think it's important to cover how to version an artifact; otherwise the colab doesn't really show the utility of logging your stuff to an artifact. It can be a minimal example, like adding a few new images and showing that a new version is created.
  • The colab should also showcase how someone can navigate to the UI for the logged artifact, and link to a public project that the user can look at to understand how the actions done in the colab are reflected in the UI (the SDK prints out a URL, so we should show the user how to find it to get to the UI from their log_artifact command).
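A minimal sketch of what such a versioning example could look like, assuming wandb is installed and the user is logged in. The project and artifact names come from the notebook under review; the helper names, the `images` directory and the placeholder file are hypothetical stand-ins, and this is not the code that ended up in the colab.

```python
from pathlib import Path


def add_placeholder_file(data_dir: str, filename: str,
                         payload: bytes = b"placeholder") -> Path:
    """Write a small stand-in file so the directory contents change."""
    target = Path(data_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def log_dataset_version(data_dir: str,
                        project: str = "artifact-basics",
                        name: str = "my_first_artifact") -> tuple:
    """Log data_dir as a dataset artifact and return (version, run URL)."""
    import wandb  # imported lazily; needs `pip install wandb` and a login

    with wandb.init(project=project) as run:
        artifact = wandb.Artifact(name=name, type="dataset")
        artifact.add_dir(data_dir)
        logged = run.log_artifact(artifact)
        logged.wait()  # block until the upload finishes so .version is set
        # run.url is the run page, one way to get from the notebook to the UI
        return logged.version, run.url


# Hypothetical usage, run in separate notebook cells:
#   add_placeholder_file("images", "first.bin")
#   first_version, url = log_dataset_version("images")
#   add_placeholder_file("images", "second.bin")   # contents changed
#   second_version, _ = log_dataset_version("images")
#   print(first_version, second_version, url)
```

Logging the same directory again after its contents change should produce a new version rather than replace the earlier one, which is the behaviour the colab is being asked to demonstrate.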

Contributor

@noaleetz noaleetz left a comment

left a comment with some requested changes!

Contributor

@noaleetz noaleetz left a comment

left another round of comments!

@katjacksonWB katjacksonWB requested a review from noaleetz May 23, 2024 19:59
@tcapelle
Collaborator

Ping me when ready for final review/merge

@noaleetz noaleetz requested review from rymc and removed request for moredatarequired May 28, 2024 19:10
Collaborator

@tcapelle tcapelle left a comment

Why don't we show the classic:

import wandb

# you can log using the one-liner (needs an active run from wandb.init()):
wandb.log_artifact("file.csv", name="my_artifact", type="data")

# or build the artifact explicitly
at = wandb.Artifact(name="my_artifact", type="data")
at.add_file("file.csv")
# add_dir(...)
wandb.log_artifact(at)
  • Shouldn't this live in wandb-artifacts?
  • Please also replace the "old outdated one"

@noaleetz
Contributor

Why don't we show the classic:

import wandb

# you can log using the one-liner (needs an active run from wandb.init()):
wandb.log_artifact("file.csv", name="my_artifact", type="data")

# or build the artifact explicitly
at = wandb.Artifact(name="my_artifact", type="data")
at.add_file("file.csv")
# add_dir(...)
wandb.log_artifact(at)
  • Shouldn't this live in wandb-artifacts?
  • Please also replace the "old outdated one"

Is this referring to a specific line?

Contributor

@noaleetz noaleetz left a comment

last request on lineage

"source": [
"You can also manage your Artifacts via the W&B platform. This can give you insight into your model's performance or dataset versioning. To navigate to the relevant information, click this [link](https://wandb.ai/wandb/artifact-basics/overview), then click on the **Artifacts** tab.\n",
"\n",
"Navigating to the **Lineage** section in the tab will show the dependency graph formed by calling `run.use_artifact()` when an Artifact is an input to a run, and `run.log_artifact()` when an Artifact is output to a run. This helps visualize the relationship between different model versions and other objects like datasets and jobs in your project. Click [this](https://wandb.ai/wandb/artifact-basics/artifacts/dataset/my_first_artifact/v0/lineage) link to navigate to the project's lineage page."
Contributor

can you make sure we include a screenshot of a more complex lineage for a user to explore, and also link the relevant project (probably the artifact workflow project)

Collaborator

Yep, I am missing some screenshots. Add those either from the docs or upload them as files in the same folder.
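To make the lineage discussion concrete, here is a minimal sketch of a single run that has one input edge and one output edge in the dependency graph. It assumes wandb is installed and that `my_first_artifact` already exists in the `artifact-basics` project; the function name, the `my_first_model` artifact, and the text file standing in for a trained model are hypothetical.

```python
from pathlib import Path


def train_on_dataset(project: str = "artifact-basics",
                     dataset: str = "my_first_artifact:latest",
                     model_path: str = "model.txt") -> None:
    """Consume one artifact and produce another inside a single run."""
    import wandb  # imported lazily; needs `pip install wandb` and a login

    with wandb.init(project=project, job_type="train") as run:
        # Input edge: the dataset is recorded as an input to this run.
        data = run.use_artifact(dataset)
        data_dir = data.download()

        # Stand-in for real training: write a file derived from the input.
        Path(model_path).write_text(f"trained on {data_dir}\n")

        # Output edge: the model is recorded as an output of this run.
        model = wandb.Artifact(name="my_first_model", type="model")
        model.add_file(model_path)
        run.log_artifact(model)
```

After a run like this, the Lineage view for either artifact should show the dataset feeding the run and the run producing the model, which gives a screenshot slightly richer than a single logged artifact.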

" inplace=True)\n",
"csvData.to_csv(\"/content/sample_data/california_housing_test.csv\") # overwrites file with the sorted data\n",
"# adds the new file to the artifact\n",
"run = wandb.init(project=\"artifact-basics\")\n",

I think it would be better here to init the run at the start of the code block.

Collaborator

  • maybe split in 2 cells.
  • csv_data instead of csvData.

"cell_type": "markdown",
"metadata": {},
"source": [
"Now the sorted file will be logged in `my_first_artifact`. Any changes you log to an artifact will overwrite any older version. \n",

Hmm, the wording about overwriting old versions may be confusing to users (lines 228 and 236): we don't really overwrite the old version, we create a new version instead. To me, overwriting implies it replaces the previous one.
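One way to check this point is to fetch an early version and the latest version side by side through the public API. The sketch below assumes wandb is installed, that the project lives under the `wandb` entity as in the linked URLs, and that at least one version of `my_first_artifact` has been logged; the helper names are hypothetical.

```python
def artifact_path(entity: str, project: str, name: str, alias: str) -> str:
    """Build the "entity/project/name:alias" string used to look up an artifact."""
    return f"{entity}/{project}/{name}:{alias}"


def compare_versions(entity: str = "wandb",
                     project: str = "artifact-basics",
                     name: str = "my_first_artifact") -> dict:
    """Fetch v0 and latest without starting a run, and report both."""
    import wandb  # imported lazily; needs `pip install wandb` and a login

    api = wandb.Api()
    first = api.artifact(artifact_path(entity, project, name, "v0"))
    latest = api.artifact(artifact_path(entity, project, name, "latest"))
    return {
        "first": (first.version, first.digest),
        "latest": (latest.version, latest.digest),
    }
```

If the sorted file was logged as a second version, `v0` is still retrievable and its digest differs from the latest one, so "creates a new version" describes the behaviour more accurately than "overwrites".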

"artifact = run.use_artifact(artifact_or_name=\"my_first_artifact:latest\")\n",
"# This will download the specified artifact to where your code is running\n",
"datadir = artifact.download()\n",
"run.finish()\n",

I would suggest moving the run.finish() to the end of the block as it is usually better practice (e.g., in this case, it would capture what is printed out)
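The suggestion could be sketched as follows, with the run finished as the very last step so that anything printed in the cell happens while the run is still active. This assumes wandb is installed; the function name and the print statement are illustrative additions, not part of the notebook.

```python
def download_latest(project: str = "artifact-basics",
                    name: str = "my_first_artifact") -> str:
    """Download the latest version of an artifact, finishing the run last."""
    import wandb  # imported lazily; needs `pip install wandb` and a login

    run = wandb.init(project=project)
    try:
        artifact = run.use_artifact(f"{name}:latest")

        # This will download the specified artifact to where your code is running
        data_dir = artifact.download()
        print(f"Downloaded to {data_dir}")  # printed before the run is closed
        return data_dir
    finally:
        run.finish()  # runs last, even if one of the steps above raises
```

In a notebook cell the try/finally can be dropped and `run.finish()` simply placed on the final line; the wrapper is only there so the sketch cleans up after an error.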

"source": [
"You can also manage your Artifacts via the W&B platform. This can give you insight into your model's performance or dataset versioning. To navigate to the relevant information, click this [link](https://wandb.ai/wandb/artifact-basics/overview), then click on the **Artifacts** tab.\n",
"\n",
"Navigating to the **Lineage** section in the tab will show the dependency graph formed by calling `run.use_artifact()` when an Artifact is an input to a run, and `run.log_artifact()` when an Artifact is output to a run. This helps visualize the relationship between different model versions and other objects like datasets and jobs in your project. Click [this](https://wandb.ai/wandb/artifact-basics/artifacts/dataset/my_first_artifact/v0/lineage) link to navigate to the project's lineage page."

I think the following wording here may be a little confusing to some users:

"run.log_artifact() when an Artifact is output to a run"

I think "to a run" should be "of a run"?

@tcapelle
Collaborator

Why don't we show the classic:

import wandb

# you can log using the one-liner (needs an active run from wandb.init()):
wandb.log_artifact("file.csv", name="my_artifact", type="data")

# or build the artifact explicitly
at = wandb.Artifact(name="my_artifact", type="data")
at.add_file("file.csv")
# add_dir(...)
wandb.log_artifact(at)
  • Shouldn't this live in wandb-artifacts?
  • Please also replace the "old outdated one"

Is this referring to a specific line?

This plan and the code don't match:
image

You are doing 2+3+4 in one line.

@tcapelle
Collaborator

I would also use this PR to remove/replace old stuff in wandb-artifacts (and put this file in there as a getting started)

"outputs": [],
"source": [
"!pip install wandb\n",
"import wandb\n",
Collaborator

split pip install from import wandb please

"The general workflow for creating an Artifact is:\n",
"\n",
"\n",
"1. Intialize a run.\n",
Collaborator

I just feel like the code doesn't follow this plan.

"source": [
"run = wandb.init(project=\"artifact-basics\")\n",
"artifact = run.use_artifact(artifact_or_name=\"my_first_artifact:latest\")\n",
"# This will download the specified artifact to where your code is running\n",
Collaborator

Add a blank line before the comment for readability.

@noaleetz
Contributor

noaleetz commented Jul 7, 2024

Hey @rymc - would you be up to revising the changes you proposed directly? Katherine is out on medical leave so I am trying to get some support with wrapping up her in-flight docs PR so we can get the new and improved Artifacts colab out. If it is easy enough to make those fixes directly that would be a huge help so I can focus on some of the other docs work.

@rymc

rymc commented Jul 8, 2024

Hey @noaleetz done. Addressed comments and confirmed working on Colab.

@noaleetz
Contributor

noaleetz commented Jul 8, 2024

Hey @noaleetz done. Addressed comments and confirmed working on Colab.

Ryan you are awesome, thank you so so much. I will give the colab a final run myself, but we should be good to merge. @ngrayluna can I ask you to give your review and stamp as well?

Contributor

@ngrayluna ngrayluna left a comment

Blocker: Notebook needs to be executable

"outputs": [],
"source": [
"run = wandb.init(project=\"artifact-basics\")\n",
"run.log_artifact(artifact_or_path=\"/content/sample_data/mnist_test.csv\", name=\"my_first_artifact\", type=\"dataset\")\n",
Contributor

Notebooks need to be executable...we'll want to use a real dataset before merging this in.

@rymc

rymc commented Jul 10, 2024

Ah, good point @ngrayluna. I've pushed a new version that makes the Colab notebook executable regardless of where it runs.
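The thread does not show how the pushed version achieves this. One possible approach, sketched here purely as an assumption, is to have the notebook generate a small dataset at a relative path instead of reading from Colab's /content/sample_data directory. The function name, file path and column names below are hypothetical.

```python
import csv
import random
from pathlib import Path


def make_sample_csv(path: str = "sample_data/sample.csv",
                    rows: int = 100,
                    seed: int = 0) -> Path:
    """Create a small CSV at a relative path so the notebook runs anywhere."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)

    rng = random.Random(seed)  # seeded so reruns produce the same file
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])  # header row
        for index in range(rows):
            writer.writerow([index, round(rng.random(), 4)])
    return target
```

The returned path can then be passed to `run.log_artifact(...)` in place of the hard-coded Colab file, so the same cell works in Colab and in plain Jupyter.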

@ngrayluna
Contributor

PR for small nits: #548

@tcapelle
Collaborator

can you make both wandbcode consistent?

@ngrayluna
Contributor

can you make both wandbcode consistent?

Not sure I follow?

@ngrayluna ngrayluna merged commit f46ecee into master Jul 25, 2024
2 checks passed
5 participants