Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use HEAD versions in presubmit CI #782

Closed
wants to merge 12 commits into from

Conversation

olupton
Copy link
Collaborator

@olupton olupton commented May 1, 2024

Following #770:

This is in preparation for the next stage of work where we will stop pinning versions and go back to running the presubmit on the head of everything.

Flip the switch so that presubmit CIs run bump.sh. Because now the URLREF_XXX build argument default values encode the upstream repository names + default branches, there is no need to duplicate that information in the manifest.yaml committed to this repository.

Leave the workflow for generating patches for Rosetta builds untouched for now, even though there is still some duplication.

The logic for tracking commits of pip-vcs transitive dependencies moves to pip-finalize.sh, which writes the commits used to a new /opt/manifest.d/pip-finalize.yaml. Commits pinned in the "input" manifest file are still respected.

Note that the Dockerfile changes included in #770 cause a new manifest to be written inside the containers, for example here is the file contained in the upstream-pax container from today's nightly (post-#770, pre-this-PR):

# cat /opt/manifest.d/git-clone.yaml
jax:
  url: https://github.com/google/jax.git
  tracking_ref: cbee909a3249fcf6ef9ff2f6708365ea72de29de
  latest_verified_commit: cbee909a3249fcf6ef9ff2f6708365ea72de29de
  mode: git-clone
xla:
  url: https://github.com/openxla/xla.git
  tracking_ref: cba9952b6c91b71adc18ad368ab801d3021ff49f
  latest_verified_commit: cba9952b6c91b71adc18ad368ab801d3021ff49f
  mode: git-clone
flax:
  url: https://github.com/google/flax.git
  tracking_ref: ae5d66d3ad7b54d7bd9544baade1551d4412044a
  latest_verified_commit: ae5d66d3ad7b54d7bd9544baade1551d4412044a
  mode: git-clone
transformer-engine:
  url: https://github.com/NVIDIA/TransformerEngine.git
  tracking_ref: a81786845c6d76c647d690f1148b31d207bffcc2
  latest_verified_commit: a81786845c6d76c647d690f1148b31d207bffcc2
  mode: git-clone
paxml:
  url: https://github.com/google/paxml.git
  tracking_ref: 9e25f0bd23783def8a665ebdb6805d530125b579
  latest_verified_commit: 9e25f0bd23783def8a665ebdb6805d530125b579
  mode: git-clone
praxis:
  url: https://github.com/google/praxis.git
  tracking_ref: 963189fa26647d5d216bb506317fae3d75429996
  latest_verified_commit: 963189fa26647d5d216bb506317fae3d75429996
  mode: git-clone

which can be fed back into the machinery if an old build needs to be reproduced.

@terrykong terrykong requested a review from ko3n1g May 1, 2024 18:11
Copy link
Contributor

@ko3n1g ko3n1g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall clear to me, was not able to spot issues. Also ran a quick test in a separate branch testing the logic to pin commits for reproducing older builds. LGTM.

I just think - as mentioned below - that some more documentation about manifest.yaml would be helpful. Especially now where it became more sparse it is difficult to learn by examples.

@@ -155,15 +159,15 @@ jobs:
# converts manifest yaml to a json object of {SOFTWARE_NAME: URL#REF, ...}
urlrefs=$(
cat .github/container/manifest.yaml |\
yq -o=json 'to_entries | .[] | select(.value.mode == "git-clone") | {( .key | upcase | sub("-", "_") ): .value.url + "#" + .value.latest_verified_commit}' |\
yq -o=json 'to_entries | .[] | select(.value.mode == "git-clone") | {( .key | upcase | sub("-", "_") ): .value.url + "#" + .value.commit}' |\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a bit suprising to me, that I had to add mode: git-clone in order to pull a non-HEAD commit until I found this line. Maybe some documentation about manifest.yaml would be helpful at some point but no high prio as most devs here are familiar with its mechanics I suppose

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the mode is still needed at all. But needing better documentation is an evergreen comment...

echo "${line}@${ref}"
# Find the VCS installs, which are of the form
# PACKAGE @ git+GIT_REPO_URL
for line in $(sed -n -e 's/^\([^#].*\) @ git+\(.*\)$/\1=\2/p' requirements.pre); do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the regex expression change? I do understand the logic afterwards (19:27) but just not how records in requirements.pre differ such that the regrex matcher needs to be changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before: it was just extracting PACKAGE. Now: it's extracting the GIT_REPO_URL as well.
This is because in the case that the package is not listed in the manifest file, we need the URL to query the current HEAD.

olupton added a commit that referenced this pull request May 2, 2024
It is still possible to manually launch the pipeline without the bump.
This is following on from #770. Further simplification will be possible.

#782 has become too large, this is the minimal change to make the
presubmit CI behave like the nightlies.
Copy link
Contributor

@ko3n1g ko3n1g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late notice, but there is a small issue around feeding URLREFs into the build-args. See down below for an example/explanation.

Comment on lines 165 to +170
# SOURCE_OVERRIDES is a comma-separated list of package=urlref pairs
IFS=, read -ra overrides <<< "${{ inputs.SOURCE_OVERRIDES }}"
for override in "${overrides[@]}"; do
PACKAGE=$(cut -d= -f 1 <<< "${override}" | tr '[:lower:]' '[:upper:]' | tr '-' '_')
URLREF=$(cut -d= -f 2- <<< "${override}")
urlrefs=$(echo "$urlrefs" | jq -c ". + {\"$PACKAGE\": \"$URLREF\"}")
urlrefs=$(echo "$urlrefs" | jq -c ". + {\"$PACKAGE\": \"URLREF_${PACKAGE}=$URLREF\"}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @olupton , can you walk me through the idea of how manifest.yaml and SOURCE_OVERRIDES are meant to be used together?

If I want to install FLAX on an older commit, I have two options: I can change the manifest.yaml, or use SOURCE_OVERRIDES. When is it recommended to use what?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea here is that there are two modes.

First, when you're using this machinery to test some feature or fix, you don't particularly want to commit/push to JAX-Toolbox, so you can inject your feature/fix branch using SOURCE_OVERRIDES, URLREF_XXX, etc.

Second, when you are preparing a release, you gradually want to converge on a fixed and reproducible set of versions. In that case, you would take some seed manifest.yaml output (perhaps from a nightly) and commit it to a release branch of JAX-Toolbox, and then iterate from there.

Ultimately I don't see why you would want to mix the two modes, and the second one ought to be pedantic. In mode one, you default to taking the HEAD and recording what that was. In mode two, forgetting to specify a version of something should maybe be an error.

Comment on lines 172 to 173
echo "SOURCE_URLREFS=${urlrefs}" >> $GITHUB_OUTPUT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urlrefs is currently just a mapping of $PACKAGE: $URL which we feed into the build-args. However, when we feed them into the build-args, the actual build-arg-key is missing. The mapping should be $PACKAGE: URLREF_$PACKAGE=$PACKAGE

@DwarKapex DwarKapex self-requested a review May 16, 2024 15:44
@@ -1,32 +1,14 @@
jax:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but you suggest to use default values for jax ref#commit from Dockerfile rather than from manifest file?
What is the scenario for NGC release? Update Dockerfiles accordingly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely yes for the normal CI: by default we take the Dockerfile defaults, which is basically "HEAD".

When preparing releases, one option (noted here: #782 (comment)) is to commit a manifest file to the release branch and use that. Your suggestion of changing the default values in the Docker files in the release branches would also work, I think, and would require less machinery (to translate manifest --> URLREF build arg) but might be a little more tedious to seed at the beginning of the release process.

cc: @terrykong @ko3n1g

@DwarKapex
Copy link
Contributor

LGTM

@nouiz
Copy link
Collaborator

nouiz commented May 17, 2024

This is in preparation for the next stage of work where we will stop pinning versions and go back to running the presubmit on the head of everything.

Why this change? It was a request to have PR be tested with fixed dependency. Why change this now?

@olupton
Copy link
Collaborator Author

olupton commented May 17, 2024

This is in preparation for the next stage of work where we will stop pinning versions and go back to running the presubmit on the head of everything.

Why this change? It was a request to have PR be tested with fixed dependency. Why change this now?

Because it didn't work. The previous architecture relied on the same commit of JAX-Toolbox being able to build both the nightlies and the "last known good" state of pinned versions. In reality, those two states of the world did not track closely enough to one another for that requirement to be viable.

It was annoying in other ways too, but I think the above is the fundamental point.

@olupton olupton closed this Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants