Add basic example of NIM with Run.ai inference #81
Open

mlsorensen wants to merge 7 commits into NVIDIA:main from mlsorensen:runai-example

Commits (7)
- 67e1d3c Add basic example of NIM with Run.ai inference
- 85c3a97 Update READMEs
- b219df9 Minor doc fixes and optimizations
- b2bb104 Add PVC based example to Run.ai
- 6e8c6f1 Fixes
- c996ee8 Move run.ai to docs, address comments
- 48644bb Add air-gapped text
# NIMs on Run.ai

[Run.ai](https://www.run.ai/) provides a platform for accelerating AI development, with life-cycle support spanning from concept to deployment of AI workloads. It layers on top of Kubernetes, starting with a single cluster and extending to centralized multi-cluster management. It provides a UI, GPU-aware scheduling, container orchestration, node pooling, organizational resource-quota management, and more. It offers administrators, researchers, and developers tools to manage resources across multiple Kubernetes clusters, subdivide them across projects and departments, and automate Kubernetes primitives with its own AI-optimized resources.

## Run.ai Deployment Options

The Run.ai Control Plane is available as a [hosted service](https://docs.run.ai/latest/home/components/#runai-control-plane-on-the-cloud) or as a [self-hosted](https://docs.run.ai/latest/home/components/#self-hosted-control-plane) option (including in disconnected, "air-gapped" environments). In either case, the control plane can manage clusters equipped with the Run.ai cluster engine, whether local or remotely cloud hosted.

## Prerequisites

1. A conformant Kubernetes cluster ([Run.ai Kubernetes version requirements](https://docs.run.ai/latest/admin/overview-administrator/))
2. Run.ai Control Plane and cluster(s) [installed](https://docs.run.ai/latest/admin/runai-setup/cluster-setup/cluster-install/) and operational
3. [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) installed
4. General NIM requirements: [NIM Prerequisites](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#prerequisites)
5. An NVIDIA AI Enterprise (NVAIE) license: [sign up for an NVAIE license](https://build.nvidia.com/meta/llama-3-8b-instruct?snippet_tab=Docker&signin=true&integrate_nim=true&self_hosted_api=true) or [request a free 90-day NVAIE license](https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=NVAIEnterprise) through the NVIDIA Developer Program
6. An NVIDIA NGC API key: follow the guidance in the [NVIDIA NIM Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#option-2-from-ngc) documentation to generate a properly scoped API key if you haven't already

## InferenceWorkload

Run.ai provides an [InferenceWorkload](https://docs.run.ai/latest/Researcher/workloads/inference-overview/) resource to help automate inference services like NIMs. It leverages [Knative](https://github.com/knative) to automate the underlying service and the routing of traffic. YAML examples can be found [here](https://docs.run.ai/latest/developer/cluster-api/submit-yaml/#inference-workload-example).

Note that InferenceWorkload is an optional add-on for Run.ai. Consult your Run.ai UI portal or cluster administrator to determine which clusters support it.

### Basic Example

At its core, running NIMs with InferenceWorkload is quite simple. Many customizations are possible, however, such as environment variables, PVCs to cache models, health checks, and other configuration that passes through to the pods backing the service. The `examples` directory can evolve over time with more complex deployment examples; the following example is a bare-minimum configuration.

This example can also be deployed through the [UI](https://docs.run.ai/latest/Researcher/workloads/inference-overview/), including creating the secret and the InferenceWorkload.

**Preparation**:
* A Run.ai project (and its corresponding Kubernetes namespace, which is the project name prefixed with `runai-`). You should be set up to run `kubectl` commands against the target cluster and namespace.
* An NGC API key
* `curl` and `jq` for the test script
* A Docker registry secret for `nvcr.io` must exist in your Run.ai project. This can only be created through the UI, in the "Credentials" section: add a new docker-registry credential, scope it to your project, set the username to `$oauthtoken` and the password to your NGC API key, and set the registry URL to `nvcr.io`. This only needs to be done once per scope; Run.ai detects and uses the credential when it is needed.

1. Deploy the InferenceWorkload to your current Kubernetes context via Helm, from the same working directory as this README, setting the necessary environment variables:
```
% export NAMESPACE=[namespace]
% export NGC_KEY=[ngc key]
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-1 examples/basic-llama
```
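The `--set` flags feed values into the chart's Go templates, which reference them as `{{ .Values.namespace }}` and `{{ .Values.ngcKey }}`. As a loose illustration of the substitution (Helm's real rendering uses Go templates, not `sed`; the namespace value here is a placeholder):

```shell
# Loose sketch of what `helm install --set namespace=...` accomplishes:
# command-line values replace the template references in the chart files.
NAMESPACE="runai-myproject"          # placeholder project namespace
template='namespace: {{ .Values.namespace }}'
printf '%s\n' "$template" | sed "s/{{ .Values.namespace }}/${NAMESPACE}/"
# prints: namespace: runai-myproject
```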

Now wait for the InferenceWorkload's ksvc to become ready:

```
% kubectl get ksvc basic-llama -o wide --watch
NAME URL LATESTCREATED LATESTREADY READY REASON
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 Unknown RevisionMissing
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown RevisionMissing
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown IngressNotConfigured
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown Uninitialized
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 True
```
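If you would rather script this wait than watch interactively, the READY column can be checked from the same output. A minimal sketch (the `ksvc_ready` helper and the sample output are illustrative, not part of the examples directory; it assumes the column layout shown above, where READY is the fifth column once LATESTREADY is populated and REASON is empty):

```shell
# Return success once a `kubectl get ksvc <name> -o wide` result (header
# plus one row, passed as a string) reports READY=True in the fifth column.
ksvc_ready() {
  printf '%s\n' "$1" | awk 'NR==2 {print $5}' | grep -qx True
}

# Sample of the output shape; in practice: "$(kubectl get ksvc basic-llama -o wide)"
sample="NAME URL LATESTCREATED LATESTREADY READY REASON
basic-llama http://basic-llama.example.com basic-llama-00001 basic-llama-00001 True"
ksvc_ready "$sample" && echo "ready"
# prints: ready
```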

2. Query your new inference service

As seen above, you get a new Knative service accessible via hostname-based routing. Pass the hostname from this URL to the test script by setting the environment variable `LHOST`:

```
% export LHOST="basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com"
% ./examples/query-llama.sh
Here's a song about pizza:

**Verse 1**
I'm walkin' down the street, smellin' something sweet
Followin' the aroma to my favorite treat
A slice of heaven in a box, or so I've been told
Gimme that pizza love, and my heart will be gold
```

3. Remove the inference service

```
% helm uninstall my-llama-1
release "my-llama-1" uninstalled
```
### PVC Example

The PVC example runs in much the same way. It mounts a PVC into the example NIM container where it can be used as a cache (`/opt/nim/.cache`) and configures the PVC to be retained between `helm uninstall` and `helm install`, so the model data only needs to be downloaded on first use.

```
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-pvc examples/basic-llama-pvc

% kubectl get ksvc basic-llama-pvc --watch
```

### Troubleshooting

Users can troubleshoot workloads by inspecting the underlying resources that are created: there should be Deployments, Pods, and ksvcs to describe or view logs from.

## Air-gapped operations

For scenarios in which Run.ai clusters operate in air-gapped (disconnected) environments, see the NVIDIA NIM documentation for [serving models from local assets](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#serving-models-from-local-assets).
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
apiVersion: v2
name: basic-llama-pvc
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
docs/run.ai/examples/basic-llama-pvc/templates/inferenceworkload.yaml (39 additions)
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: basic-llama-pvc
  namespace: {{ .Values.namespace }}
spec:
  name:
    value: basic-llama-pvc
  environment:
    items:
      NGC_API_KEY:
        value: SECRET:ngc-secret-pvc,NGC_API_KEY
  gpu:
    value: "1"
  image:
    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
  minScale:
    value: 1
  maxScale:
    value: 2
  runAsUid:
    value: 1000
  runAsGid:
    value: 1000
  ports:
    items:
      serving-port:
        value:
          container: 8000
          protocol: http
          serviceType: ServingPort
  pvcs:
    items:
      pvc:
        value:
          claimName: nim-cache
          existingPvc: true
          path: /opt/nim/.cache
          readOnly: false
docs/run.ai/examples/basic-llama-pvc/templates/ngc-secret.yaml (8 additions)
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: ngc-secret-pvc
  namespace: {{ .Values.namespace }}
data:
  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
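The `b64enc` template function base64-encodes the key, since Kubernetes Secret `data` fields must hold base64-encoded values. The equivalent shell round trip, using a placeholder key rather than a real one:

```shell
NGC_KEY="example-ngc-key"            # placeholder value, not a real NGC key
encoded=$(printf '%s' "$NGC_KEY" | base64)
echo "$encoded"                      # what lands in the Secret's data field
printf '%s' "$encoded" | base64 -d   # decodes back to the original key
```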
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nim-cache
  namespace: {{ .Values.namespace }}
  annotations:
    helm.sh/resource-policy: "keep"
spec:
  storageClassName: {{ .Values.storageClassName }}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 32Gi
# These can be edited here locally, but should be overridden like so:
# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
namespace: override-with-flag
ngcKey: override-with-flag

## optional to override
storageClassName: standard-rwx
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
apiVersion: v2
name: basic-llama
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
docs/run.ai/examples/basic-llama/templates/inferenceworkload.yaml (27 additions)
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: basic-llama
  namespace: {{ .Values.namespace }}
spec:
  name:
    value: basic-llama
  environment:
    items:
      NGC_API_KEY:
        value: SECRET:ngc-secret,NGC_API_KEY
  gpu:
    value: "1"
  image:
    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
  minScale:
    value: 1
  maxScale:
    value: 2
  ports:
    items:
      serving-port:
        value:
          container: 8000
          protocol: http
          serviceType: ServingPort
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: ngc-secret
  namespace: {{ .Values.namespace }}
data:
  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
# These can be edited here locally, but should be overridden like so:
# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
namespace: override-with-flag
ngcKey: override-with-flag
#!/bin/bash

if [[ -z $LHOST ]]; then
  echo "please provide an LHOST env var"
  exit 1
fi

Q="Write a song about pizza"
MODEL=$(curl -s "http://${LHOST}/v1/models" | jq -r '.data[0]|.id')

curl -s "http://${LHOST}/v1/chat/completions" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "content": "'"${Q}"'",
        "role": "user"
      }
    ],
    "model": "'"${MODEL}"'",
    "max_tokens": 500,
    "top_p": 0.8,
    "temperature": 0.9,
    "seed": '$RANDOM',
    "stream": false,
    "stop": ["hello\n"],
    "frequency_penalty": 1.0
  }' | jq -r '.choices[0]|.message.content'
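The quote pattern in the script above (`'"${Q}"'`) is how shell variables are spliced into an otherwise single-quoted JSON body: the single-quoted string is closed, the variable is expanded inside double quotes, and the single-quoted string is reopened. Isolated for clarity, with a placeholder model id (the script discovers the real one from `/v1/models`):

```shell
Q="Write a song about pizza"
MODEL="meta/llama-3.1-8b-instruct"   # placeholder; normally fetched from /v1/models
body='{"messages":[{"content":"'"${Q}"'","role":"user"}],"model":"'"${MODEL}"'"}'
printf '%s\n' "$body"
# prints: {"messages":[{"content":"Write a song about pizza","role":"user"}],"model":"meta/llama-3.1-8b-instruct"}
```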