release v1.3.6
- Bump cntoolkit to 2.7.0
- Support MLU370 devices

Signed-off-by: renxiang <[email protected]>
renxiang committed Oct 19, 2022
1 parent 9051eea commit c3af27a
Showing 47 changed files with 4,325 additions and 1,705 deletions.
6 changes: 3 additions & 3 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.

image: 192.168.100.44:5001/cambricon/buildpack:20200915
image: 10.3.68.2:5001/cambricon/buildpack:20210507
variables:
GOPROXY: http://192.168.100.44:8080
GOPROXY: http://10.3.68.2:8080

.only-mr-refs: &only-mr-refs
refs:
Expand All @@ -31,7 +31,7 @@ include:

run-shellcheck-lint:
stage: lint
image: 192.168.100.44:5001/cambricon/shellcheck-alpine:v0.7.0
image: 10.3.68.2:5001/cambricon/shellcheck-alpine:v0.7.0
script:
- find . -name '*.sh' -exec shellcheck {} +
only:
2 changes: 2 additions & 0 deletions device-plugin/.gitignore
Expand Up @@ -3,3 +3,5 @@ k8s-device-plugin
mock_test
image
*.tar*
cntopo
!cntopo/
2 changes: 1 addition & 1 deletion device-plugin/.gitlab-ci.yml
Expand Up @@ -42,7 +42,7 @@ run-device-plugin-test:
run-device-plugin-integration:
extends: .run-device-plugin-mr
variables:
APT_PROXY: http://192.168.100.44:3142
APT_PROXY: http://10.3.68.2:3142
stage: test
tags:
- shell47
23 changes: 22 additions & 1 deletion device-plugin/CHANGELOG.md
@@ -1,8 +1,29 @@
# Changelog

## v1.3.6

- Bump cntoolkit to 2.7.0

## v1.3.5

- Use cntopo to implement topology-aware mode
- Turn device to healthy when it recovers from unhealthy
- Refactor cndev dl implementation from c to go
- Bump cntoolkit to 2.6.0

## v1.3.4

- Add mlu-share mode
- Refactor code to mlu pkg

## v1.3.3

- Fix uuid \x00 suffix
- Support cncodec device dynamic mount

## v1.3.2

- Support MLU365-D2 devices
- Support new devices

## v1.3.1

1 change: 1 addition & 0 deletions device-plugin/Dockerfile
Expand Up @@ -30,4 +30,5 @@ FROM $BASE_IMAGE
ARG TARGETPLATFORM
COPY --from=build /work/k8s-device-plugin /usr/bin/
COPY libs/$TARGETPLATFORM/libcndev.so /usr/lib
COPY libs/$TARGETPLATFORM/cntopo /usr/bin
CMD ["/usr/bin/k8s-device-plugin"]
5 changes: 4 additions & 1 deletion device-plugin/Makefile
Expand Up @@ -20,11 +20,14 @@ ifeq ($(GOARCH), arm64)
export CC=aarch64-linux-gnu-gcc
endif

generate:
mockgen -package mock -destination pkg/cntopo/mock/cntopo.go -mock_names=Cntopo=Cntopo github.com/Cambricon/cambricon-k8s-device-plugin/device-plugin/pkg/cntopo Cntopo

lint:
golangci-lint run -v

build:
go build -trimpath -ldflags="-s -w" -o k8s-device-plugin .
go build -trimpath -ldflags="-s -w" -o k8s-device-plugin cmd/main.go

test: go-test mock-test

35 changes: 22 additions & 13 deletions device-plugin/README.md
Expand Up @@ -14,13 +14,14 @@ This repository contains Cambricon's official implementation of the Kubernetes d

The prerequisites for running the Cambricon Device Plugin:

- MLU270, MLU270-X5K, MLU220, MLU290, MLU370, MLU365-D2 devices
- MLU driver >= 4.15.2
- cntoolkit on your building machine >= 2.4.0
- MLU270, MLU270-X5K, MLU220, MLU290, MLU370 devices
- MLU driver >= 4.15.11
- cntoolkit on your building machine >= 2.7.0
- cncl on your building machine >= 1.0.1

For MLU driver version < 4.15.2, please use [release v1.1.3].
For MLU driver version 4.9.x, please use [release v1.1.3].

For Kubernetes version < 1.19.0, MLU290 mlulink topology awareness can not be used. If you want to use this feature, make sure your Kubernetes version >= 1.19.0.
For Kubernetes version < 1.19.0, mlulink topology-aware mode can not be used. If you want to use this feature, make sure your Kubernetes version >= 1.19.0.

## Quick Start

Expand All @@ -44,6 +45,7 @@ Set the following environment variables if you need.
| ARCH | target platform architecture, amd64 or arm64, amd64 by default |
| LIBCNDEV | absolute path of the libcndev.so binary, neuware installation path by default |
| BASE_IMAGE | device plugin base image |
| CNTOPO | absolute path of the cntopo binary, neuware installation path by default |

Docker should be >= 17.05.0 on your building machine. If you want to cross build, make sure docker version >= 19.03.

Expand All @@ -59,8 +61,8 @@ For arm64:
ARCH=arm64 GOPROXY=https://goproxy.cn ./build_image.sh
```

Please make sure Cambricon neuware is installed in your compiling environment.
It uses **libcndev.so** binary on your compiling machine and generates docker image in folder `./image`.
Please make sure Cambricon neuware and cncl are installed in your compiling environment.
It uses the **libcndev.so** and **cntopo** binaries on your compiling machine and generates the docker image in folder `./image`.

### Enabling MLU Support in Kubernetes

Expand All @@ -76,7 +78,7 @@ It uses **libcndev.so** binary on your compiling machine and generates docker im

```yaml
args:
- --mode=default #device plugin mode: default, sriov, env-share or topology-aware
- --mode=default #device plugin mode: default, sriov, env-share, mlu-share or topology-aware
- --virtualization-num=1 # virtualization number for each MLU, used only in sriov mode or env-share mode
- --mlulink-policy=best-effort # MLULink topology policy: best-effort, guaranteed or restricted, used only in topology-aware mode
- --cnmon-path=/usr/bin/cnmon # host machine cnmon path, must be absolute path. comment out this line to avoid mounting cnmon.
Expand All @@ -91,13 +93,14 @@ It uses **libcndev.so** binary on your compiling machine and generates docker im
- sriov: supports SR-IOV. Set `virtualization-num` as number of VFs on host.
- env-share: a whole card can be allocated into multiple containers. A container should use only one card in this mode.
Set `virtualization-num` as maximum number of containers one MLU can be allocated into.
- topology-aware: device plugin is aware of MLULink topology and tries to allocate MLUs forming a cycle. Set `mlulink-policy` as described below. **Only supports MLU290 for now.**
- mlu-share: mlu resources are allocated by memory. **Only works when deployed along with Cambricon MLU Scheduler Extender.**
- topology-aware: device plugin is aware of MLULink topology and tries to allocate MLUs forming a ring. Set `mlulink-policy` as described below.

MLULink topology policies (guaranteed and restricted work only when 1, 2, 4, or 8 MLUs are requested):

- best-effort: allocate devices forming maximum number of cycles whenever possible
- guaranteed: allocated devices must form at least one cycle, otherwise returns an error
- restricted: for 2 MLUs, allocated devices must have 2 mlulinks, and for 4 MLUs, allocated devices must on one mother board, otherwise returns an error
- best-effort: allocate devices forming maximum number of rings whenever possible
- guaranteed: allocated devices must form at least one ring, otherwise return error
- restricted: for 2 MLUs and MLU290/MLU370-M8 4 MLUs, allocated devices must have 2 mlulink rings, otherwise return error
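The request-size rule stated above can be sketched as a small check. This is a minimal illustration, not the plugin's code: `validateRequest` and its error messages are hypothetical, and treating best-effort as accepting any request size is an assumption drawn from the sentence that introduces the policies.

```go
package main

import "fmt"

// validateRequest is a hypothetical helper (not part of the device plugin).
// It mirrors the README rule that the guaranteed and restricted policies
// only apply when 1, 2, 4 or 8 MLUs are requested.
func validateRequest(policy string, requested int) error {
	switch policy {
	case "best-effort":
		// Assumption: best-effort places no limit on the request size.
		return nil
	case "guaranteed", "restricted":
		switch requested {
		case 1, 2, 4, 8:
			return nil
		}
		return fmt.Errorf("policy %q supports only 1, 2, 4 or 8 MLUs, got %d", policy, requested)
	default:
		return fmt.Errorf("unknown mlulink policy %q", policy)
	}
}

func main() {
	for _, n := range []int{2, 3} {
		if err := validateRequest("guaranteed", n); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", n)
	}
}
```

Whether a valid request can actually be satisfied still depends on the MLULink rings available on the node, which this sketch does not model.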

```shell
kubectl create -f examples/cambricon-device-plugin-daemonset.yaml
Expand All @@ -106,12 +109,18 @@ It uses **libcndev.so** binary on your compiling machine and generates docker im
(Optional) If you do not want the daemonset way of deployment, edit the static pod template in examples folder and
put the file into your configured static pod folder (`/etc/kubernetes/manifests` by default).

3. If you want to use MLU290 topology-aware mode under guaranteed or restricted policy, enable device plugin to get and update nodes.
3. If you want to use topology-aware mode or mlu-share mode, enable device plugin to get and update nodes.

```shell
kubectl create -f examples/cambricon-device-plugin-rbac.yaml
```

Then add the service account to the device plugin spec, as in this example:

```yaml
serviceAccount: cambricon-device-plugin
```
### Running MLU Jobs
Cambricon MLUs can now be consumed via container level resource requirements using the resource name `cambricon.com/mlu`:
16 changes: 15 additions & 1 deletion device-plugin/build_image.sh
Expand Up @@ -16,9 +16,10 @@
curpath=$(dirname "$0")
cd "$curpath" || exit 1

: "${TAG:=v1.3.2}"
: "${TAG:=v1.3.6}"
: "${ARCH:=amd64}"
: "${LIBCNDEV:=/usr/local/neuware/lib64/libcndev.so}"
: "${CNTOPO:=/usr/local/neuware/bin/cntopo}"

case $(awk -F= '/^NAME/{print $2}' /etc/os-release) in
"CentOS Linux")
Expand All @@ -35,6 +36,7 @@ echo "LIBCNDEV = $LIBCNDEV"
echo "APT_PROXY = $APT_PROXY"
echo "GOPROXY = $GOPROXY"
echo "BASE_IMAGE = $BASE_IMAGE"
echo "CNTOPO = $CNTOPO"

case $(uname -m) in
x86_64)
Expand All @@ -58,6 +60,12 @@ if [[ ! -f "$LIBCNDEV" ]]; then
exit 1
fi

if [[ ! -f "$CNTOPO" ]]; then
echo "Can't find cntopo at $CNTOPO."
echo "Please install Cambricon cncl, or set CNTOPO environ to path of cntopo"
exit 1
fi

case $ARCH in
amd64)
file_arch=x86-64
Expand All @@ -75,8 +83,13 @@ if ! file "$LIBCNDEV" --dereference | grep -q "$file_arch"; then
echo "$LIBCNDEV is not for $ARCH"
exit 1
fi
if ! file "$CNTOPO" --dereference | grep -q "$file_arch"; then
echo "$CNTOPO is not for $ARCH"
exit 1
fi

cp "$LIBCNDEV" "$curpath/libs/linux/$ARCH/libcndev.so"
cp "$CNTOPO" "$curpath/libs/linux/$ARCH/cntopo"

echo "Building Cambricon device plugin docker image."

Expand Down Expand Up @@ -104,3 +117,4 @@ fi

echo "Image is saved at ./image/cambricon-k8s-device-plugin-$ARCH.tar"
rm -f "$curpath/libs/linux/$ARCH/libcndev.so"
rm -f "$curpath/libs/linux/$ARCH/cntopo"
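The `: "${CNTOPO:=...}"` line in `build_image.sh` falls back to a default path when the variable is unset or empty, and the script then exits if the file is missing. For readers more at home in Go, the same fallback-then-check pattern might look like the sketch below. `envOrDefault` is a hypothetical helper and is not part of this repository; only the default path is taken from the script.

```go
package main

import (
	"fmt"
	"os"
)

// envOrDefault is a hypothetical helper: it returns the value of key,
// or def when the variable is unset or empty (the same cases that the
// shell expansion ${VAR:=default} treats as "missing").
func envOrDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func main() {
	// Default path copied from build_image.sh.
	cntopo := envOrDefault("CNTOPO", "/usr/local/neuware/bin/cntopo")

	// Mirror the script's existence check before using the binary.
	if _, err := os.Stat(cntopo); err != nil {
		fmt.Printf("Can't find cntopo at %s\n", cntopo)
		return
	}
	fmt.Println("CNTOPO =", cntopo)
}
```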
11 changes: 7 additions & 4 deletions device-plugin/main.go → device-plugin/cmd/main.go
Expand Up @@ -21,13 +21,14 @@ import (
"syscall"

"github.com/Cambricon/cambricon-k8s-device-plugin/device-plugin/pkg/cndev"
"github.com/Cambricon/cambricon-k8s-device-plugin/device-plugin/pkg/mlu"
"github.com/fsnotify/fsnotify"
pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func main() {

options := ParseFlags()
options := mlu.ParseFlags()

log.Println("Loading CNDEV")
if err := cndev.Init(); err != nil {
Expand All @@ -39,7 +40,9 @@ func main() {

log.Println("Fetching devices.")
n, err := cndev.GetDeviceCount()
check(err)
if err != nil {
log.Panicf("Failed to get device count. err: %v", err)
}
if n == 0 {
log.Println("No devices found. Waiting indefinitely.")
select {}
Expand All @@ -56,14 +59,14 @@ func main() {
log.Println("Starting OS watcher.")
sigs := startOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

var devicePlugin *CambriconDevicePlugin
var devicePlugin *mlu.CambriconDevicePlugin

restart:
if devicePlugin != nil {
devicePlugin.Stop()
}
startErr := make(chan struct{})
devicePlugin = NewCambriconDevicePlugin(options)
devicePlugin = mlu.NewCambriconDevicePlugin(options)
if err := devicePlugin.Serve(); err != nil {
log.Printf("serve device plugin err: %v, restarting.", err)
close(startErr)
5 changes: 3 additions & 2 deletions device-plugin/examples/cambricon-device-plugin-daemonset.yaml
Expand Up @@ -43,13 +43,14 @@ spec:
- key: cambricon.com/mlu
operator: Exists
effect: NoSchedule
#serviceAccount: cambricon-device-plugin # uncomment to add rbac
containers:
- image: cambricon-k8s-device-plugin:v1.3.2
- image: cambricon-k8s-device-plugin:v1.3.6
name: cambricon-device-plugin-ctr
command:
- /usr/bin/k8s-device-plugin
args:
- --mode=default #device plugin mode: default, sriov, env-share or topology-aware
- --mode=default #device plugin mode: default, sriov, env-share, mlu-share or topology-aware
- --virtualization-num=1 # virtualization number for each MLU, used only in sriov mode or env-share mode
- --mlulink-policy=best-effort # MLULink topology policy: best-effort, guaranteed or restricted, used only in topology-aware mode
- --cnmon-path=/usr/bin/cnmon # host machine cnmon path, must be absolute path. comment out this line to avoid mounting cnmon.
19 changes: 18 additions & 1 deletion device-plugin/examples/cambricon-device-plugin-rbac.yaml
Expand Up @@ -24,6 +24,17 @@ rules:
verbs:
- get
- update
- list
- patch
- apiGroups:
- ""
resources:
- pods
verbs:
- update
- patch
- get
- list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
Expand All @@ -35,5 +46,11 @@ roleRef:
name: cambricon-device-plugin
subjects:
- kind: ServiceAccount
name: default
name: cambricon-device-plugin
namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: cambricon-device-plugin
namespace: kube-system
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,14 @@ metadata:
name: cambricon-device-plugin-static-pod
namespace: kube-system
spec:
#serviceAccount: cambricon-device-plugin # uncomment to add rbac
containers:
- image: cambricon-k8s-device-plugin:v1.3.2
- image: cambricon-k8s-device-plugin:v1.3.6
name: cambricon-device-plugin-ctr
command:
- /usr/bin/k8s-device-plugin
args:
- --mode=default #device plugin mode: default, sriov, env-share or topology-aware
- --mode=default #device plugin mode: default, sriov, env-share, mlu-share or topology-aware
- --virtualization-num=1 # virtualization number for each MLU, used only in sriov mode or env-share mode
- --mlulink-policy=best-effort # MLULink topology policy: best-effort, guaranteed or restricted, used only in topology-aware mode
- --cnmon-path=/usr/bin/cnmon # host machine cnmon path, must be absolute path. comment out this line to avoid mounting cnmon.
3 changes: 3 additions & 0 deletions device-plugin/go.mod
Expand Up @@ -4,7 +4,10 @@ go 1.13

require (
github.com/fsnotify/fsnotify v1.4.9
github.com/golang/mock v1.3.1
github.com/jessevdk/go-flags v1.4.0
github.com/onsi/ginkgo v1.11.0
github.com/onsi/gomega v1.7.0
github.com/stretchr/testify v1.4.0
golang.org/x/net v0.0.0-20200822124328-c89045814202 // indirect
google.golang.org/grpc v1.31.1
1 change: 1 addition & 0 deletions device-plugin/go.sum
Expand Up @@ -82,6 +82,7 @@ github.com/golang/groupcache v0.0.0-20190702054246-869f871628b6/go.mod h1:cIg4er
github.com/golang/groupcache v0.0.0-20191227052852-215e87163ea7/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
github.com/golang/mock v1.1.1/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A=
github.com/golang/mock v1.2.0/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A=
github.com/golang/mock v1.3.1 h1:qGJ6qTW+x6xX/my+8YUVl4WNpX9B7+/l2tRsHGZ7f2s=
github.com/golang/mock v1.3.1/go.mod h1:sBzyDLLjw3U8JLTeZvSv8jJB+tU5PVekmnlKIyFUx0Y=
github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/golang/protobuf v1.3.1/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=