This document describes how to use components such as Paddle Operator and ElasticServing to build a pipeline for the Wide & Deep model, covering data preparation, model training, and model serving.
The Wide & Deep model is a recommendation framework published by Google in 2016. The model code used in this demo comes from the PaddleRec project.
This demo uses the Criteo dataset provided by the Display Advertising Challenge. We have stored the data in a public bucket of Baidu Object Storage (BOS): bos://paddleflow-public.hkg.bcebos.com/criteo.
Using the cache component of Paddle Operator to cache sample data locally can speed up model training jobs.
Please refer to the quick start document to install Paddle Operator.
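If the operator is not installed yet, the installation boils down to applying its CRD and controller manifests. A minimal sketch, assuming the manifests still live under deploy/v1/ in the PaddleFlow/paddle-operator repository (check the quick start document for the authoritative paths):

# Install the Paddle Operator CRDs and controller
# (the manifest paths are assumptions; see the quick start document)
kubectl apply -f https://raw.githubusercontent.com/PaddleFlow/paddle-operator/main/deploy/v1/crd.yaml
kubectl apply -f https://raw.githubusercontent.com/PaddleFlow/paddle-operator/main/deploy/v1/operator.yaml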
In this example, we use the JuiceFS CSI plugin, which is also part of Paddle Operator, to store data and models. This example uses BOS as the storage backend; you can also use any other object storage supported by JuiceFS. Create a Secret as follows. The cache component of Paddle Operator needs the access-key / secret-key to access the object storage, and metaurl is the access link of the metadata storage engine.
# criteo-secret.yaml
apiVersion: v1
data:
  access-key: xxx
  bucket: xxx
  metaurl: xxx
  name: Y3JpdGVv
  secret-key: xxx
  storage: Ym9z
kind: Secret
metadata:
  name: criteo
  namespace: paddle-system
type: Opaque
Note: each field under data needs to be base64-encoded.
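For example, the name and storage values above are the base64 encodings of criteo and bos. You can produce them in a shell (the -n flag keeps the trailing newline out of the encoded value) and then apply the manifest:

# Encode the plain-text values
echo -n "criteo" | base64   # Y3JpdGVv
echo -n "bos" | base64      # Ym9z
# Create the Secret
kubectl apply -f criteo-secret.yaml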
The cache component of Paddle Operator provides an abstraction over sample data sets through the SampleSet CRD. Create the following SampleSet and wait for the data to be synchronized. The nodeAffinity field in the configuration specifies which nodes the data should be cached on; for example, you can cache the data on nodes with GPU devices.
# criteo-sampleset.yaml
apiVersion: batch.paddlepaddle.org/v1alpha1
kind: SampleSet
metadata:
  name: criteo
  namespace: paddle-system
spec:
  # Partitions of cache data
  partitions: 1
  source:
    # URI of the sample data source
    uri: bos://paddleflow-public.hkg.bcebos.com/criteo
    # Secret to access the data source
    secretRef:
      name: criteo-source
  # Secret used by the cache component to access the cache storage
  secretRef:
    name: criteo
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: beta.kubernetes.io/instance-gpu
          operator: In
          values:
          - "true"
Since the sample data is 22 GiB, synchronization may take a while after the SampleSet is created; model training can start once the SampleSet's status changes to Ready.
$ kubectl apply -f criteo-sampleset.yaml
sampleset.batch.paddlepaddle.org/criteo created
$ kubectl get sampleset criteo -n paddle-system
NAME     TOTAL SIZE   CACHED SIZE   AVAIL SPACE   RUNTIME   PHASE   AGE
criteo   22 GiB       22 GiB        9.4 GiB       1/1       Ready   2d6h
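If you prefer not to poll manually, you can block until the phase flips to Ready. A minimal sketch, assuming kubectl v1.23+ (the first release whose kubectl wait accepts JSONPath conditions) and that the PHASE column is surfaced at .status.phase:

# Block until the SampleSet is Ready; syncing 22 GiB can take a while
kubectl wait sampleset/criteo -n paddle-system \
  --for=jsonpath='{.status.phase}'=Ready --timeout=2h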
This step creates the PV and PVC resources used to store the models. We create the PV and PVC with JuiceFS CSI, and the storage backend is still BOS. You can also use other CSI plugins, such as Ceph or GlusterFS.
Create the PV:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-center
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Pi
  csi:
    driver: csi.juicefs.com
    fsType: juicefs
    # Secret to access remote object storage
    nodePublishSecretRef:
      name: criteo
      namespace: paddle-system
    # Bucket of object storage; the model files are stored in this path
    volumeHandle: model-center
  persistentVolumeReclaimPolicy: Retain
  storageClassName: model-center
  volumeMode: Filesystem
Create the PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-center
  namespace: paddle-system
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Pi
  storageClassName: model-center
  volumeMode: Filesystem
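Apply both manifests and confirm the claim binds. The file names below are illustrative:

kubectl apply -f model-center-pv.yaml -f model-center-pvc.yaml
# The PVC STATUS should show Bound once it matches the PV
kubectl get pvc model-center -n paddle-system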
The model training script for this example comes from the PaddleRec project, and the image is available at registry.baidubce.com/paddleflow-public/paddlerec:2.1.0-gpu-cuda10.2-cudnn7.
This example uses the Collective mode for training, so GPU devices are required.
You can also use Parameter Server mode for model training; the CPU version of the PaddleRec image is available at registry.baidubce.com/paddleflow-public/paddlerec:2.1.0.
Each model in the PaddleRec project has a config file. In this demo we use a ConfigMap to store the configuration and mount it into the containers of the PaddleJob, which makes modifying the configuration more convenient. Please refer to the document: PaddleRec config.yaml Configuration Instructions.
# wide_deep_config.yaml
# global settings
runner:
  train_data_dir: "/mnt/criteo/slot_train_data_full"
  train_reader_path: "criteo_reader"  # importlib format
  use_gpu: True
  use_auc: True
  train_batch_size: 4096
  epochs: 4
  print_interval: 10
  model_save_path: "/mnt/model"
  test_data_dir: "/mnt/criteo/slot_test_data_full"
  infer_reader_path: "criteo_reader"  # importlib format
  infer_batch_size: 4096
  infer_load_path: "/mnt/model"
  infer_start_epoch: 0
  infer_end_epoch: 4
  use_inference: True
  save_inference_feed_varnames: ["C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26","dense_input"]
  save_inference_fetch_varnames: ["sigmoid_0.tmp_0"]
  # use fleet
  use_fleet: True

# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.001
    strategy: async
  # user-defined <key, value> pairs
  sparse_inputs_slots: 27
  sparse_feature_number: 1000001
  sparse_feature_dim: 9
  dense_input_dim: 13
  fc_sizes: [512, 256, 128, 32]
  distributed_embedding: 0
Create a ConfigMap named wide-deep-config:
kubectl create configmap wide-deep-config -n paddle-system --from-file=wide_deep_config.yaml
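You can verify that the config file landed in the ConfigMap before the PaddleJob mounts it:

# The Data section should list wide_deep_config.yaml
kubectl describe configmap wide-deep-config -n paddle-system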
PaddleJob is a custom resource in the Paddle Operator project, used to define Paddle training jobs.
# wide-deep.yaml
apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
  name: wide-deep
  namespace: paddle-system
spec:
  cleanPodPolicy: Never
  sampleSetRef:
    name: criteo
    mountPath: /mnt/criteo
  worker:
    replicas: 1
    template:
      spec:
        containers:
        - name: paddlerec
          image: registry.baidubce.com/paddleflow-public/paddlerec:2.1.0-gpu-cuda10.2-cudnn7
          workingDir: /home/PaddleRec/models/rank/wide_deep
          command: ["/bin/bash", "-c", "cp /mnt/config/wide_deep_config.yaml . && mkdir -p /mnt/model/wide-deep && python -m paddle.distributed.launch --log_dir /mnt/model/log --gpus '0,1' ../../../tools/trainer.py -m wide_deep_config.yaml"]
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /mnt/config
            name: config-volume
          - mountPath: /mnt/model
            name: model-volume
          resources:
            limits:
              nvidia.com/gpu: 2
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: config-volume
          configMap:
            name: wide-deep-config
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-center
Create the PaddleJob and check its status:
$ kubectl create -f wide-deep.yaml
$ kubectl get paddlejob wide-deep -n paddle-system
NAME        STATUS    MODE         WORKER   AGE
wide-deep   Running   Collective   2/2      2m
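To follow training progress, tail the worker pod's logs. The pod name below assumes Paddle Operator's usual {job-name}-worker-{index} naming; list the pods first if it differs:

kubectl get pods -n paddle-system
# Follow the logs of the first worker (pod name is an assumption)
kubectl logs -f wide-deep-worker-0 -n paddle-system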
After the PaddleJob finishes, you may notice that the model files are stored in numeric directories, such as 0/. During training, Paddle saves a checkpoint after each epoch, so you should use the model files in the directory with the largest number to deploy the service. Before deploying model serving, you need to convert the model files into the format that Paddle Serving can use. We have already placed the model files in the bucket: https://paddleflow-public.hkg.bcebos.com/models/wide-deep/wide-deep.tar.gz
The directory structure:
.
├── rec_inference.pdiparams
├── rec_inference.pdmodel
├── rec_static.pdmodel
├── rec_static.pdopt
└── rec_static.pdparams
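If you want to convert the model locally rather than inside the serving container, the same convert command that the PaddleService below runs can be used directly against this directory; by default it writes serving_server/ and serving_client/ directories into the current working directory:

# Convert the inference model into the Paddle Serving format; produces
# the serving_server/ and serving_client/ directories
python3 -m paddle_serving_client.convert --dirname wide-deep/ \
    --model_filename rec_inference.pdmodel \
    --params_filename rec_inference.pdiparams

ElasticServing provides the PaddleService CRD; the arg field below downloads the model package, converts it, and starts the serving process.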
apiVersion: elasticserving.paddlepaddle.org/v1
kind: PaddleService
metadata:
  name: wide-deep-serving
  namespace: paddleservice-system
spec:
  default:
    arg: wget https://paddleflow-public.hkg.bcebos.com/models/wide-deep/wide-deep.tar.gz &&
      tar xzf wide-deep.tar.gz && rm -rf wide-deep.tar.gz &&
      python3 -m paddle_serving_client.convert --dirname wide-deep/ --model_filename rec_inference.pdmodel --params_filename rec_inference.pdiparams &&
      python3 -m paddle_serving_server.serve --model serving_server --port 9292
    containerImage: registry.baidubce.com/paddleflow-public/serving
    port: 9292
    tag: v0.6.2
  runtimeVersion: paddleserving
  service:
    minScale: 1
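Save the manifest (the file name wide-deep-serving.yaml is illustrative) and create the PaddleService:

kubectl apply -f wide-deep-serving.yaml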
# Check services in namespace paddleservice-system
kubectl get svc -n paddleservice-system | grep wide-deep-serving
# Check the knative service in namespace paddleservice-system
kubectl get ksvc -n paddleservice-system | grep wide-deep-serving
# Check pods in namespace paddleservice-system
kubectl get pods -n paddleservice-system
# Obtain ClusterIP
kubectl get svc wide-deep-serving-default-private -n paddleservice-system
The model service supports HTTP / BRPC / GRPC clients; refer to Serving to obtain the client code. Note that you need to replace the service IP address and port in the client code with the cluster-ip and port of the wide-deep-serving-default-private service mentioned above.