Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: change_for_OpenMLDB_Byzer_taxi_of_integration_folder #3469

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
373 changes: 373 additions & 0 deletions docs/en/integration/deploy_integration/OpenMLDB_Byzer_taxi.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,373 @@
# Byzer

This article demonstrates how to use [OpenMLDB](https://github.com/4paradigm/OpenMLDB) and [Byzer](https://www.byzer.org/home) together to accomplish a complete machine-learning application. In this example, OpenMLDB receives instructions and data sent by Byzer, performs real-time feature computation on the data, and returns the processed dataset through feature engineering to Byzer for subsequent machine learning training and inference.

## Preparation

### Install OpenMLDB

1. This example recommends using the OpenMLDB cluster version running in a Docker container. For installation steps, please refer to [OpenMLDB Quickstart](../../quickstart/openmldb_quickstart.md).
2. In this example, although the Byzer engine is on the same host, it needs to access OpenMLDB services from outside the container. Therefore, the service port of the OpenMLDB cluster needs to be exposed. It is recommended to use the `--network host` method, as detailed in the [IP Configuration Documentation - CLI/SDK->containeronebox](../../reference/ip_tips.md#clisdk-containeronebox).
3. For simplicity, we use the file format to import and export OpenMLDB cluster data, so Byzer and OpenMLDB need to share the file path. Here, we map `/mlsql/admin` to `/byzermnt`, and use `/byzermnt` as the file path in SQL commands that interact with OpenMLDB.
4. We also need to create a database named `db1` in the OpenMLDB cluster, and then use this database in Byzer (currently cannot be executed in Byzer, and Byzer must specify a usable db to connect to OpenMLDB).

The command is as follows:
```
docker run --network host -dit --name openmldb -v /mlsql/admin/:/byzermnt 4pdosc/openmldb:0.8.4 bash
docker exec -it openmldb bash
/work/init.sh
echo "create database db1;" | /work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
exit # exit container
```

### Install Byzer Engine and Byzer Notebook

The example uses [Byzer All In One Deployment](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/server/byzer-all-in-one-deployment) and [Byzer Notebook Binary System Installation](https://docs.byzer.org/#/byzer-notebook/zh-cn/installation/install_uninstall) method to install Byzer component.

```{note}
If you only need to install the OpenMLDB plugin offline, you can also use the [Sandbox Container Deployment](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/containerized-deployment/sandbox-standalone), where you can start and install the OpenMLDB plugin offline with just one click.

If you use VSCode, you can also choose the [Byzer Plugin in VSCode] (https://docs.byzer.org/#/byzer-Lang/zh-cn/installation/vscode/byzer-vscode-extension-installation). The plugin includes the built-in Byzer All In One, eliminating the need for manual installation.

For other deployment methods, please refer to the [Byzer Engine Deployment Guidelines] (https://docs.byzer.org/#/byzer-Lang/zh-cn/installation/README).
```

1. Install Byzer All In One
```
wget https://download.byzer.org/byzer/2.3.0/byzer-lang-all-in-one-linux-amd64-3.1.1-2.3.0.tar.gz
tar -zxvf byzer-lang-all-in-one-linux-amd64-3.1.1-2.3.0.tar.gz
cd byzer-lang-all-in-one-linux-amd64-3.1.1-2.3.0
# If you have a Java (jdk8 or higher) environment, you can skip two steps to export
export JAVA_HOME=$(pwd)/jdk8
export PATH=$JAVA_HOME/bin:$PATH
./bin/byzer.sh start
```
You can visit `http://<ip>:9003/`.

2. Install Byzer Notebook. As Byzer Notebook [Requires mysql](https://docs.byzer.org/#/byzer-notebook/zh-cn/installation/prerequisites), if you do not have a MySQL engine, you can start one through Docker mode.
```
docker run -d --name mysql -e MYSQL_ROOT_PASSWORD=root -e MYSQL_ROOT_HOST=% -p 3306:3306 byzer/mysql:8.0-20.04_beta
wget https://download.byzer.org/byzer-notebook/1.2.3/Byzer-Notebook-1.2.3.tar.gz
tar -zxvf Byzer-Notebook-1.2.3.tar.gz
cd Byzer-Notebook-1.2.3
./bin/bootstrap.sh start
```
You can visit `http://<ip>:9002/`. The user id and password are admin/admin. The webpage is shown below. This article uses Byzer Notebook for demonstration.

![Byzer_Notebook](images/Byzer_Notebook.jpg)

### Byzer OpenMLDB Plugin

This example requires the use of the [OpenMLDB Plugin](https://github.com/byzer-org/byzer-extension/tree/master/byzer-openmldb) provided by Byzer to accomplish message delivery with OpenMLDB. We can install it in Byzer Notebook, create a Notebook, add Cells, and execute the following commands:

```
!plugin app add - "byzer-openmldb-3.0";
```

After running Cell, the plugin will be downloaded and installed. The process will take some time.
```{note}
If the installation is not successful or the download is too slow, you can manually download the jar package and then go offline (https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install) for installation and configuration.
```

### Prepare Dataset

This article utilizes the Kaggle taxi driving time dataset. For the sake of demonstration, we will only use a portion of the data, which can be downloaded from [here](https://openmldb.ai/download/taxi_tour_table_train_simple.csv) and then uploaded to Byzer Notebook.

![byzer upload](images/byzer-upload-data.png)

After uploading, you can find it in the Data Catalog-File System of the Byzer Notebook.
```{note}
If you prefer to use the full dataset, you can obtain it from the following website: [Kaggle Taxi Driving Time Prediction Problem](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview). After downloading the dataset locally, it needs to be imported into Byzer Notebook.
```
## Machine Learning Process

By creating a Notebook in Byzer Notebook, you can start writing the entire machine learning process.

### Step 1: Check Dataset
[Prepare Dataset](#prepare-dataset) has been imported into the File System with the path `tmp/upload`. Use the Byzer Lang `load` command to load the data.

```sql
load csv.`tmp/upload/taxi_tour_table_train_simple.csv` where delimiter=","
and header = "true"
as taxi_tour_table_train_simple;
```
After running Cell, you can see the browsing result of the data.
![byzer load result](images/byzer-load-data.png)

### Step 2: OpenMLDB Create Tables and Import Data

To connect to the OpenMLDB engine, we will use plugins. Since the OpenMLDB cluster and Byzer are deployed on the same host, we can access OpenMLDB using the local address `127.0.0.1:2181`. Before running this code block in Byzer Notebook, make sure that the OpenMLDB engine is started.

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='offline';
'''
and `sql-1`='''
SET @@sync_job=true;
'''
and `sql-2`='''
SET @@job_timeout=20000000;
'''
and `sql-3`='''
CREATE TABLE IF NOT EXISTS t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
'''
and `sql-4`='''
LOAD DATA INFILE '/byzermnt/tmp/upload/taxi_tour_table_train_simple.csv'
INTO TABLE t1 options(format='csv',header=true,mode='overwrite');
'''
and db="db1"
and action="ddl";
```
After the task is completed, the Result should have a prompt of `FINISHED`. Please enter the openmldb container to check the job log if it is `FAILED`.

### Step 3: Perform Offline Feature Computation

Usually, this step involves feature design. However, in this example, we will skip the design phase and directly use the features designed in [OpenMLDB + LightGBM: Taxi Travel Time Prediction](../../use_case/taxi_tour_duration_prediction.md) for offline feature computation. The processed dataset will be exported as a local parquet file (parquet format is recommended; CSV load requires additional schema).

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='offline';
'''
and `sql-1`='''
SET @@sync_job=true;
'''
and `sql-2`='''
SET @@job_timeout=20000000;
'''
and `sql-3`='''
SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl ,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/byzermnt/tmp/feature_data' OPTIONS(mode='overwrite', format='parquet');
'''
and db="db1"
and action="ddl";
```
After the task is completed, the Result should show a prompt of `FINISHED`. If it shows `FAILED`, please enter the openmldb container to check the job log. Refresh the Data Catalog of the Byzer Notebook to see the generated feature file path and the `tmp/feature_data` in the File System.

### Step 4: Load Data in Byzer

Load the feature data generated in the previous step into the Byzer environment:
```sql
load parquet.`tmp/feature_data` as feature_data;
```

Convert all int-type fields to double.

```
select *,
cast(passenger_count as double) as passenger_count_d,
cast(pc_cnt as double) as pc_cnt_d,
cast(vendor_cnt as double) as vendor_cnt_d
from feature_data
as new_feature_data;
```

Then merge all the fields into one vector.

```sql
select vec_dense(array(
passenger_count_d,
vendor_sum_pl,
vendor_max_pl,
vendor_min_pl,
vendor_avg_pl,
pc_sum_pl,
pc_max_pl,
pc_min_pl,
pc_avg_pl,
pc_cnt_d,
vendor_cnt
)) as features,cast(trip_duration as double) as label
from new_feature_data
as training_table;

```

### Step 5: Model Training

Use the `train` command with the [Built-In Linear Regression Algorithm]([https://docs.byzer.org/#/byzer-lang/en-us/](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression)) to train the model and save it to the specified path `/model/taxi-trip`.

```sql
train training_table as LinearRegression.`/model/taxi-trip` where
keepVersion="true"
and evaluateTable="training_table"
and `fitParam.0.labelCol`="label"
and `fitParam.0.featuresCol`= "features"
and `fitParam.0.maxIter`="50";
```

```{note}
View the relevant parameters of Byzer's Built-In Linear Regression Model using the command `!show et/params/LinearRegression;`.
```

### Step 6: Feature Deployment

Use `DEPLOY` (must be in online mode) to deploy the feature computation SQL to OpenMLDB. The SQL should be consistent with the offline feature calculation SQL and it will be named as `d1` for deployment.

```{note}
Note: The deployment name for `DEPLOY must be unique. If you want to change the deployment after it has been successful, you will need to change the deployment name or delete the existing deployment d1 before re-deploying.
```

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='online';
'''
and `sql-1`='''
DEPLOY d1 SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl ,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW);
'''
and db="db1"
and action="ddl";
```

### Step 7: Import Online Data

Online real-time prediction often involves importing recent historical data into online storage. In addition to importing data files, real-time data sources can also be accessed in the production environment. For simplicity, this example directly imports the original dataset (real-time prediction uses new real-time data as a request, so there will be no "feature data for prediction training").

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='online';
'''
and `sql-1`='''
SET @@sync_job=true;
'''
and `sql-2`='''
SET @@job_timeout=20000000;
'''
and `sql-3`='''
LOAD DATA INFILE '/byzermnt/tmp/upload/taxi_tour_table_train_simple.csv'
INTO TABLE t1 options(format='csv',mode='append');
'''
and db="db1"
and action="ddl";
```

### Step 8: Model Deployment

Register the previously saved and trained model as a function that can be used directly.

```sql
register LinearRegression.`/model/taxi-trip` as taxi_trip_model_predict;
```

### Step 9: Real-Time Predictive Testing

Typically, real-time feature prediction is driven by real-time data. For the convenience of this demonstration, we will still perform "real-time feature computation + prediction" in the Notebook. We will use the [Python Environment]([https://docs.byzer.org/#/byzer-lang/en-us/](https://docs.byzer.org/#/byzer-lang/zh-cn/python/env)) for real-time feature computation by using the requirements file.

```
pyarrow==4.0.1
ray[default]==1.8.0
aiohttp==3.7.4
pandas>=1.0.5; python_version < '3.7'
pandas>=1.2.0; python_version >= '3.7'
requests
matplotlib~=3.3.4
uuid~=1.30
pyjava
protobuf==3.20.0 # New, if the protobuf version is too high, there will be import ray errors
```
```
pip install -r requirements.txt
```

We will construct "real-time data" and request OpenMLDB to compute real-time features using HTTP. The computed features will be saved as a file and then loaded into the Byzer environment.
```
!python env "PYTHON_ENV=:";
!python conf "runIn=driver";
!python conf "schema=file";
run command as Ray.`` where
inputTable="command"
and outputTable="test_feature"
and code='''
import numpy as np
import os
import pandas as pd
import ray
import requests
import json
from pyjava.api.mlsql import RayContext,PythonContext

ray_context = RayContext.connect(globals(),None)

resp = requests.post('http://127.0.0.1:9080/dbs/db1/deployments/d1', json=json.loads('{"input":[["id0376262", 1, 1467302350000, 1467304896000, 2, -73.873093, 40.774097, -73.926704, 40.856739, "N", 1]], "need_schema":true}'))

res = json.loads(resp.text)["data"]
schema_names = [(col["name"], col["type"]) for col in res["schema"]]
df = pd.DataFrame.from_records(np.array([tuple(res["data"][0])], dtype=schema_names))
df.to_parquet('/mlsql/admin/tmp/test_feature.parquet')

context.build_result([])
''';
```

Convert all int-type fields of the processed online data to double.

```sql
select *,
cast(passenger_count as double) as passenger_count_d,
cast(pc_cnt as double) as pc_cnt_d,
cast(vendor_cnt as double) as vendor_cnt_d
from feature_data_test
as new_feature_data_test;
```

Then perform vectorization.

```sql
select vec_dense(array(
passenger_count_d,
vendor_sum_pl,
vendor_max_pl,
vendor_min_pl,
vendor_avg_pl,
pc_sum_pl,
pc_max_pl,
pc_min_pl,
pc_avg_pl,
pc_cnt_d,
vendor_cnt
)) as features
from new_feature_data_test
as testing_table;
```

Using the processed test set for prediction, the result is the predicted `trip_duration`.

```sql
select taxi_trip_model_predict(testing_table) as predict_label;
```
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading