From a15da7b566bcd3a0427e843cf38e649e4c95bb14 Mon Sep 17 00:00:00 2001
From: irfanpena
Date: Thu, 15 Aug 2024 15:14:44 +0700
Subject: [PATCH] Update the Runtime to use mermaid.js

---
 docs/architecture.md               | 16 +++++++-
 docs/benchmarking-architecture.mdx | 36 +++++++++++++++++-
 docs/cortex-onnx.mdx               | 61 ++++++++++++++++++++++++++++--
 docs/cortex-tensorrt-llm.mdx       | 42 +++++++++++++++++++-
 docs/telemetry-architecture.mdx    | 16 +++++++-
 5 files changed, 163 insertions(+), 8 deletions(-)

diff --git a/docs/architecture.md b/docs/architecture.md
index 01a3b3b..175b895 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -110,7 +110,21 @@ main.ts # Entrypoint
```

## Runtime
-![cortex runtime](/img/docs/cortex-runtime.png)
+```mermaid
sequenceDiagram
    User-)Cortex: "Tell me a joke"
    Cortex->>Model Controller/Service: Pull the Model
    Cortex->>Model Controller/Service: Load the model
    Cortex->>Chat Controller/Service: createChatCompletions()
    Chat Controller/Service -->> Model Entity: findOne()
    Cortex->>Model Entity: Store the model data
    Chat Controller/Service -->> Extension Repository: findAll()

    %% Responses
    Extension Repository ->> Chat Controller/Service: Response stream
    Chat Controller/Service ->> Chat Controller/Service: Formatted response/stream
    Chat Controller/Service ->> User: "Your mama"
```

The sequence diagram above outlines the interactions between various components in the Cortex system during runtime, particularly when handling user requests via a CLI. Here’s a detailed breakdown of the runtime sequence:

diff --git a/docs/benchmarking-architecture.mdx b/docs/benchmarking-architecture.mdx
index 88fe7dd..ce6ae87 100644
--- a/docs/benchmarking-architecture.mdx
+++ b/docs/benchmarking-architecture.mdx
@@ -27,7 +27,41 @@ The benchmarking capability comprises several key components:
5. **Web Interface**: The processed data and benchmarking results are made available to users through the web interface, which is the Cortex CLI. Users can view and convert the benchmarking reports to `JSON` from this interface.

## Implementation
-![benchmark-implementation](/img/docs/benchmark-flow.png)
+```mermaid
sequenceDiagram
    participant User
    participant Command as Command Line
    participant Config as Config Loader
    participant System as System Information
    participant API as OpenAI API
    participant Calc as Calculations
    participant File as File System

    User->>Command: Run script with --config option
    Command->>Config: Validate config path
    Config->>Config: Read and parse YAML file
    Config-->>Command: Return config object
    Command->>System: Get initial system resource
    System-->>Command: Return initial data

    loop Each Benchmark Round (num_rounds)
    Command->>API: Request chat completions
    API-->>Command: Stream responses
    Command->>Calc: Calculate metrics (Token Count, TTFT)
    Calc-->>Command: Return intermediate result

    Command->>System: Get end system resources post API call
    System-->>Command: Return end data
    Command->>Calc: Compute resource change, Throughput, TPOT, Latency, context length
    Calc-->>Command: Return comprehensive metrics
    end
    Command->>Calc: Aggregate result to calculate percentiles (p50, p75, p95)
    Calc-->>Command: Return aggregated metrics
    Command->>File: Write result to JSON (include metrics and hardware changes)
    File-->>User: Save output.json
```

The diagram illustrates the implementation of how the benchmarking works in the Cortex environment:
1. **User**:
   - The user runs the benchmarking command with a `--config` option.

diff --git a/docs/cortex-onnx.mdx b/docs/cortex-onnx.mdx
index c7eaed5..c9e35c6 100644
--- a/docs/cortex-onnx.mdx
+++ b/docs/cortex-onnx.mdx
@@ -97,7 +97,23 @@ These are the main components that interact to provide an API for `inference` ta
### Communication Protocols

#### Load a Model
-![Load Model](/img/docs/onnx-2.png)
+```mermaid
sequenceDiagram
    participant JS as Cortex-JS
    participant CPP as Cortex-CPP
    participant ONNX as Cortex.onnx

    JS->>CPP: HTTP request load model
    CPP->>CPP: Load engine
    CPP->>ONNX: Load model
    ONNX->>ONNX: Cache chat template
    ONNX->>ONNX: Create onnx model
    ONNX->>ONNX: Create tokenizer
    ONNX-->>CPP: Callback
    CPP-->>JS: HTTP response
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine in Cortex:
1. **HTTP Request from cortex-js to cortex-cpp**:
   - `cortex-js` sends an HTTP request to `cortex-cpp` to load the model.

@@ -122,7 +138,28 @@ The diagram above illustrates the interaction between three components: `cortex-
   - `cortex-cpp` sends an HTTP response back to `cortex-js`, indicating that the model has been successfully loaded and is ready for use.

#### Stream Inference
-![Stream Inference](/img/docs/onnx-3.png)
+```mermaid
sequenceDiagram
    participant JS as Cortex-JS
    participant CPP as Cortex-CPP
    participant ONNX as Cortex.onnx

    JS->>CPP: HTTP request chat completion
    CPP->>ONNX: Request chat completion
    ONNX->>ONNX: Apply chat template
    ONNX->>ONNX: Encode
    ONNX->>ONNX: Set search options
    ONNX->>ONNX: Create generator
    loop Wait for done
    ONNX->>ONNX: Compute logits
    ONNX->>ONNX: Generate next token
    ONNX->>ONNX: Decode new token
    ONNX-->>CPP: Callback
    CPP-->>JS: HTTP stream response
    end
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine to call the `chat completions endpoint` with the stream inference option:
1. **HTTP Request from cortex-js to `cortex-cpp`**:

@@ -154,7 +191,25 @@ The diagram above illustrates the interaction between three components: `cortex-
   - `cortex-js` waits until the entire response is received and the process is completed.

#### Non-stream Inference
-![Non-stream Inference](/img/docs/onnx-4.png)
+
+```mermaid
sequenceDiagram
    participant JS as Cortex-JS
    participant CPP as Cortex-CPP
    participant ONNX as Cortex.onnx

    JS->>CPP: HTTP request chat completion
    CPP->>ONNX: Request chat completion
    ONNX->>ONNX: Apply chat template
    ONNX->>ONNX: Encode
    ONNX->>ONNX: Set search options
    ONNX->>ONNX: Create generator
    ONNX->>ONNX: Generate output
    ONNX->>ONNX: Decode output
    ONNX-->>CPP: Callback
    CPP-->>JS: HTTP response
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine to call the `chat completions endpoint` with the non-stream inference option:
1. **HTTP Request from `cortex-js` to `cortex-cpp`**:
diff --git a/docs/cortex-tensorrt-llm.mdx b/docs/cortex-tensorrt-llm.mdx
index 07aad8d..3d08c8a 100644
--- a/docs/cortex-tensorrt-llm.mdx
+++ b/docs/cortex-tensorrt-llm.mdx
@@ -119,7 +119,24 @@ These are the main components that interact to provide an API for `inference` ta
### Communication Protocols

#### Load a Model
-![Load Model](/img/docs/trt-2.png)
+```mermaid
sequenceDiagram
    participant JS as Cortex-JS
    participant CPP as Cortex-CPP
    participant TRT as Cortex.tensorrt-llm

    JS->>CPP: HTTP request load model
    CPP->>CPP: Load model
    CPP->>TRT: Load model
    TRT->>TRT: Cache chat template
    TRT->>TRT: Create tokenizer
    TRT->>TRT: Load config
    TRT->>TRT: Initialize GPT Session
    TRT-->>CPP: Callback
    CPP-->>JS: HTTP response
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.tensorrt-llm` when using the `tensorrt-llm` engine in Cortex:
1. **HTTP Request Load Model (cortex-js to cortex-cpp)**:

@@ -151,7 +168,28 @@ The diagram above illustrates the interaction between three components: `cortex-
#### Inference
-![Inference](/img/docs/trt-3.png)
+```mermaid
sequenceDiagram
    participant JS as Cortex-JS
    participant CPP as Cortex-CPP
    participant TRT as Cortex.tensorrt-llm

    JS->>CPP: HTTP request chat completion
    CPP->>TRT: Request chat completion
    TRT->>TRT: Apply chat template
    TRT->>TRT: Encode
    TRT->>TRT: Set sampling config
    TRT->>TRT: Create generator input/output
    loop Wait for done
    TRT->>TRT: Copy new token from GPU
    TRT->>TRT: Decode new token
    TRT-->>CPP: Callback
    CPP-->>JS: HTTP stream response
    end
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.tensorrt-llm` when using the `tensorrt-llm` engine to call the `chat completions endpoint` with the inference option:
1. **HTTP Request Chat Completion (cortex-js to cortex-cpp)**:

diff --git a/docs/telemetry-architecture.mdx b/docs/telemetry-architecture.mdx
index 67ffb55..f22962e 100644
--- a/docs/telemetry-architecture.mdx
+++ b/docs/telemetry-architecture.mdx
@@ -154,7 +154,21 @@ The telemetry capability comprises several key components:
}
```

## Implementation
-![catch-crash.png](/img/docs/telemetry2.png)
+```mermaid
sequenceDiagram
    participant User
    participant Server as Cortex Server Main Process
    participant Child as Child Process
    participant Dir as Local Dir

    User->>Server: Command/Request
    Server->>Child: Fork child process
    Server-->>Server: Catch uncaughtException, unhandledRejection event from process
    Server->>Dir: Fork child process
    Child->>Server: Interval ping to main process
    Child->>Dir: Insert crash report
```

The diagram illustrates the implementation of handling crashes in the Cortex server environment:
1. **User Interaction**
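
The `mermaid` fences introduced by this patch only render as diagrams when the docs toolchain has Mermaid support enabled. Below is a minimal configuration sketch, assuming the site is built with Docusaurus (an assumption based on the `.mdx` sources; the title and URL values are placeholders, not taken from this repository):

```ts
// docusaurus.config.ts -- minimal sketch, assuming the docs site is built with Docusaurus.
// `markdown.mermaid` and `@docusaurus/theme-mermaid` are standard Docusaurus options;
// the title/url values below are placeholders.
import type {Config} from '@docusaurus/types';

const config: Config = {
  title: 'Cortex Docs',        // placeholder
  url: 'https://example.com',  // placeholder
  baseUrl: '/',
  markdown: {
    mermaid: true,             // parse ```mermaid fences in .md/.mdx files
  },
  themes: ['@docusaurus/theme-mermaid'],  // renders the parsed fences with mermaid.js
};

export default config;
```

With this enabled, the diagrams are rendered client-side by mermaid.js, so the PNGs this patch removes from `/img/docs/` no longer need to be regenerated whenever a flow changes.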