
Merge pull request #180 from janhq/pena-patch
Update the Runtime to use mermaid.js
irfanpena authored Aug 16, 2024
2 parents 6fc799c + a15da7b commit 603a62d
Showing 5 changed files with 163 additions and 8 deletions.
16 changes: 15 additions & 1 deletion docs/architecture.md
@@ -110,7 +110,21 @@ main.ts # Entrypoint
```

## Runtime
![cortex runtime](/img/docs/cortex-runtime.png)
```mermaid
sequenceDiagram
User-)Cortex: "Tell me a joke"
Cortex->>Model Controller/Service: Pull the Model
Cortex->>Model Controller/Service: Load the model
Cortex->>Chat Controller/Service: createChatCompletions()
Chat Controller/Service -->> Model Entity: findOne()
Cortex->>Model Entity: Store the model data
Chat Controller/Service -->> Extension Repository: findAll()
%% Responses
Extension Repository ->> Chat Controller/Service: Response stream
Chat Controller/Service ->> Chat Controller/Service: Formatted response/stream
Chat Controller/Service ->> User: "Your mama"
```

The sequence diagram above outlines the interactions between various components in the Cortex system during runtime, particularly when handling user requests via a CLI. Here’s a detailed breakdown of the runtime sequence:
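To make the flow concrete, here is a minimal TypeScript sketch of the same sequence. The interface and method names are illustrative assumptions and may not match the actual cortex-js controllers, services, and entities.

```typescript
// A minimal, hypothetical sketch of the runtime sequence above. Interface and
// method names are illustrative assumptions; the actual cortex-js controllers,
// services, and entities may differ.

interface ModelEntity { id: string; engine: string; }

interface ModelService {
  findOne(id: string): Promise<ModelEntity | null>;  // Model Entity lookup
  pull(id: string): Promise<void>;                   // download model artifacts
  load(id: string): Promise<void>;                   // load the model into its engine
}

interface EngineExtension {
  name: string;
  inference(req: {
    model: string;
    messages: { role: string; content: string }[];
  }): AsyncIterable<string>;
}

interface ExtensionRepository {
  findAll(): Promise<EngineExtension[]>;             // registered engine extensions
}

// Mirrors the diagram: pull and load the model, resolve its engine extension,
// then format the response stream before returning it to the user.
async function createChatCompletions(
  models: ModelService,
  extensions: ExtensionRepository,
  modelId: string,
  prompt: string,
): Promise<string> {
  let model = await models.findOne(modelId);
  if (!model) {
    await models.pull(modelId);
    model = await models.findOne(modelId);
  }
  await models.load(modelId);

  const engine = (await extensions.findAll()).find((e) => e.name === model?.engine);
  if (!engine) throw new Error(`No engine extension found for model ${modelId}`);

  let formatted = '';
  for await (const chunk of engine.inference({
    model: modelId,
    messages: [{ role: 'user', content: prompt }],
  })) {
    formatted += chunk;  // accumulate/format the streamed tokens
  }
  return formatted;
}
```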

36 changes: 35 additions & 1 deletion docs/benchmarking-architecture.mdx
@@ -27,7 +27,41 @@ The benchmarking capability comprises several key components:
5. **Web Interface**: The processed data and benchmarking results are made available to users through the web interface, which is the Cortex CLI. Users can view the benchmarking reports and convert them to `JSON` from this interface.

## Implementation
![benchmark-implementation](/img/docs/benchmark-flow.png)
```mermaid
sequenceDiagram
participant User
participant Command as Command Line
participant Config as Config Loader
participant System as System Information
participant API as OpenAI API
participant Calc as Calculations
participant File as File System
User->>Command: Run script with --config option
Command->>Config: Validate config path
Config->>Config: Read and parse YAML file
Config-->>Command: Return config object
Command->>System: Get initial system resource
System-->>Command: Return initial data
loop Each Benchmark Round (num_rounds)
Command->>API: Request chat completions
API-->>Command: Stream responses
Command->>Calc: Calculate metrics (Token Count, TTFT)
Calc-->>Command: Return intermediate result
Command->>System: Get end system resources post API call
System-->>Command: Return end data
Command->>Calc: Compute resource change, Throughput, TPOT, Latency, context length
Calc-->>Command: Return comprehensive metrics
end
Command->>Calc: Aggregate result to calculate percentiles (p50, p75, p95)
Calc-->>Command: Return aggregated metrics
Command->>File: Write result to JSON (include metrics and hardware changes)
File-->>User: Save output.json
```
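
Before the step-by-step breakdown below, here is a minimal TypeScript sketch of how the per-round metrics and percentile aggregation from the diagram could be computed. The field names and formulas are illustrative assumptions rather than the exact benchmarking code.

```typescript
// Hypothetical sketch of the per-round metrics and percentile aggregation
// shown in the diagram; field names and formulas are assumptions, not the
// exact code behind `cortex benchmark`.

interface RoundMetrics {
  ttftMs: number;      // time to first token
  latencyMs: number;   // total request latency
  tokenCount: number;  // completion tokens produced
}

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function aggregate(rounds: RoundMetrics[]) {
  const latencies = rounds.map((r) => r.latencyMs).sort((a, b) => a - b);
  const ttfts = rounds.map((r) => r.ttftMs).sort((a, b) => a - b);

  // Throughput: tokens per second over the whole round.
  // TPOT (time per output token): generation time divided by tokens after the first.
  const throughput = rounds.map((r) => r.tokenCount / (r.latencyMs / 1000));
  const tpot = rounds.map((r) => (r.latencyMs - r.ttftMs) / Math.max(1, r.tokenCount - 1));

  return {
    latency: { p50: percentile(latencies, 50), p75: percentile(latencies, 75), p95: percentile(latencies, 95) },
    ttft: { p50: percentile(ttfts, 50), p75: percentile(ttfts, 75), p95: percentile(ttfts, 95) },
    avgThroughputTokensPerSec: throughput.reduce((a, b) => a + b, 0) / throughput.length,
    avgTpotMs: tpot.reduce((a, b) => a + b, 0) / tpot.length,
  };
}
```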

The diagram illustrates how benchmarking is implemented in the Cortex environment:
1. **User**:
- The user runs the benchmarking command with a `--config` option.
61 changes: 58 additions & 3 deletions docs/cortex-onnx.mdx
@@ -97,7 +97,23 @@ These are the main components that interact to provide an API for `inference` ta
### Communication Protocols

#### Load a Model
![Load Model](/img/docs/onnx-2.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant ONNX as Cortex.onnx

JS->>CPP: HTTP request load model
CPP->>CPP: Load engine
CPP->>ONNX: Load model
ONNX->>ONNX: Cache chat template
ONNX->>ONNX: Create onnx model
ONNX->>ONNX: Create tokenizer
ONNX-->>CPP: Callback
CPP-->>JS: HTTP response

```
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine in Cortex:
1. **HTTP Request from cortex-js to cortex-cpp**:
@@ -122,7 +138,28 @@ The diagram above illustrates the interaction between three components: `cortex-
- `cortex-cpp` sends an HTTP response back to `cortex-js`, indicating that the model has been successfully loaded and is ready for use.
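
As a rough illustration of the first step, the load-model request from `cortex-js` to `cortex-cpp` could look like the sketch below; the route, port, and payload fields are placeholders rather than the exact `cortex-cpp` API.

```typescript
// Hypothetical illustration of the load-model HTTP request in the diagram.
// The address, route, and payload fields are assumptions for illustration.

async function loadModel(modelPath: string): Promise<void> {
  const res = await fetch('http://127.0.0.1:3928/loadmodel', {  // assumed local cortex-cpp address
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model_path: modelPath,   // assumed field: path to the ONNX model directory
      engine: 'cortex.onnx',   // assumed field: which engine should handle the model
    }),
  });
  if (!res.ok) throw new Error(`Load model failed: ${res.status}`);
  // cortex-cpp only responds after the cortex.onnx callback fires, i.e. once
  // the chat-template cache, ONNX model, and tokenizer are ready.
}
```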
#### Stream Inference
![Stream Inference](/img/docs/onnx-3.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant ONNX as Cortex.onnx
JS->>CPP: HTTP request chat completion
CPP->>ONNX: Request chat completion
ONNX->>ONNX: Apply chat template
ONNX->>ONNX: Encode
ONNX->>ONNX: Set search options
ONNX->>ONNX: Create generator
loop Wait for done
ONNX->>ONNX: Compute logits
ONNX->>ONNX: Generate next token
ONNX->>ONNX: Decode new token
ONNX-->>CPP: Callback
CPP-->>JS: HTTP stream response
end
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine to call the `chat completions endpoint` with the stream inference option:

1. **HTTP Request from cortex-js to `cortex-cpp`**:
@@ -154,7 +191,25 @@ The diagram above illustrates the interaction between three components: `cortex-
- `cortex-js` waits until the entire response is received and the process is completed.
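
A minimal client-side sketch of this streaming exchange is shown below; the endpoint and the OpenAI-style SSE chunk format are assumptions for illustration.

```typescript
// Hypothetical sketch of consuming the streamed chat-completion response.
// Each callback from cortex.onnx surfaces here as another chunk of the HTTP stream.

async function streamChatCompletion(model: string, prompt: string): Promise<string> {
  const res = await fetch('http://127.0.0.1:3928/v1/chat/completions', {  // assumed route
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, stream: true, messages: [{ role: 'user', content: prompt }] }),
  });

  const decoder = new TextDecoder();
  let answer = '';
  for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
    for (const line of decoder.decode(chunk, { stream: true }).split('\n')) {
      if (!line.startsWith('data: ') || line.includes('[DONE]')) continue;
      const delta = JSON.parse(line.slice(6)).choices?.[0]?.delta?.content ?? '';
      answer += delta;  // accumulate tokens as they arrive
    }
  }
  return answer;
}
```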

#### Non-stream Inference
![Non-stream Inference](/img/docs/onnx-4.png)

```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant ONNX as Cortex.onnx
JS->>CPP: HTTP request chat completion
CPP->>ONNX: Request chat completion
ONNX->>ONNX: Apply chat template
ONNX->>ONNX: Encode
ONNX->>ONNX: Set search options
ONNX->>ONNX: Create generator
ONNX->>ONNX: Generate output
ONNX->>ONNX: Decode output
ONNX-->>CPP: Callback
CPP-->>JS: HTTP response
```
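
For contrast with the streaming case, a hypothetical non-streaming call is sketched below: `cortex-cpp` replies with a single JSON body once `cortex.onnx` has generated and decoded the full output. The endpoint and payload shape are assumptions.

```typescript
// Hypothetical non-streaming request: one complete JSON response, no chunks.
// The route and payload shape are assumptions for illustration.

async function chatCompletion(model: string, prompt: string): Promise<string> {
  const res = await fetch('http://127.0.0.1:3928/v1/chat/completions', {  // assumed route
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, stream: false, messages: [{ role: 'user', content: prompt }] }),
  });
  const body = await res.json();  // single complete response body
  return body.choices?.[0]?.message?.content ?? '';
}
```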
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine to call the `chat completions endpoint` with the non-stream inference option:

1. **HTTP Request from `cortex-js` to `cortex-cpp`**:
42 changes: 40 additions & 2 deletions docs/cortex-tensorrt-llm.mdx
@@ -119,7 +119,24 @@ These are the main components that interact to provide an API for `inference` ta
### Communication Protocols

#### Load a Model
![Load Model](/img/docs/trt-2.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant TRT as Cortex.tensorrt-llm

JS->>CPP: HTTP request load model
CPP->>CPP: Load model
CPP->>TRT: Load model
TRT->>TRT: Cache chat template
TRT->>TRT: Create tokenizer
TRT->>TRT: Load config
TRT->>TRT: Initialize GPT Session
TRT-->>CPP: Callback
CPP-->>JS: HTTP response

```
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.tensorrt-llm` when using the `tensorrt-llm` engine in Cortex:
1. **HTTP Request Load Model (cortex-js to cortex-cpp)**:
@@ -151,7 +168,28 @@ The diagram above illustrates the interaction between three components: `cortex-
#### Inference
![Inference](/img/docs/trt-3.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant TRT as Cortex.tensorrt-llm
JS->>CPP: HTTP request chat completion
CPP->>TRT: Request chat completion
TRT->>TRT: Apply chat template
TRT->>TRT: Encode
TRT->>TRT: Set sampling config
TRT->>TRT: Create generator input/output
loop Wait for done
TRT->>TRT: Copy new token from GPU
TRT->>TRT: Decode new token
TRT-->>CPP: Callback
CPP-->>JS: HTTP stream response
end
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.tensorrt-llm` when using the `tensorrt-llm` engine to call the `chat completions endpoint` with the inference option:

1. **HTTP Request Chat Completion (cortex-js to cortex-cpp)**:
16 changes: 15 additions & 1 deletion docs/telemetry-architecture.mdx
@@ -154,7 +154,21 @@ The telemetry capability comprises several key components:
}
```
## Implementation
![catch-crash.png](/img/docs/telemetry2.png)
```mermaid
sequenceDiagram
participant User
participant Server as Cortex Server Main Process
participant Child as Child Process
participant Dir as Local Dir
User->>Server: Command/Request
Server->>Child: Fork child process
Server-->>Server: Catch uncaughtException, unhandledRejection event from process
Server->>Dir: Fork child process
Child->>Server: Interval ping to main process
Child->>Dir: Insert crash report
```
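
A hypothetical sketch of the main-process side of this flow, using Node.js built-ins, is shown below; the worker script name, message shapes, and timing are assumptions rather than the actual cortex-js telemetry code.

```typescript
// Hypothetical sketch of the crash-handling flow above, using Node built-ins.
// The worker script, message shapes, and report handling are assumptions.
import { fork } from 'node:child_process';

export function startCrashTelemetry(): void {
  // The server forks a dedicated telemetry child process on startup.
  const child = fork('./telemetry-worker.js');  // hypothetical worker script

  // The child pings the main process at an interval; if the pings stop being
  // answered, it assumes the server died and inserts a crash report into the
  // local telemetry directory.
  child.on('message', (msg: { type?: string }) => {
    if (msg.type === 'ping') child.send({ type: 'pong', at: Date.now() });
  });

  // Crashes caught in the main process are handed to the child, which writes
  // the report locally before the server exits.
  const forward = (event: string, error: unknown) =>
    child.send({ type: 'crash-report', event, message: String(error), at: new Date().toISOString() });

  process.on('uncaughtException', (err) => forward('uncaughtException', err));
  process.on('unhandledRejection', (reason) => forward('unhandledRejection', reason));
}
```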

The diagram illustrates how crash handling is implemented in the Cortex server environment:

1. **User Interaction**
