
Merge pull request #180 from janhq/pena-patch
Update the Runtime to use mermaid.js
irfanpena authored Aug 16, 2024
2 parents 6fc799c + a15da7b commit 603a62d
Showing 5 changed files with 163 additions and 8 deletions.
16 changes: 15 additions & 1 deletion docs/architecture.md
@@ -110,7 +110,21 @@ main.ts # Entrypoint
```

## Runtime
![cortex runtime](/img/docs/cortex-runtime.png)
```mermaid
sequenceDiagram
User-)Cortex: "Tell me a joke"
Cortex->>Model Controller/Service: Pull the Model
Cortex->>Model Controller/Service: Load the model
Cortex->>Chat Controller/Service: createChatCompletions()
Chat Controller/Service -->> Model Entity: findOne()
Cortex->>Model Entity: Store the model data
Chat Controller/Service -->> Extension Repository: findAll()
%% Responses
Extension Repository ->> Chat Controller/Service: Response stream
Chat Controller/Service ->> Chat Controller/Service: Formatted response/stream
Chat Controller/Service ->> User: "Your mama"
```

The sequence diagram above outlines the interactions between various components in the Cortex system during runtime, particularly when handling user requests via a CLI. Here’s a detailed breakdown of the runtime sequence:
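To make the flow concrete, here is a minimal TypeScript sketch of the same sequence. The interface and method names are illustrative assumptions and may not match the actual cortex-js controllers, services, and entities.

```typescript
// A minimal, hypothetical sketch of the runtime sequence above. Interface and
// method names are illustrative assumptions; the actual cortex-js controllers,
// services, and entities may differ.

interface ModelEntity { id: string; engine: string; }

interface ModelService {
  findOne(id: string): Promise<ModelEntity | null>;  // Model Entity lookup
  pull(id: string): Promise<void>;                   // download model artifacts
  load(id: string): Promise<void>;                   // load the model into its engine
}

interface EngineExtension {
  name: string;
  inference(req: {
    model: string;
    messages: { role: string; content: string }[];
  }): AsyncIterable<string>;
}

interface ExtensionRepository {
  findAll(): Promise<EngineExtension[]>;             // registered engine extensions
}

// Mirrors the diagram: pull and load the model, resolve its engine extension,
// then format the response stream before returning it to the user.
async function createChatCompletions(
  models: ModelService,
  extensions: ExtensionRepository,
  modelId: string,
  prompt: string,
): Promise<string> {
  let model = await models.findOne(modelId);
  if (!model) {
    await models.pull(modelId);
    model = await models.findOne(modelId);
  }
  await models.load(modelId);

  const engine = (await extensions.findAll()).find((e) => e.name === model?.engine);
  if (!engine) throw new Error(`No engine extension found for model ${modelId}`);

  let formatted = '';
  for await (const chunk of engine.inference({
    model: modelId,
    messages: [{ role: 'user', content: prompt }],
  })) {
    formatted += chunk;  // accumulate/format the streamed tokens
  }
  return formatted;
}
```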

36 changes: 35 additions & 1 deletion docs/benchmarking-architecture.mdx
@@ -27,7 +27,41 @@ The benchmarking capability comprises several key components:
5. **Web Interface**: The processed data and benchmarking results are made available to users through the web interface, which is the Cortex CLI. Users can view the benchmarking reports and convert them to `JSON` from this interface.

## Implementation
![benchmark-implementation](/img/docs/benchmark-flow.png)
```mermaid
sequenceDiagram
participant User
participant Command as Command Line
participant Config as Config Loader
participant System as System Information
participant API as OpenAI API
participant Calc as Calculations
participant File as File System
User->>Command: Run script with --config option
Command->>Config: Validate config path
Config->>Config: Read and parse YAML file
Config-->>Command: Return config object
Command->>System: Get initial system resource
System-->>Command: Return initial data
loop Each Benchmark Round (num_rounds)
Command->>API: Request chat completions
API-->>Command: Stream responses
Command->>Calc: Calculate metrics (Token Count, TTFT)
Calc-->>Command: Return intermediate result
Command->>System: Get end system resources post API call
System-->>Command: Return end data
Command->>Calc: Compute resource change, Throughput, TPOT, Latency, context length
Calc-->>Command: Return comprehensive metrics
end
Command->>Calc: Aggregate result to calculate percentiles (p50, p75, p95)
Calc-->>Command: Return aggregated metrics
Command->>File: Write result to JSON (include metrics and hardware changes)
File-->>User: Save output.json
```
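
Before the step-by-step breakdown below, here is a minimal TypeScript sketch of how the per-round metrics and percentile aggregation from the diagram could be computed. The field names and formulas are illustrative assumptions rather than the exact benchmarking code.

```typescript
// Hypothetical sketch of the per-round metrics and percentile aggregation
// shown in the diagram; field names and formulas are assumptions, not the
// exact code behind `cortex benchmark`.

interface RoundMetrics {
  ttftMs: number;      // time to first token
  latencyMs: number;   // total request latency
  tokenCount: number;  // completion tokens produced
}

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function aggregate(rounds: RoundMetrics[]) {
  const latencies = rounds.map((r) => r.latencyMs).sort((a, b) => a - b);
  const ttfts = rounds.map((r) => r.ttftMs).sort((a, b) => a - b);

  // Throughput: tokens per second over the whole round.
  // TPOT (time per output token): generation time divided by tokens after the first.
  const throughput = rounds.map((r) => r.tokenCount / (r.latencyMs / 1000));
  const tpot = rounds.map((r) => (r.latencyMs - r.ttftMs) / Math.max(1, r.tokenCount - 1));

  return {
    latency: { p50: percentile(latencies, 50), p75: percentile(latencies, 75), p95: percentile(latencies, 95) },
    ttft: { p50: percentile(ttfts, 50), p75: percentile(ttfts, 75), p95: percentile(ttfts, 95) },
    avgThroughputTokensPerSec: throughput.reduce((a, b) => a + b, 0) / throughput.length,
    avgTpotMs: tpot.reduce((a, b) => a + b, 0) / tpot.length,
  };
}
```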

The diagram illustrates how benchmarking is implemented in the Cortex environment:
1. **User**:
- The user runs the benchmarking command with a `--config` option.
61 changes: 58 additions & 3 deletions docs/cortex-onnx.mdx
@@ -97,7 +97,23 @@ These are the main components that interact to provide an API for `inference` ta
### Communication Protocols

#### Load a Model
![Load Model](/img/docs/onnx-2.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant ONNX as Cortex.onnx

JS->>CPP: HTTP request load model
CPP->>CPP: Load engine
CPP->>ONNX: Load model
ONNX->>ONNX: Cache chat template
ONNX->>ONNX: Create onnx model
ONNX->>ONNX: Create tokenizer
ONNX-->>CPP: Callback
CPP-->>JS: HTTP response

```
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine in Cortex:
1. **HTTP Request from cortex-js to cortex-cpp**:
@@ -122,7 +138,28 @@ The diagram above illustrates the interaction between three components: `cortex-
- `cortex-cpp` sends an HTTP response back to `cortex-js`, indicating that the model has been successfully loaded and is ready for use.
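
As a rough illustration of the first step, the load-model request from `cortex-js` to `cortex-cpp` could look like the sketch below; the route, port, and payload fields are placeholders rather than the exact `cortex-cpp` API.

```typescript
// Hypothetical illustration of the load-model HTTP request in the diagram.
// The address, route, and payload fields are assumptions for illustration.

async function loadModel(modelPath: string): Promise<void> {
  const res = await fetch('http://127.0.0.1:3928/loadmodel', {  // assumed local cortex-cpp address
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model_path: modelPath,   // assumed field: path to the ONNX model directory
      engine: 'cortex.onnx',   // assumed field: which engine should handle the model
    }),
  });
  if (!res.ok) throw new Error(`Load model failed: ${res.status}`);
  // cortex-cpp only responds after the cortex.onnx callback fires, i.e. once
  // the chat-template cache, ONNX model, and tokenizer are ready.
}
```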
#### Stream Inference
![Stream Inference](/img/docs/onnx-3.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant ONNX as Cortex.onnx
JS->>CPP: HTTP request chat completion
CPP->>ONNX: Request chat completion
ONNX->>ONNX: Apply chat template
ONNX->>ONNX: Encode
ONNX->>ONNX: Set search options
ONNX->>ONNX: Create generator
loop Wait for done
ONNX->>ONNX: Compute logits
ONNX->>ONNX: Generate next token
ONNX->>ONNX: Decode new token
ONNX-->>CPP: Callback
CPP-->>JS: HTTP stream response
end
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine to call the `chat completions endpoint` with the stream inference option:

1. **HTTP Request from cortex-js to `cortex-cpp`**:
@@ -154,7 +191,25 @@ The diagram above illustrates the interaction between three components: `cortex-
- `cortex-js` waits until the entire response is received and the process is completed.
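
A minimal client-side sketch of this streaming exchange is shown below; the endpoint and the OpenAI-style SSE chunk format are assumptions for illustration.

```typescript
// Hypothetical sketch of consuming the streamed chat-completion response.
// Each callback from cortex.onnx surfaces here as another chunk of the HTTP stream.

async function streamChatCompletion(model: string, prompt: string): Promise<string> {
  const res = await fetch('http://127.0.0.1:3928/v1/chat/completions', {  // assumed route
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, stream: true, messages: [{ role: 'user', content: prompt }] }),
  });

  const decoder = new TextDecoder();
  let answer = '';
  for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
    for (const line of decoder.decode(chunk, { stream: true }).split('\n')) {
      if (!line.startsWith('data: ') || line.includes('[DONE]')) continue;
      const delta = JSON.parse(line.slice(6)).choices?.[0]?.delta?.content ?? '';
      answer += delta;  // accumulate tokens as they arrive
    }
  }
  return answer;
}
```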

#### Non-stream Inference
![Non-stream Inference](/img/docs/onnx-4.png)

```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant ONNX as Cortex.onnx
JS->>CPP: HTTP request chat completion
CPP->>ONNX: Request chat completion
ONNX->>ONNX: Apply chat template
ONNX->>ONNX: Encode
ONNX->>ONNX: Set search options
ONNX->>ONNX: Create generator
ONNX->>ONNX: Generate output
ONNX->>ONNX: Decode output
ONNX-->>CPP: Callback
CPP-->>JS: HTTP response
```
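
For contrast with the streaming case, a hypothetical non-streaming call is sketched below: `cortex-cpp` replies with a single JSON body once `cortex.onnx` has generated and decoded the full output. The endpoint and payload shape are assumptions.

```typescript
// Hypothetical non-streaming request: one complete JSON response, no chunks.
// The route and payload shape are assumptions for illustration.

async function chatCompletion(model: string, prompt: string): Promise<string> {
  const res = await fetch('http://127.0.0.1:3928/v1/chat/completions', {  // assumed route
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, stream: false, messages: [{ role: 'user', content: prompt }] }),
  });
  const body = await res.json();  // single complete response body
  return body.choices?.[0]?.message?.content ?? '';
}
```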
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.onnx` when using the `onnx` engine to call the `chat completions endpoint` with the non-stream inference option:

1. **HTTP Request from `cortex-js` to `cortex-cpp`**:
42 changes: 40 additions & 2 deletions docs/cortex-tensorrt-llm.mdx
@@ -119,7 +119,24 @@ These are the main components that interact to provide an API for `inference` ta
### Communication Protocols

#### Load a Model
![Load Model](/img/docs/trt-2.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant TRT as Cortex.tensorrt-llm

JS->>CPP: HTTP request load model
CPP->>CPP: Load model
CPP->>TRT: Load model
TRT->>TRT: Cache chat template
TRT->>TRT: Create tokenizer
TRT->>TRT: Load config
TRT->>TRT: Initialize GPT Session
TRT-->>CPP: Callback
CPP-->>JS: HTTP response

```
The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.tensorrt-llm` when using the `tensorrt-llm` engine in Cortex:
1. **HTTP Request Load Model (cortex-js to cortex-cpp)**:
@@ -151,7 +168,28 @@ The diagram above illustrates the interaction between three components: `cortex-
#### Inference
![Inference](/img/docs/trt-3.png)
```mermaid
sequenceDiagram
participant JS as Cortex-JS
participant CPP as Cortex-CPP
participant TRT as Cortex.tensorrt-llm
JS->>CPP: HTTP request chat completion
CPP->>TRT: Request chat completion
TRT->>TRT: Apply chat template
TRT->>TRT: Encode
TRT->>TRT: Set sampling config
TRT->>TRT: Create generator input/output
loop Wait for done
TRT->>TRT: Copy new token from GPU
TRT->>TRT: Decode new token
TRT-->>CPP: Callback
CPP-->>JS: HTTP stream response
end
```

The diagram above illustrates the interaction between three components: `cortex-js`, `cortex-cpp`, and `cortex.tensorrt-llm` when using the `tensorrt-llm` engine to call the `chat completions endpoint` with the inference option:

1. **HTTP Request Chat Completion (cortex-js to cortex-cpp)**:
16 changes: 15 additions & 1 deletion docs/telemetry-architecture.mdx
@@ -154,7 +154,21 @@ The telemetry capability comprises several key components:
}
```
## Implementation
![catch-crash.png](/img/docs/telemetry2.png)
```mermaid
sequenceDiagram
participant User
participant Server as Cortex Server Main Process
participant Child as Child Process
participant Dir as Local Dir
User->>Server: Command/Request
Server->>Child: Fork child process
Server-->>Server: Catch uncaughtException, unhandledRejection event from process
Server->>Dir: Fork child process
Child->>Server: Interval ping to main process
Child->>Dir: Insert crash report
```
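
A hypothetical sketch of the main-process side of this flow, using Node.js built-ins, is shown below; the worker script name, message shapes, and timing are assumptions rather than the actual cortex-js telemetry code.

```typescript
// Hypothetical sketch of the crash-handling flow above, using Node built-ins.
// The worker script, message shapes, and report handling are assumptions.
import { fork } from 'node:child_process';

export function startCrashTelemetry(): void {
  // The server forks a dedicated telemetry child process on startup.
  const child = fork('./telemetry-worker.js');  // hypothetical worker script

  // The child pings the main process at an interval; if the pings stop being
  // answered, it assumes the server died and inserts a crash report into the
  // local telemetry directory.
  child.on('message', (msg: { type?: string }) => {
    if (msg.type === 'ping') child.send({ type: 'pong', at: Date.now() });
  });

  // Crashes caught in the main process are handed to the child, which writes
  // the report locally before the server exits.
  const forward = (event: string, error: unknown) =>
    child.send({ type: 'crash-report', event, message: String(error), at: new Date().toISOString() });

  process.on('uncaughtException', (err) => forward('uncaughtException', err));
  process.on('unhandledRejection', (reason) => forward('unhandledRejection', reason));
}
```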

The diagram illustrates how crash handling is implemented in the Cortex server environment:

1. **User Interaction**
