How to add a new custom table on ClickHouse

First-time login

Skip this part if you already have access to the PyTorch Dev Infra ClickHouse cluster at https://console.clickhouse.cloud

For metamates, go to https://console.clickhouse.cloud and log in with your Meta email. The portal uses SSO, so you just need to follow the steps in your browser to request access. We grant read-only access by default.

Prepare the schema and create the table

The first thing to do is to take a look at https://clickhouse.com/docs/en/sql-reference/statements/create/table to get familiar with CH SQL syntax. The list of all databases and tables on CH is at https://github.com/pytorch/test-infra/wiki/Available-databases-on-ClickHouse.

All ClickHouse tables are backed by either S3 for immutable records like metrics or DynamoDB for mutable records like GitHub events. In both cases, you will need to:

  1. Write down your CREATE TABLE query with the schema and submit it for review, e.g. https://github.com/pytorch/test-infra/pull/5839. Once it's approved, you can create the table yourself on the CH Cloud console if you have the necessary permission, or ping the reviewer to create it for you. A minimal CREATE TABLE sketch is shown after this list.
    1. An important note: a table backed by S3 requires a _meta Tuple(bucket String, key String) column, which the replicator later uses to store the S3 path.
  2. For immutable records on S3, follow https://github.com/pytorch/test-infra/wiki/Grant-a-workflow-permission-to-upload-files-to-S3 to make sure that the workflow that uploads the data has the permission to do so.
  2. For mutable records on DynamoDB, a similar process applies, but we will need to create a new DynamoDB table instead:
    1. Submit a PR to create a new DynamoDB table at https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/runners/dynamo.tf; remember to enable streaming by setting stream_enabled to true.
    2. Submit a PR to create a new OIDC role with the permission to write to the table, e.g. https://github.com/pytorch-labs/pytorch-gha-infra/pull/422.
    3. Use the new role in your workflow, e.g. https://github.com/pytorch/pytorch/pull/129544, and start writing into DynamoDB.
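
For reference, here is a minimal sketch of what the CREATE TABLE query from step 1 could look like for an S3-backed table. The table name, columns, engine, and ordering key below are made up for illustration (check the linked PR and the ClickHouse docs for a real schema); the one structural requirement is the _meta column used by the replicator:

create table fortesting.oss_ci_example (
    -- illustrative columns only, use whatever fields your records need
    `timestamp` UInt64,
    `repo` String,
    `metric_name` String,
    `metric_value` Float32,
    -- required for S3-backed tables: the replicator stores the S3 path here
    `_meta` Tuple(bucket String, key String)
) engine = MergeTree
order by (repo, metric_name, timestamp)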

Update the replicator

After the records are on S3 or DynamoDB, the next step is to update the replicator lambdas to make them available on the corresponding CH table. There are two replicators at the moment:

  1. The S3 replicator is at https://github.com/pytorch/test-infra/tree/main/aws/lambda/clickhouse-replicator-s3. https://github.com/pytorch/test-infra/pull/5921 is an example pull request that maps the S3 path ossci-benchmarks/v3/** to the corresponding CH table benchmark.oss_ci_benchmark_v3. The S3 replicator, however, won't delete the immutable records on CH even when the files on S3 are deleted.
  2. Similarly, the DynamoDB replicator is at https://github.com/pytorch/test-infra/tree/main/aws/lambda/clickhouse-replicator-dynamo. This replicator handles all insert, update, and delete events from DynamoDB.

Update the trigger

This part is not yet in Terraform, so it is currently done manually via the AWS console by going to the replicator lambda (either S3 or DynamoDB) / Configuration / Triggers. Please reach out to the team if you need help here.

If you are using DynamoDB, ensure that the lambda has permissions for the table here.

Testing

Print out the insert query and test it manually on the CH Cloud console. Use the fortesting database to verify that the insertion works.

Here is an example insert query for S3:

insert into
    benchmark.oss_ci_benchmark_v3
select
    *,
    (
        'ossci-benchmarks',
        'v3/pytorch/pytorch/11862562618/33067186757/add_loop_inductor_dynamic_gpu.json'
    ) as _meta
from
    s3(
        'https://ossci-benchmarks.s3.amazonaws.com/v3/pytorch/pytorch/11862562618/33067186757/add_loop_inductor_dynamic_gpu.json',
        'JSONEachRow',
        '
`timestamp` UInt64,
`schema_version` String,
`name` String,
`repo` String,
`head_branch` String,
`head_sha` String,
`workflow_id` UInt64,
`run_attempt` UInt32,
`job_id` UInt64,
`servicelab_experiment_id` UInt64,
`servicelab_trial_id` UInt64,
`runners` Array(
    Tuple(
        name String,
        type String,
        cpu_info String,
        cpu_count UInt32,
        mem_info String,
        avail_mem_in_gb UInt32,
        gpu_info String,
        gpu_count UInt32,
        gpu_mem_info String,
        avail_gpu_mem_in_gb UInt32,
        extra_info Map(String, String)
    )
),
`benchmark` Tuple(
    name String,
    mode String,
    dtype String,
    extra_info Map(String, String)
),
`model` Tuple (
    name String,
    type String,
    backend String,
    origins Array(String),
    extra_info Map(String, String)
),
`inputs` Map(
    String,
    Tuple(dtype String, extra_info Map(String, String))
),
`dependencies` Map(
    String,
    Tuple(
        repo String,
        branch String,
        sha String,
        version String,
        extra_info Map(String, String)
    )
),
`metric` Tuple(
    name String,
    benchmark_values Array(Float32),
    target_value Float32,
    extra_info Map(String, String)
)
',
        'gzip',
        extra_credentials(
            role_arn = 'arn:aws:iam::308535385114:role/clickhouse_role'
        )
    )

Some notes:

  • A _meta column is used in the select statement to store the S3 path.
  • Don't put a trailing comma at the end of the schema string; CH doesn't like that at all.
  • Use gzip if the file on S3 is compressed; otherwise, use none for plain text.
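
After the insert succeeds, a quick sanity check is to count the rows that came from that S3 object by filtering on the _meta column. This is a sketch that reuses the table and key from the example above; swap in your own values:

select
    count() as num_rows
from
    benchmark.oss_ci_benchmark_v3
where
    tupleElement(_meta, 'key') = 'v3/pytorch/pytorch/11862562618/33067186757/add_loop_inductor_dynamic_gpu.json'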