Skip to content

Latest commit

 

History

History
53 lines (46 loc) · 1.49 KB

README.md

File metadata and controls

53 lines (46 loc) · 1.49 KB

gh-archive-clickhouse

Save public GitHub event stream to ClickHouse as json.

CREATE TABLE github_events_raw
(
    id  Int64,
    ts  DateTime32,
    raw String CODEC (ZSTD(16))
) ENGINE = ReplacingMergeTree
      PARTITION BY toYYYYMMDD(ts)
      ORDER BY (ts, id);

Alternative to gharchive crawler with decreased probability to miss events.

  • Streaming to ClickHouse via native protocol instead of using files, so storage and fethching are decoupled.
  • Automatic pagination if more than one page of new events is available
  • Automatic fetch rate adjustment based on rate limit github headers and request duration
  • ETag support to skip cached results

gh-load

$ gh-load --help
Usage of gh-load:
  -b int
    	max token bytes (default 104857600)
  -db string
    	db
  -from string
    	start from (default "2022-03-14T21")
  -host string
    	host (default "localhost")
  -jobs int
    	jobs (default 1)
  -port int
    	port (default 9000)
  -table string
    	table (default "github_events_raw")
  -to string
    	end with (default "2022-04-14T21")

Example usage (consumes ~340MB RAM per job):

gh-load --db faster --from 2020-01-01T00 --to 2020-01-20T00 --jobs 20