DB v6 distribution approach #2125

Open
wagoodman opened this issue Sep 17, 2024 · 4 comments

@wagoodman

wagoodman commented Sep 17, 2024

Today the grype DB is distributed via a hosted listing.json file that contains URLs to DBs, including historical entries going back N days. There are a few points to note:

  • The listing file serves two purposes: to find the latest DB and access historical DBs. The former is the primary use case of the listing file, the latter is added weight.
  • The listing file uses absolute URLs, not relative paths. This makes a listing file non-portable between environments, so it must be rebuilt for each environment it is deployed to.
  • Distribution definitions are not tied to a DB schema version, and the listing contains entries for all schema versions. This prevents us from making breaking changes to the listing file format itself. It also requires more coordination when updating the listing file nightly (upload all databases in a fan-out, then fan in to update the listing; ideally there would be no need to fan in, which causes lots of fun failure modes to think through). This is a grype-db repo concern, but it motivates the changes here.
  • We do not check the DB checksum on start, mostly because this is expensive, but there have also been inefficiencies in this load path.

Based on these points, here are the suggested changes:

  • we support listings for a single DB schema only -- each new schema will be hosted in a new location.
  • replace the single listing file with two files: latest.json and history.json, split based on use case. This means that the most common use case (latest.json) is as small as possible, removing pressure from the CDN.
  • split the db.Curator by use case: DB distribution vs access to an already installed DB.
  • use xxh64 for DB checksum (not sha256), which is rather quick when checking large DB files
  • use relative URLs (relative to where the latest.json/history.json files are hosted, not absolute ones). Note: we should still be able to express absolute URLs for operational fallback positions, but this should be the exception, not the norm.
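The relative-URL suggestion above could be sketched roughly as follows. This is a minimal illustration, not the grype implementation: the function name resolveArchiveURL and the example.com hosting locations are assumptions made up for the example. It resolves an archive path against the URL that latest.json was fetched from, and lets an absolute URL pass through unchanged for the operational fallback case.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveArchiveURL (hypothetical helper) resolves an archive path found in
// latest.json against the URL that latest.json itself was fetched from.
// An absolute archive URL is returned unchanged, which covers the fallback case.
func resolveArchiveURL(latestURL, archivePath string) (string, error) {
	base, err := url.Parse(latestURL)
	if err != nil {
		return "", fmt.Errorf("invalid latest.json URL: %w", err)
	}
	ref, err := url.Parse(archivePath)
	if err != nil {
		return "", fmt.Errorf("invalid archive path: %w", err)
	}
	// ResolveReference applies standard relative-reference resolution.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// Hypothetical hosting location, used only for illustration.
	const latest = "https://example.com/grype/databases/v6/latest.json"

	paths := []string{
		"archives/db.tar.gz",                            // relative: the norm
		"https://mirror.example.com/fallback/db.tar.gz", // absolute: the exception
	}
	for _, p := range paths {
		resolved, err := resolveArchiveURL(latest, p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(resolved)
	}
}
```

One detail worth keeping in mind with this approach: Go's url.Parse rejects a relative reference whose first path segment contains a colon, so an archive file name with a timestamp in it would need a directory prefix (or a leading "./") to be parsed this way.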

latest.json file

{
  "schemaVersion": 6,
  "status": "active",
  "archive": {
    "database": {
      "built": "2024-08-23T12:34:56Z",
      "checksum": "xxhash64:1a2b3c4d5e6f7a8b",
      "providers": [
        {
          "name": "nvd",
          "compiled": "2024-08-23T08:00:00Z"
        },
        {
          "name": "github",
          "compiled": "2024-08-23T09:00:00Z"
        },
        ...
      ]
    },
    "path": "databases/v6/grype-db_v6_2024-08-23T11:22:22Z_1724213998.tar.gz",
    "checksum": "sha256:dd0e762e39a5905f9a622f00a361b6036c811b33bf9c5139fddaf5013db904d9"
  }
}

This file would describe only a single DB. This also combines the metadata.json and provider-metadata.json concerns (so only metadata.json needs to be packaged into the tar).
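For reference, the proposed latest.json shape could be read on the client side with types along these lines. This is only a sketch: the type and function names (Latest, Archive, Database, Provider, parseLatest) are assumptions chosen for the example rather than names from the prototype branch, and the sample document is a trimmed copy of the one above with a made-up archive path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Provider mirrors one entry of the "providers" list.
type Provider struct {
	Name     string    `json:"name"`
	Compiled time.Time `json:"compiled"`
}

// Database is the "database description" nested inside the archive.
type Database struct {
	Built     time.Time  `json:"built"`
	Checksum  string     `json:"checksum"` // checksum of the DB file itself
	Providers []Provider `json:"providers"`
}

// Archive is the "archive description": where the tar lives and its checksum.
type Archive struct {
	Database Database `json:"database"`
	Path     string   `json:"path"`
	Checksum string   `json:"checksum"` // checksum of the archive
}

// Latest is the top-level latest.json document.
type Latest struct {
	SchemaVersion int     `json:"schemaVersion"`
	Status        string  `json:"status"`
	Archive       Archive `json:"archive"`
}

// parseLatest (hypothetical helper) decodes a latest.json payload.
func parseLatest(data []byte) (*Latest, error) {
	var l Latest
	if err := json.Unmarshal(data, &l); err != nil {
		return nil, fmt.Errorf("unable to parse latest.json: %w", err)
	}
	return &l, nil
}

func main() {
	// Trimmed sample; the path is a placeholder for illustration.
	sample := []byte(`{
	  "schemaVersion": 6,
	  "status": "active",
	  "archive": {
	    "database": {
	      "built": "2024-08-23T12:34:56Z",
	      "checksum": "xxhash64:1a2b3c4d5e6f7a8b",
	      "providers": [{"name": "nvd", "compiled": "2024-08-23T08:00:00Z"}]
	    },
	    "path": "databases/v6/example.tar.gz",
	    "checksum": "sha256:dd0e762e39a5905f9a622f00a361b6036c811b33bf9c5139fddaf5013db904d9"
	  }
	}`)

	l, err := parseLatest(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(l.SchemaVersion, l.Status, l.Archive.Path, l.Archive.Database.Built.Format(time.RFC3339))
}
```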

There is a status field with possible values:

  • active: the database is actively being maintained and distributed
  • deprecated: the database is still being distributed but is approaching end of life. Upgrade grype to avoid future disruptions.
  • inactive: the database is no longer being distributed. Users must build their own databases or upgrade grype.
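A client could react to the three status values roughly as sketched below. The names (Status, checkStatus, ErrInactive) and the exact messages are assumptions for illustration; the issue only defines the values and their meaning, not how grype surfaces them.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is the value of the "status" field in latest.json.
type Status string

const (
	StatusActive     Status = "active"
	StatusDeprecated Status = "deprecated"
	StatusInactive   Status = "inactive"
)

// ErrInactive (hypothetical) signals that no further DB updates will be distributed.
var ErrInactive = errors.New("database is no longer being distributed; build your own database or upgrade grype")

// checkStatus (hypothetical helper) maps a status to an optional user-facing
// warning and an error. Only "inactive" and unknown values produce an error.
func checkStatus(s Status) (warning string, err error) {
	switch s {
	case StatusActive:
		return "", nil
	case StatusDeprecated:
		return "database schema is approaching end of life; upgrade grype to avoid future disruptions", nil
	case StatusInactive:
		return "", ErrInactive
	default:
		return "", fmt.Errorf("unknown distribution status %q", s)
	}
}

func main() {
	for _, s := range []Status{StatusActive, StatusDeprecated, StatusInactive} {
		warning, err := checkStatus(s)
		fmt.Printf("status=%s warning=%q err=%v\n", s, warning, err)
	}
}
```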

history.json file (deferred)

{
  "schemaVersion": 6,
  "status": "active",
  "archives": [
    {
      // same entry as in latest.json for "archive"
    },
    ...
  ]
}

How these distribution files relate to one another...

Another way to look at the contained information and how it is produced/consumed:

  • metadata.json (output from grype-db build) is made up of a single “database description”... used to generate a latest.json later in the process
    • Schema version
    • Built time
    • Checksum (xxh64)
    • List of provider info (name and compiled time)
  • latest.json (output from grype-db package) is made up of a single “archive description”, schema info, and the contained “database description”; it is used to populate/update history.json in the future:
    • Schema version
    • Status
    • Archive
      • Path
      • Checksum (sha256)
      • Database description
        • (same as metadata.json, except schema-version is left blank)
  • history.json is an array of “archive descriptions”, but otherwise is just like latest.json
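The archive-level checksum in the outline above uses a prefixed "sha256:<hex>" form, which a client could verify after download roughly like this. It is a sketch under stated assumptions: the helper names (sha256Digest, verifyArchive) are made up for the example, and only the sha256 archive checksum is covered because the xxh64 DB checksum needs a third-party package that is outside the Go standard library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sha256Digest (hypothetical helper) returns the digest of data in the
// "sha256:<hex>" form used by the archive checksum field.
func sha256Digest(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// verifyArchive (hypothetical helper) checks downloaded archive bytes against
// the expected prefixed checksum. Only the sha256 prefix is handled here.
func verifyArchive(data []byte, expected string) error {
	algo, _, ok := strings.Cut(expected, ":")
	if !ok || algo != "sha256" {
		return fmt.Errorf("unsupported checksum format %q", expected)
	}
	if got := sha256Digest(data); got != expected {
		return fmt.Errorf("checksum mismatch: expected %s, got %s", expected, got)
	}
	return nil
}

func main() {
	// Stand-in bytes; a real client would hash the downloaded tar.gz,
	// ideally by streaming it through the hash rather than loading it whole.
	archive := []byte("example archive contents")

	expected := sha256Digest(archive)
	fmt.Println("verify ok:", verifyArchive(archive, expected) == nil)
	fmt.Println("tampered:", verifyArchive(append(archive, 'x'), expected))
}
```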

Comments / open questions

(from earlier conversations with @anchore/tools about this topic)

  • should we remove the providers data entirely from the listing use case, so that end users must query the DB for this info?
  • should we leverage the CDN for archive compression concerns? (over compressing the payload ourselves, or even packaging into a tar)
  • should we get rid of the metadata.json and require clients to get this kind of information directly from the DB?

Prototype branch for reference: https://github.com/anchore/grype/tree/db-v6-blob-store

@wagoodman wagoodman added the enhancement New feature or request label Sep 17, 2024
@wagoodman wagoodman added this to the DB v6 milestone Sep 17, 2024
@wagoodman wagoodman added the planning high level epic that should be broken into smaller tasks label Sep 17, 2024
@wagoodman wagoodman moved this to Ready in OSS Sep 17, 2024
@wagoodman wagoodman self-assigned this Sep 26, 2024
@wagoodman wagoodman moved this from Ready to In Progress in OSS Sep 26, 2024
@wagoodman

wagoodman commented Nov 13, 2024

Two of the open questions have been addressed and incorporated:

should we get rid of the metadata.json and require clients to get this kind of information directly from the DB?

In one of the latest grype store PRs we've done just this. Now there is a vulnerability.db.checksum file that is generated by grype (not included in the distributed tar), but otherwise there is no metadata.json anymore.

should we remove the providers data entirely from the listing use case, so that end users must query the DB for this info?

The providers information has been removed from latest.json.

@wagoodman

features around history.json are being descoped from this effort -- we can always add this functionality later.

@wagoodman

Currently mostly implemented in anchore/grype-db#446, but work cannot continue until prototype matching is implemented.

@wagoodman wagoodman added the blocked Progress is being stopped by something label Dec 2, 2024
@wagoodman wagoodman removed the blocked Progress is being stopped by something label Dec 16, 2024
@wagoodman

unblocked by #2311 -- will attempt to integrate that branch here after #2335 merges
