Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datastore: Add vuln & enrich stream updates #1315

Merged
merged 2 commits into from
May 3, 2024
Merged

Conversation

jvdm
Copy link
Contributor

@jvdm jvdm commented Apr 11, 2024

Add iterator-based vulnerability/enrichment writes to the existing DB APIs.

@jvdm jvdm force-pushed the streaming branch 2 times, most recently from e08f257 to e2c4a0f Compare April 18, 2024 23:00
Copy link

codecov bot commented Apr 18, 2024

Codecov Report

Attention: Patch coverage is 31.57895% with 52 lines in your changes are missing coverage. Please review.

Project coverage is 55.76%. Comparing base (5ae53b8) to head (0831b93).

Files Patch % Lines
datastore/postgres/enrichment.go 0.00% 28 Missing ⚠️
datastore/postgres/updatevulnerabilities.go 54.54% 14 Missing and 6 partials ⚠️
libvuln/jsonblob/jsonblob.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1315      +/-   ##
==========================================
- Coverage   55.84%   55.76%   -0.09%     
==========================================
  Files         266      266              
  Lines       16704    16759      +55     
==========================================
+ Hits         9328     9345      +17     
- Misses       6411     6445      +34     
- Partials      965      969       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jvdm jvdm marked this pull request as ready for review April 19, 2024 16:50
@jvdm jvdm requested a review from a team as a code owner April 19, 2024 16:50
@jvdm jvdm requested review from hdonnay and removed request for a team April 19, 2024 16:50
Copy link
Member

@hdonnay hdonnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments on specific bits. I'd also prefer commits for Enrichment and Vulnerability changes individually.

Comment on lines 18 to 21
"github.com/quay/claircore/datastore"
"github.com/quay/zlog"

"github.com/quay/claircore/libvuln/driver"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grouping.

Comment on lines 15 to 21
"github.com/quay/zlog"

"github.com/quay/claircore"
"github.com/quay/claircore/datastore"
"github.com/quay/claircore/libvuln/driver"
"github.com/quay/claircore/pkg/microbatch"
"github.com/quay/zlog"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grouping.

datastore/enrichment.go Outdated Show resolved Hide resolved
Comment on lines 48 to 51
type deltaOpts struct {
deletedVulns []string
vulns []*claircore.Vulnerability
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike "options structs" as a style issue. I think a deleted Iter[string] would be fine in the "internal" signature. I can harass @crozzy into helping bash that into shape -- just pull the "delta" changes into a commit that we can mess with.

Copy link
Contributor Author

@jvdm jvdm Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hdonnay:

I dislike "options structs" as a style issue.

What are the pain points, in your opinion?

I think a deleted Iter[string] would be fine in the "internal" signature.

The internal method needs all vulns upfront when delete=true. It makes things a little less straightforward to change if I have to iterate over the vulns when delete=true to handle the deletion steps first (they come first in the update logic) to decide then not to iterate if delete=false and iterate if delete=false.

The proposed patch seems the most straightforward way to achieve what we want; Considering we are sunsetting this interface sometime in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, missed the fact that the iterator for this particular call can be called twice. For the internal API I think this assumption is fair. I am going to update this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the pain points, in your opinion?

They're generally used in APIs to retain source-level compatibility when adding API surface, but I think that's better implemented the way this package had it: add a new function that has the new arguments, then implement the "old" API on top of it. The other reason people seem to use them is to keep a function signature "manageable," but I find that means it's very easy to not notice when a function is gaining unique code paths. If a function has too many arguments, that's good signal of either irreducible complexity in the current design or a need to refactor the function, depending on whether refactoring is possible. Either way, an options struct sort of obscures what a 200-character long function signature makes plain.

I think they can be fine when an API really does want to make the user "fill out a form" or as an alternative to forcing a user to write boilerplate, but I don't think that applies here.

libvuln/jsonblob/jsonblob.go Show resolved Hide resolved
@@ -455,6 +530,16 @@ func (s *Store) DeltaUpdateVulnerabilities(ctx context.Context, updater string,
return uuid.Nil, nil
}

// UpdateEnrichmentsIter is unimplemented.
func (s *Store) UpdateEnrichmentsIter(_ context.Context, _ string, _ driver.Fingerprint, _ datastore.EnrichmentIter) (uuid.UUID, error) {
return uuid.Nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have this return something wrapping errors.ErrUnsupported.


// UpdateVulnerabilitiesIter is unimplemented.
func (s *Store) UpdateVulnerabilitiesIter(_ context.Context, _ string, _ driver.Fingerprint, _ datastore.VulnIter) (uuid.UUID, error) {
return uuid.Nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have this return something wrapping errors.ErrUnsupported.

@jvdm jvdm force-pushed the streaming branch 3 times, most recently from bb0672e to 0feda8e Compare April 19, 2024 23:23
@jvdm jvdm requested a review from hdonnay April 19, 2024 23:39
@@ -4,17 +4,22 @@ import (
"context"

"github.com/google/uuid"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace pedantry: this line should be here.

Comment on lines 3 to 8
// Iter is an iterator function that accepts a callback 'yield' to handle each
// iterator item. The consumer can signal the iterator to break or retry by
// returning an error. The iterator itself returns an error if the iteration
// cannot continue or was interrupted unexpectedly.
type Iter[T any] func(yield func(T) error) error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep this aligned to the rangefunc experiment syntax (which looks likely to land in Go 1.23), so I'd rather we remove the error returns:

Suggested change
// Iter is an iterator function that accepts a callback 'yield' to handle each
// iterator item. The consumer can signal the iterator to break or retry by
// returning an error. The iterator itself returns an error if the iteration
// cannot continue or was interrupted unexpectedly.
type Iter[T any] func(yield func(T) error) error
// Iter is an iterator function that accepts a callback 'yield' to handle each
// iterator item. The consumer can signal the iterator to break or retry by
// returning an error. The iterator itself returns an error if the iteration
// cannot continue or was interrupted unexpectedly.
type Iter[T any] func(yield func(T, error) bool)

Alternatively, if you feel strongly about always communicating the error, we could expand this a little:

Suggested change
// Iter is an iterator function that accepts a callback 'yield' to handle each
// iterator item. The consumer can signal the iterator to break or retry by
// returning an error. The iterator itself returns an error if the iteration
// cannot continue or was interrupted unexpectedly.
type Iter[T any] func(yield func(T) error) error
type Iter[T any] func(yield func(T, error) bool)
type IntoIter[T any] func(/*ctx?*/) (Iter[T], func() error)

(This idea stolen from here.)
And then the using code needs to call the second function (and the compiler should complain if that doesn't happen) and the implementer just needs to keep an error around to return for the second function to return.

Copy link
Contributor Author

@jvdm jvdm Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the context. Yielding errors suffice. I've updater the Iter interface. Here's an example of what users could do using the jsonblob.Iter:

count := 0
it := loader.NextIter()
ref, err = u.store.UpdateVulnerabilitiesIter(ctx, it.Updater, it.Fingerprint, func(yield 
func(*claircore.Vulnerability, error) bool) {
	for it.Next() && yield(it.Vulnerability, nil) {
		count++
	}
	if loader.Err() != nil {
		_ = yield(nil, loader.Err())
	}
})

@jvdm jvdm force-pushed the streaming branch 3 times, most recently from 1a52476 to 81f00ae Compare April 26, 2024 18:39
@jvdm jvdm requested a review from hdonnay April 26, 2024 18:39
Copy link
Member

@hdonnay hdonnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datastore changes look good, especially once the restriction on the iterator needing to restart is documented.

The jsonblob changes might make more sense if implemented on a new type, instead of having already tricksy code now have two access patterns intertwined.

Comment on lines +25 to +27
// UpdateVulnerabilitiesIter performs the same operation as
// UpdateVulnerabilities, but accepting an iterator function.
UpdateVulnerabilitiesIter(ctx context.Context, updater string, fingerprint driver.Fingerprint, vulnIter VulnerabilityIter) (uuid.UUID, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// UpdateVulnerabilitiesIter performs the same operation as
// UpdateVulnerabilities, but accepting an iterator function.
UpdateVulnerabilitiesIter(ctx context.Context, updater string, fingerprint driver.Fingerprint, vulnIter VulnerabilityIter) (uuid.UUID, error)
// UpdateVulnerabilitiesIter performs the same operation as
// UpdateVulnerabilities, but accepting an iterator function.
//
// The passed iterator must be able to iterate over the sequence multiple times.
UpdateVulnerabilitiesIter(ctx context.Context, updater string, fingerprint driver.Fingerprint, vulnIter VulnerabilityIter) (uuid.UUID, error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// The passed iterator must be able to iterate over the sequence multiple times.

That's not true for UpdateVulnerabilitiesIter. It would be for a potential DeltaUpdateVulnerabilitiesIter, but this patch does not aim to support that.

@@ -117,6 +144,54 @@ func (l *Loader) Entry() *Entry {
return l.e
}

// Iter returns an iterator for read all objects for the current update operation.
func (l *Loader) Iter() *iter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't return an unexported type.

It seems like it should be returning the equivalent of iter.Seq2[Entry, err], documenting that the returned Entry objects contain a single Enrichment or Vulnerability, and calling (*Loader).Iter() is mutually exclusive with calling (*Loader).Entry().

Alternatively, maybe just adding a StreamingLoader or IterLoader with a better/different interface instead of intertwining this with Loader would prevent the possibility of misuse. It could even return something else instead of an Entry to avoid the subtlety of that type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't return an unexported type.

My goal is to prevent code from outside the package to create iter objects. I need clarification of why this should not be an unexported type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, maybe just adding a StreamingLoader or IterLoader with a better/different interface instead of intertwining this with Loader would prevent the possibility of misuse.

That's a valid point.

Considering the Loader already uses the Next()/Entry()/Err() pattern, and considering we don't want to break backward compatibility, it seems less effort to continue using the Loader internals vs. a new type. We would want the existing Loader to use the StreamingLoader anyway (if not, it's one additional code path for this, and I would rather avoid).

So we have the current proposal, where the Loader supports two iteration patterns:

  1. Calling Next() + Entry() to retrieve batches per updater.
  2. Calling NextIter() + Iter() to retrieve individual items per updater.

Or your proposal:

It seems like it should be returning the equivalent of iter.Seq2[Entry, err], documenting that the returned Entry objects contain a single Enrichment or Vulnerability, and calling (*Loader).Iter() is mutually exclusive with calling (*Loader).Entry().

Let's ignore rangefunc for now -- which I don't think it worth doing anyway.

That option implies we would reuse Next() for both (1.) and (2.) in the previous proposal. It lets the user call Next() once and decide if either return the whole batch with Entry() or an iterator using Iter(). But, that's not possible unless I change the contract of Entry() dictating the data is already available when Next() == true. I would have to either return nil (which I don't think the current contract allows) or partial data (which could result in misuse before the caller checks things failed calling Loader.Err()).

A third option is to pass an option to the Loader, which would change the behavior of Next() and Entry() to be of NextIter() and Iter() (we would reuse Entry to iterate over individual items instead of populating the slices). That's convoluted.

So, the two iterator patterns (the current approach) balance out the need for functionality, minimal changes, and test coverage. I want to maintain backward compatibility, avoid regressions, and add a new access pattern, at the expense of adding two access patterns to the Loader API. But I think that's OK because that API will die soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting myself:

Let's ignore rangefunc for now -- which I don't think is worth doing anyway.

I lied. I played a bit with the idea today. The implementation turned out simpler, albeit a bit unusual, because I went ahead and used nested iterators. @hdonnay, let me know if you would prefer to pursue that instead d2531cb

By the way, we would have to keep the previous Next()/Entry()/Err() implementation as is for now, so there would be two paths for the same thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}
}
delVulns := func(yield func(string, error) bool) {
for _, s := range deletedVulns {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other implementations above used the index i instead. Curious why the change here?

}

func (s *MatcherStore) updateVulnerabilities(ctx context.Context, updater string, fingerprint driver.Fingerprint, vulns []*claircore.Vulnerability, deletedVulns []string, delta bool) (uuid.UUID, error) {
func (s *MatcherStore) updateVulnerabilities(ctx context.Context, updater string, fingerprint driver.Fingerprint, vulnIter datastore.VulnerabilityIter, delIter datastore.Iter[string]) (uuid.UUID, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delIter is a bit confusing to me because it makes me think it's short for delta instead of delete. Can we make it more explicit and call it deleteIter?

//
// Users are expected to call [*iter.Next()] to read and parse the next object,
// which can be retrieved using [iter.Vulnerability] or [iter.Enrichment] based
// on the update kind. Errors are reported in the [Loader.Err].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think other comments use a single space after a period

@@ -139,6 +167,7 @@ func (s *MatcherStore) updateVulnerabilities(ctx context.Context, updater string
return uuid.Nil, fmt.Errorf("failed to create update_operation: %w", err)
}

delta := delIter != nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used?

@@ -181,18 +210,20 @@ func (s *MatcherStore) updateVulnerabilities(ctx context.Context, updater string
}

if len(oldVulns) > 0 {
for _, v := range vulns {
vulnIter(func(v *claircore.Vulnerability, _ error) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this and delIter make assumptions about the passed in err that it's safe to ignore?

@jvdm jvdm requested a review from hdonnay April 30, 2024 05:30
@jvdm jvdm force-pushed the streaming branch 4 times, most recently from fb99454 to 4895118 Compare May 3, 2024 18:48
Copy link
Member

@hdonnay hdonnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked a little on Slack and decided split out the "provider" part of jsonblob.

All this LGTM.

@hdonnay
Copy link
Member

hdonnay commented May 3, 2024

/fast-forward

@github-actions github-actions bot merged commit 0831b93 into quay:main May 3, 2024
8 checks passed
@crozzy crozzy added needs-changelog Label for PRs that need a changelog note. and removed needs-changelog Label for PRs that need a changelog note. labels May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants