Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter out packages that are owned by OS packages (ownership overlap) #1373

Closed
wagoodman opened this issue Jun 29, 2023 · 10 comments · Fixed by #1387
Closed

Filter out packages that are owned by OS packages (ownership overlap) #1373

wagoodman opened this issue Jun 29, 2023 · 10 comments · Fixed by #1387
Assignees
Labels
enhancement New feature or request

Comments

@wagoodman
Copy link
Contributor

What would you like to be added:
Today we conditionally filter out packages that are owned by other packages before attempting to find matches

grype/grype/pkg/package.go

Lines 104 to 144 in bc93a96

func removePackagesByOverlap(catalog *pkg.Collection, relationships []artifact.Relationship) *pkg.Collection {
byOverlap := map[artifact.ID]artifact.Relationship{}
for _, r := range relationships {
if r.Type == artifact.OwnershipByFileOverlapRelationship {
byOverlap[r.To.ID()] = r
}
}
out := pkg.NewCollection()
for p := range catalog.Enumerate() {
r, ok := byOverlap[p.ID()]
if ok {
from, ok := r.From.(pkg.Package)
if ok && excludePackage(p, from) {
continue
}
}
out.Add(p)
}
return out
}
func excludePackage(p pkg.Package, parent pkg.Package) bool {
// NOTE: we are not checking the name because we have mismatches like:
// python 3.9.2 binary
// python3.9 3.9.2-1 deb
// If the version is not effectively the same, keep both
if !strings.HasPrefix(parent.Version, p.Version) {
return false
}
// filter out only binary pkg, empty types, or equal types
if p.Type != pkg.BinaryPkg && p.Type != "" && p.Type != parent.Type {
return false
}
return true
}
.

Ideally we would remove packages that overlap in ownership with other packages when the parent package is an OS package.

Why is this needed:
This should remove a class of false positives since OS distributions typically manage back ports of fixes and are the source of truth for the logical package, thus should override any vulnerability match claims for the contained packages (e.g. if an RPM contains a python package and we find a vuln for the python package and not the RPM, we should drop this).

Dev notes:

  • we do something like this today (see the above link) so it would be ideal to stay consistent here: try to filter out the package based on artifact/relationship rules before we do any matching.
@wagoodman wagoodman added the enhancement New feature or request label Jun 29, 2023
@kzantow
Copy link
Contributor

kzantow commented Jun 29, 2023

This is related to anchore/syft#931

@willmurphyscode
Copy link
Contributor

Found an example case where I'm not sure the RPM package should have won over the python package: Language packages that are available from the OS package manager. For example, in the branch where I have this feature partly built, we can see: pkg:rpm/rhel/[email protected]?arch=noarch&upstream=python-six-1.11.0-8.el8.src.rpm&distro=rhel-8.3 is preferred over pkg:pypi/[email protected].

It's likely there because someone ran yum install -y python3-six, but any vulnerabilities reported against the PyPI version of the package should probably be reported by grype, so it's not obvious in this case that preferring the OS package is the right thing to do.

@westonsteimel
Copy link
Contributor

westonsteimel commented Jul 21, 2023

Actually these are exactly the cases where we want the os packages to win out by default because rhel backport fixes that PyPI won't know about.

@westonsteimel
Copy link
Contributor

Now something which would be even more accurate here would be to somehow deselect PyPI vulns specifically when a RHEL one overrules it, but that gets much more complicated because right now grype only knows how to fetch vulnerabilities that a package may be affected by and that approach would require fetching ones that a package is not affected by. It'd also require looking at the relatedVulnerability data to convert identifiers so that you could compare between say GHSA and CVE. It might be doable though, but probably needs alot more thought

@westonsteimel
Copy link
Contributor

Also I don't think we could do that approach today since grypedb might not even have records when something is marked as not affected or withdrawn, and those would be important to capture somewhere if attempting the more advanced approach

@willmurphyscode
Copy link
Contributor

Thanks for the additional context @westonsteimel. Based on that additional info, I'll keep trying to get #1387 merged. I may have found an interesting edge case with the quality gate, which I'll discuss on the PR.

@willmurphyscode
Copy link
Contributor

@westonsteimel and I discussed this issue this morning, and I think there are basically 2 approaches, which I'll explain with an example issue:

Example situation

syft -q docker.io/anchore/test_images@sha256:c8d664b0e728d52f57eeb98ed1899c16d3b265f02ddfb41303d7a16c31e0b0f1 | grep guava
guava                                    20.0                                            java-archive
guava20                                  20.0-8.module+el8+5161+5cac467c                 rpm

In this case, syft found 2 packages with guava in the name: the JAR guava, and the rpm guava20. They have slightly different versions. The guava20 RPM has a more specific version number, because we have the RPM installed package database to look at. But these are the same package; the Java cataloger found the JAR, and the RPM cataloger found the RPM.

Let's see what grype thinks:

grype -q docker.io/anchore/test_images@sha256:c8d664b0e728d52f57eeb98ed1899c16d3b265f02ddfb41303d7a16c31e0b0f1 | grep guava
guava                                    20.0                                                                                               java-archive  CVE-2023-2976        High
guava                                    20.0                                           24.1.1                                              java-archive  GHSA-mvr2-9pj6-7w5j  Medium
guava                                    20.0                                           32.0.0                                              java-archive  GHSA-7g45-4rm6-3mm3  Medium
guava                                    20.0                                                                                               java-archive  CVE-2018-10237       Medium
guava                                    20.0                                                                                               java-archive  GHSA-5mg8-w23w-74h3  Low
guava                                    20.0                                                                                               java-archive  CVE-2020-8908        Low

We can see that all the vulnerabilities are reported against the JAR, not the RPM. So there are two possibilities: either the RPM has a distro-specific fix that we know about, or the distro doesn't report the package as vulnerable. This brings us to the two approaches:

First approach: remove before matching

In this approach, we remove the JAR guava@20 before matching. Then none of the above vulnerabilities are found. If we believed that the distro's vulnerability feed were exhaustive, this would be a great approach. It's what's suggested earlier in this issue, and it's certainly easier to implement.

However, there are 3 possible situations:

  1. Distro feed had no record - we don't know whether we should have removed the JAR
  2. Distro feed was vulnerable - removing the JAR was fine, since we're report against the RPM, which has more specific version info anyway.
  3. Distro feed had a specific fix information - removing the JAR was correct, since the report against the JAR was a false positive caused by not considering the OS specific fix.

Second possible approach

In the second approach, we try to figure out which of the 3 possible situations we're in, and act accordingly:

  1. Distro feed has no record - leave the vuln reported against the JAR, because a false positive is better than a false negative.
  2. Distro feed shows OS package is vulnerable - remove the match against the JAR, because the vulnerability is redundant
  3. Distro feed shows OS package is fixed - remove the match against the JAR, because it is a known false positive.

I'm going to take some time to see how hard implementing the second approach is.

@willmurphyscode
Copy link
Contributor

I've done some research on the second approach, and it requires changing grype a bit.

In order to be able to correctly dedupe in this way, we need a place in the code where we have the following information:

  1. We know whether a given package is owned by an OS package, in other words, the SBOM's relationships are in scope
  2. We know whether the OS package is vulnerable, in other words, matching has taken place
  3. If the OS package is not vulnerable, we know whether it's because distro has no vulnerability record, or because the distro reports the package as not affected or patched. In other words, the vulnerability store is in scope*.

Right now, the only place where all this information is in scope is at

remainingMatches, ignoredMatches, err := vulnMatcher.FindMatches(packages, pkgContext)
, which has the matches, the relationships (on the SBOM), and the vulnerability store. However, it seems strange to have additional filtering of matches after the main vulnerabilityMatcher.FindMatches call, so I think the vulnerability matcher struct (grype.VulnerabilityMatcher) should be updated to have a pointer to the SBOM. The code should check for nil on the vulnerability matcher struct, since library users might not know to set the new field.

I'll take this specific approach and see how far it gets me:

  1. Change
    type VulnerabilityMatcher struct {
    Store store.Store
    Matchers []matcher.Matcher
    IgnoreRules []match.IgnoreRule
    FailSeverity *vulnerability.Severity
    NormalizeByCVE bool
    }
    to have a pointer to a syft sbom.SBOM on it.
  2. Change
    func (m *VulnerabilityMatcher) FindMatches(pkgs []pkg.Package, context pkg.Context) (*match.Matches, []match.IgnoredMatch, error) {
    var ignoredMatches []match.IgnoredMatch
    matches := matcher.FindMatches(m.Store, context.Distro, m.Matchers, pkgs)
    matches, ignoredMatches = m.applyIgnoreRules(matches)
    if m.NormalizeByCVE {
    normalizedMatches := match.NewMatches()
    for originalMatch := range matches.Enumerate() {
    normalizedMatches.Add(m.normalizeByCVE(originalMatch))
    }
    // we apply the ignore rules again in case any of the transformations done during normalization
    // regresses the results (relative to the already applied ignore rules). Why do we additionally apply
    // the ignore rules before normalizing? In case the user has a rule that ignores a non-normalized
    // vulnerability ID, we wantMatches to ensure that the rule is honored.
    matches, ignoredMatches = m.applyIgnoreRules(normalizedMatches)
    }
    var err error
    if m.FailSeverity != nil && HasSeverityAtOrAbove(m.Store, *m.FailSeverity, matches) {
    err = grypeerr.ErrAboveSeverityThreshold
    }
    return &matches, ignoredMatches, err
    }
    to use that reference to deduplicate as described in "Second possible approach" in my last comment

@willmurphyscode
Copy link
Contributor

Met with @wagoodman to discuss this, and we had some thoughts.

First, there are 3 logical states grype can believe about a given package/CVE combination:

  1. The package is vulnerable to the CVE
  2. There's not evidence the package is vulnerable to the CVE (e.g. search miss in Grype DB)
  3. The package is NOT vulnerable to the CVE, e.g. the distro feed records a specific patch.

Right now, grype's searching doesn't distinguish between cases 2 and 3. Case 1 gets printed, and cases 2 and 3 get dropped. However, logic distinguishing cases 2 and 3 would be very helpful.

Let's call case 3 an "anti-match." We could have a situation, for example, where someone has run yum install python3-urllib3 on RHEL8. The Python cataloger will find the PyPI package "urllib3" at some version, and the RPM cataloger will find the package "python3-urllib3", and the ownership relationship will show that python3-urllib3 owns urllib3. The python matcher will find a positive (case 1) against urllib3, because the PyPI version is listed as vulnerable. But the python3-urllib3 will find an anti-match, because the vulnerabilitiy feed from RHEL proactively documents a distro-provided patch. Then we can say, "Since this is an RPM and we're on RedHat, the anti-match from the distro feed overrides the match from the generic Python data, so this is not a positive."

This is the same logic as the second approach in #1373 (comment), but it provides a new entity that can be used in grype. So the functions in the search package will return a slice of known matches and a slice of known anti-matches, and the present of anti-matches can be used to perform business logic, especially for lowering the false positive rate. Additionally, we could display packages from grype that are known to be patched or not affected, which might save time to security teams.

@willmurphyscode
Copy link
Contributor

I think rather than block this issue on #1426, we might do the following:

  1. Have an allow-list of vulnerability feeds that are believed to be "complete." That is, vulnerability feeds that report both fixed and non-fixed vulnerabilities
  2. For OS packages that are from a distro operating one of those feeds, remove the non-OS package from the ownership-by-overlap relationship before matching.

For example, if we're scanning an RHEL8 image, and we have an RPM that owns a Python package, go ahead and remove the Python package before matching, since the RHEL8 feed will have better information on whether the package is vulnerable than GHSA will. But if we're scanning an Alpine image and there's an APK that owns a Python package, leave the python package under consideration for matching, because Alpine only reports fixes, and so the matching GHSA or NVD data directly is necessary to detect vulnerabilities that Alpine hasn't released a patch for yet. Then, when #1426 is implemented, we can remove the allow-list in favor of the logic described in the previous comment.

Next steps are:

  1. Develop this allow list
  2. Update https://github.com/anchore/grype/pull/1387/files#diff-c1eeca4b68c3cef87a8131f596f5e598d5b9a56704624ff94afd7db799e4193aR142 to only filter packages if the distro is allowed by that allow-list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants