Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add contains(countIn:where:) to Sequence #53

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mdznr
Copy link
Contributor

@mdznr mdznr commented Dec 24, 2020

Description

This addition allows the caller to compare the count of elements matching a predicate on a sequence.

If you need to compare the count of a filtered sequence, using this method can give you a performance boost over filtering the entire collection, then comparing its count:

let popUpButton: NSPopUpButton
//…
popUpButton.isEnabled = popUpButton.itemArray.contains(moreThan: 1, where: { $0.isEnabled })

Detailed Design

A function named contains(countIn:where:) added as an extension to Sequence:

extension Sequence {
  public func contains<R: RangeExpression>(
    countIn rangeExpression: R,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool where R.Bound: FixedWidthInteger
}

Five small wrapper functions added to make working with different ranges easier and more readable at the call-site:

extension Sequence {
  public func contains(
    exactly exactCount: Int,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool
  
  public func contains(
    atLeast minimumCount: Int,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool
  
  public func contains(
    moreThan minimumCount: Int,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool
  
  public func contains(
    lessThan maximumCount: Int,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool
  
  public func contains(
    lessThanOrEqualTo maximumCount: Int,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool
}

These wrapper functions can make a check like:

let isAlone = contributors.contains(countIn: 1...1, where: { $0.firstName == "Matt" })

more readable:

let isAlone = contributors.contains(exactly: 1, where: { $0.firstName == "Matt" })

Documentation Plan

The documentation in Guides/ContainsCountWhere.md has been updated to reflect the new functions. Inline code documentation also exists for each of the functions.

Test Plan

  • Adds tests for each of the functions
  • Adds tests for each of the examples in the documentation
  • Adds tests for handling counts of 0 and empty sequences

Source Impact

This is only additive, no existing source should be impacted.

Checklist

  • I've added at least one test that validates that my change is working, if appropriate
  • I've followed the code style of the rest of the project
  • I've read the Contribution Guidelines
  • I've updated the documentation if necessary

@mdznr mdznr force-pushed the contains_count_where branch from ed6cdf3 to 5a19dd4 Compare December 24, 2020 00:23
@mdznr mdznr marked this pull request as ready for review December 24, 2020 00:26
@mdznr mdznr requested a review from natecook1000 December 24, 2020 00:26
@xwu
Copy link
Contributor

xwu commented Dec 24, 2020

If you need to compare the count of a filtered sequence, using this method can give you a performance boost over filtering the entire collection, then comparing its count:

Have you tried lazy filtering? In general, lazy would be the go-to solution when it comes to this category of issues.

These functions can return for some infinite sequences with some predicates

The lazy facilities offered in Swift allow this as well. It can certainly be useful, but as an API innovation this isn't really an advantage, because users still have no guarantee that using your function with an infinite sequence is going to return.


I do think you've hit on something that's hard to do on the first try, but I think we can probably work on something that's a little more general which can be composed easily. It would also help to pare down the APIs needed as the pain points in question don't rely specifically on testing if a number is contained in a range.

Let's start with the premise here. Suppose I want to test if a potentially or actually infinite sequence of numbers contains at least four even numbers:

let sequence = Swift.sequence(first: 1) { $0 + 1 }
let prefix = sequence.lazy.filter { $0.isMultiple(of: 2) }.prefix(4)
let count = prefix.reduce(0) { x, _ in x + 1 }
print(count == 4)

This is certainly unwieldy but not impossible to come up with.

The only reason the third line is necessary is that we're not using the count that's available for collection types, but as discussed above, what we're doing is only really usable in situations where we're not faced with a possibly infinite sequence. So the unhappy third line here isn't really an issue.

That leaves us with the second line as something that's possible to improve upon. Is there something there that can be a nice addition for Swift Algorithms? I think maybe. On the other hand, it's really just composing filter and prefix lazily, which isn't particularly difficult to come up with or use incorrectly...

@mdznr
Copy link
Contributor Author

mdznr commented Dec 24, 2020

@xwu Thanks for taking a look at this! I think I need to improve the documentation clarifying why this is useful. While the implementation of this function isn’t terribly complex, the addition of this function not only makes it easier to read at the call-site, but helps avoid performance problems with filter(where:).count or similar variants.

My example in this PR’s description is similar to the kinds of problems I see frequently. When trying to see if there are more than n elements that match a given criteria, the natural way to do this is with something like this:

let popUpButton: NSPopUpButton

// …

// Enable the pop up button if the user actually has more than one choice.
popUpButton.isEnabled = popUpButton.itemArray.filter({ $0.isEnabled }).count > 1

This not only creates an intermediary data structure, which can be expensive if you have a lot of elements, but will call the predicate on all n items, even if after just a few, there has already been 2 or more enabled menu items.

So the next step is to use lazy for those Swift programmers familiar with it. While it’s a fantastic feature, it still doesn’t solve all of our problems in this example.

popUpButton.isEnabled = popUpButton.itemArray.lazy.filter({ $0.isEnabled }).count > 1

This still results in n comparisons.

Using reduce still results in n comparisons since it doesn’t know when to stop counting because the comparison to the count is done after its finished counting:

popUpButton.isEnabled = popUpButton.itemArray.lazy.filter({ $0.isEnabled }).reduce(0, { x, _ in x + 1 }) > 1

As far as I’m aware, there isn’t a combination of lazy, filter, prefix, and/or reduce that can avoid counting higher than necessary, resulting in more comparisons of the predicate, and often more memory.

This function counts (from the beginning) the absolute minimum number necessary to definitively return true or false when comparing the count to the given number.

popUpButton.isEnabled = popUpButton.itemArray.contains(moreThan: 1, where: { $0.isEnabled })

A sample project demonstrates this issue by counting the number of times the predicate is called for each of the alternative methods:

// Swap out this line for the one below and it will result in an infinite loop for #1, #2, and #3
let sequence = 1 ... 1_000
//let sequence = Swift.sequence(first: 1, next: { $0 + 1 })

// Keep track of how many times `predicate` is called. This is reset back to before each check for count.
var comparisons: Int = 0

let predicate = { (number: Int) -> Bool in
	comparisons += 1
//	print("predicate: \(number)")
	return number.isMultiple(of: 2)
}

var containsAtLeastFour: Bool = false

// 1. `filter.count`
comparisons = 0
containsAtLeastFour = sequence.filter(predicate).count >= 4
print("comparisons: \(comparisons); filter.count: \(containsAtLeastFour)")

// 2. `lazy.filter.count`
comparisons = 0
containsAtLeastFour = sequence.lazy.filter(predicate).count >= 4
print("comparisons: \(comparisons); lazy.filter.count: \(containsAtLeastFour)")

// 3. `lazy.filter.prefix.count`
comparisons = 0
containsAtLeastFour = sequence.lazy.filter(predicate).prefix(4).count >= 4
print("comparisons: \(comparisons); lazy.filter.prefix.count: \(containsAtLeastFour)")

// 4. `lazy.filter.prefix.reduce`
comparisons = 0
containsAtLeastFour = sequence.lazy.filter(predicate).prefix(4).reduce(0, { x, _ in x + 1 }) >= 4
print("comparisons: \(comparisons); lazy.filter.prefix.reduce: \(containsAtLeastFour)")

// 5. `contains(atLeast:where:)`
comparisons = 0
containsAtLeastFour = sequence.contains(atLeast: 4, where: predicate)
print("comparisons: \(comparisons); contains(atLeast:where:): \(containsAtLeastFour)")

// prints
// comparisons: 1000; filter.count: true
// comparisons: 1000; lazy.filter.count: true
// comparisons: 20; lazy.filter.prefix.count: true
// comparisons: 20; lazy.filter.prefix.reduce: true
// comparisons: 8; contains(atLeast:where:): true

All of the methods above except for contains(atLeast:where:) make more comparisons than necessary.


A more succinct way to describe this function is:
It’s like the contains(where:) function on Sequence on the standard library, except that it can accept any number (or range of numbers) for comparison instead of just 0.

@xwu
Copy link
Contributor

xwu commented Dec 24, 2020

// comparisons: 20; lazy.filter.prefix.count: true
// comparisons: 20; lazy.filter.prefix.reduce: true
// comparisons: 8; contains(atLeast:where:): true

All of the methods above except for contains(atLeast:where:) make more comparisons than necessary.

It's not immediately clear to me why the lazy.filter.prefix versions test the predicate twice for each element (and three times when the element is 2!). That seems suboptimal, but I would like to see that addressed holistically (that is, we should fix that for all users of these lazy facilities if it's a fixable problem).

@timvermeulen
Copy link
Contributor

I haven't given these APIs much thought yet, but here's my first impression.

It's not immediately obvious to me why all these methods need to take a predicate if .lazy.filter { ... } achieves the same thing. I could easily see this functionality being useful without a predicate, so could/should we just leave it out? Or do you suspect that most uses of these methods will want to only count the elements that satisfy some condition?

Something else I noticed about these methods is that none of them preserve the information of the exact count of the sequence, when this is known. The way I see it, the most fundamental missing functionality that these methods cover can be captured by

extension Sequence {
  /// Returns the count of the sequence as long as it does not exceed `limit`, and `nil` otherwise.
  func count(notExceeding limit: Int) -> Int?
}

The whole prefix(limit + 1).reduce(0, { count, _ in count + 1 }) == limit + 1 dance preserves this information as well, FWIW.

And finally, the method that takes a RangeExpression seems so versatile to me that I'm not sure whether we need some of the other methods at all, e.g. is atMost: 10 really better than passing ...10?

@mdznr mdznr marked this pull request as draft December 24, 2020 18:44
@mdznr mdznr removed the request for review from natecook1000 December 24, 2020 18:45
@mdznr
Copy link
Contributor Author

mdznr commented Dec 25, 2020

Or do you suspect that most uses of these methods will want to only count the elements that satisfy some condition?

The original desire was to have a function like contains(where:) but be able to pass in a different range, other than the implied 1.... Where contains(where:) returns as soon as it finds at least one element meeting the given criteria, this function returns as soon as it finds any given number.

Today, if you wanted to check if a sequence contains at least one element matching a certain criteria, you’d use:

sequence.contains(where: predicate)

If you wanted to check if a sequence contains more than one element matching a certain criteria? The most common solution I’ve seen in practice is something like:

sequence.filter(predicate).count > 1

But the performance here is less than ideal, filtering all n elements.

Programmers with a bit more familiarity with Swift will avoid the intermediary data structure from filter by using lazy, but it still filters through all n items.

If that programmer is in a priority-oriented mindset, they may use prefix in an attempt to avoid it counting higher than necessary.

I didn’t find it obvious as to why using .reduce(0, { x, _ in x + 1 }) would have a different behavior than count. I’ve been diving into this the past couple days and I’m still struggling to fully understand it (more on that below). String all these chained functions together, and a comparison against 2 isn’t anywhere near as easy as a quick check against 1 is using contains(where:):

sequence.lazy.filter(predicate).prefix(2).reduce(0, { x, _ in x + 1 }) == 2

extension Sequence {
  /// Returns the count of the sequence as long as it does not exceed `limit`, and `nil` otherwise.
  func count(notExceeding limit: Int) -> Int?
}

This is a really interesting idea. I had thought of having a kind of lazy count implementation that could return a value that can be compared to an integer but wouldn’t count higher than necessary to be able to return true or false so that you could write something more naturally like:

sequence.count(where: predicate) > 1

But it would be smart enough to do the least amount of computation necessary to return from that > comparison. But having count(where:) return something other than an Int (in order to have custom implementations for the comparison operators) feels weird.


And finally, the method that takes a RangeExpression seems so versatile to me that I'm not sure whether we need some of the other methods at all, e.g. is atMost: 10 really better than passing ...10?

I think you’re right that those extra five functions aren’t necessary, but I do find reading those are a bit nicer than using a range directly, particularly for ranges like 1...1.


I ran some tests to see the behavior of these various approaches and what I found was surprising.

I created two alternate implementations for contains(countIn:where:) that internally use lazy.filter.prefix and count/reduce(). Both of which end up calling predicate n times. However, using lazy.filter.prefix inline instead didn’t call predicate n, which is strange. Even so, it calls 2.5 times more than the contains(countIn:where:) implementation in this PR.

extension Sequence {
  @inlinable
  public func _alternateImpl_contains<R: RangeExpression>(
    countIn rangeExpression: R,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool where R.Bound == Int {
    let range = rangeExpression.relative(to: R.Bound.zero..<R.Bound.max)
    let threshold: R.Bound
    if range.upperBound < R.Bound.max {
      threshold = range.upperBound
    } else {
      threshold = range.lowerBound
    }
    return try self.lazy.filter(predicate).prefix(threshold).count >= threshold
  }
}

extension Sequence {
  @inlinable
  public func _alternateImpl2_contains<R: RangeExpression>(
    countIn rangeExpression: R,
    where predicate: (Element) throws -> Bool
  ) rethrows -> Bool where R.Bound == Int {
    let range = rangeExpression.relative(to: R.Bound.zero..<R.Bound.max)
    let threshold: R.Bound
    if range.upperBound < R.Bound.max {
      threshold = range.upperBound
    } else {
      threshold = range.lowerBound
    }
    return try self.lazy.filter(predicate).prefix(threshold).reduce(0, { x, _ in x + 1 }) >= threshold
  }
}
let sequence = Array(1 ... 1_000)
//let sequence = 1 ... 1_000
//let sequence = 1...
//let sequence = Swift.sequence(first: 1, next: { $0 + 1 })

// Keep track of how many times `predicate` is called. This is reset back to before each check for count.
var comparisons: Int = 0

let predicate = { (number: Int) -> Bool in
  comparisons += 1
//  print("predicate: \(number)")
  return number.isMultiple(of: 5)
}

var containsAtLeastFour: Bool = false

// 1. `filter.count`
comparisons = 0
containsAtLeastFour = sequence.filter(predicate).count >= 4
print("comparisons: \(comparisons); filter.count: \(containsAtLeastFour)")

// 2. `lazy.filter.count`
comparisons = 0
containsAtLeastFour = sequence.lazy.filter(predicate).count >= 4
print("comparisons: \(comparisons); lazy.filter.count: \(containsAtLeastFour)")

// 3. `lazy.filter.prefix.count`
comparisons = 0
containsAtLeastFour = sequence.lazy.filter(predicate).prefix(4).count >= 4
print("comparisons: \(comparisons); lazy.filter.prefix.count: \(containsAtLeastFour)")

// 4. `lazy.filter.prefix.reduce`
comparisons = 0
containsAtLeastFour = sequence.lazy.filter(predicate).prefix(4).reduce(0, { x, _ in x + 1 }) >= 4
print("comparisons: \(comparisons); lazy.filter.prefix.reduce: \(containsAtLeastFour)")

// 5. `contains(atLeast:where:)`
comparisons = 0
containsAtLeastFour = sequence.contains(countIn: 4..., where: predicate)
print("comparisons: \(comparisons); contains(atLeast:where:): \(containsAtLeastFour)")

// 6. `_alternateImpl_contains(atLeast:where:)`
comparisons = 0
containsAtLeastFour = sequence._alternateImpl_contains(countIn: 4..., where: predicate)
print("comparisons: \(comparisons); _alternateImpl_contains(atLeast:where:): \(containsAtLeastFour)")

// 7. `_alternateImpl2_contains(atLeast:where:)`
comparisons = 0
containsAtLeastFour = sequence._alternateImpl2_contains(countIn: 4..., where: predicate)
print("comparisons: \(comparisons); _alternateImpl2_contains(atLeast:where:): \(containsAtLeastFour)")
// prints
// comparisons: 1000; filter.count: true
// comparisons: 1000; lazy.filter.count: true
// comparisons: 50; lazy.filter.prefix.count: true
// comparisons: 50; lazy.filter.prefix.reduce: true
// comparisons: 20; contains(atLeast:where:): true
// comparisons: 1000; _alternateImpl_contains(atLeast:where:): true
// comparisons: 1000; _alternateImpl2_contains(atLeast:where:): true

@kylemacomber
Copy link

cc @xwu

It's not immediately clear to me why the lazy.filter.prefix versions test the predicate twice for each element (and three times when the element is 2!). That seems suboptimal, but I would like to see that addressed holistically (that is, we should fix that for all users of these lazy facilities if it's a fixable problem).

Partly it's because LazyFilterCollection's startIndex is O(n):

  /// The position of the first element in a non-empty collection.
  ///
  /// In an empty collection, `startIndex == endIndex`.
  ///
  /// - Complexity: O(*n*), where *n* is the ratio between unfiltered and
  ///   filtered collection counts.
  @inlinable // lazy-performance
  public var startIndex: Index {
    var index = _base.startIndex
    while index != _base.endIndex && !_predicate(_base[index]) {
      _base.formIndex(after: &index)
    }
    return index
  }

We see prefix actually calls startIndex twice:

  @inlinable
  public __consuming func prefix(_ maxLength: Int) -> SubSequence {
    _precondition(
      maxLength >= 0,
      "Can't take a prefix of negative length from a collection")
    let end = index(startIndex,
      offsetBy: maxLength, limitedBy: endIndex) ?? endIndex
    return self[startIndex..<end]
  }

And also prefix and count (or reduce) duplicate the traversal.

In the example looking filtering for even numbers in 1...1000, here's how we get 20 comparisons:

  • prefix
    • index(_:offsetBy:limitedBy:)
      • startIndex is 2 comparisons
      • formIndex(after:) is 2 comparisons and is called 4 times -> 8 comparisons
    • self[startIndex..<end]
      • startIndex is 2 comparisons
  • count
    • formIndex(after:) is 2 comparisons and is called 4 times -> 8 comparisons

@mdznr mdznr force-pushed the contains_count_where branch from 5a19dd4 to 79b7e65 Compare January 7, 2021 04:28
@mdznr mdznr force-pushed the contains_count_where branch 2 times, most recently from 77d9b3e to a0c29a2 Compare January 16, 2021 19:10
@xwu
Copy link
Contributor

xwu commented Jan 17, 2021

@kylemacomber Thanks—that's informative.

These APIs are really meant to be composable, and I think we need to think hard about what APIs we offer strictly to squeeze out performance gains and how much that can justify the burden of adding, maintaining, and having users learn a new API.

I would propose that any APIs added for this purpose should themselves be deemed useful for further composition and, indeed, be more efficient building blocks than what we currently have. It's also worth considering whether the performance gained is a constant factor (as here) or, say, takes something from O(n^2) to O(n).

@timvermeulen
Copy link
Contributor

It's also worth considering whether the performance gained is a constant factor (as here) or, say, takes something from O(n^2) to O(n).

Only if you compare it to the prefix version that I don't expect many people to come up with — something more straight-forward like .count > k can still be slower by more than a constant factor.

@CTMacUser
Copy link
Contributor

I thought that the Sequence level is the wrong approach. Maybe we could start from the IteratorProtcol level. I wrote a core routine as an extension to iterators. However, we generally don't put new API onto IteratorProtcol; we usually take an iterator as an inout argument or as a return value. So I moved the iterator approach as a return value:

extension Sequence {

    /// Returns an iterator into the sequence that is just past the given number
    /// of elements that match the given predicate.
    ///
    /// The following example prints the value after the fifth even element of the range.
    ///
    ///     let range = -3..<20
    ///     if var iterator = range.skippingOver(count: 5, where: { $0.isMultiple(of: 2) }) {
    ///         if let nextValue = iterator.next() {
    ///             print(nextValue)
    ///         } else {
    ///             print("The range ended at its fifth even number.")
    ///         }
    ///     } else {
    ///         print("The range had less than five even numbers.")
    ///     }
    ///
    ///     // Prints: "7"
    ///
    /// - Precondition:
    ///   - `count >= 0`.
    ///   - Either the sequence is finite or at least `count` elements in the
    ///     sequence match `predicate`.
    ///
    /// - Parameters:
    ///   - count: The number of occurrences of matching elements to find before
    ///     stopping.
    ///   - predicate: A closure that takes an element of the sequence as its
    ///     argument and returns a Boolean value indicating whether the element
    ///     is a match.
    /// - Returns: If the sequence has fewer than `count` elements that match
    ///   `predicate`, then `nil`.  Otherwise, an iterator that starts with the
    ///   element after the last match.  If that last match was also the last
    ///   element, then the iterator's virtual sequence will be empty.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the sequence.
    public func skippingOver(
        count: Int,
        where predicate: (Element) throws -> Bool
    ) rethrows -> Iterator? {
        var result = makeIterator()
        counter: for _ in 0..<count {
            while let element = result.next() {
                if try predicate(element) {
                    continue counter
                }
            }
            return nil
        }
        return result
    }

}

extension Sequence where Element: Equatable {

    /// Returns an iterator into the sequence that is just past the given number
    /// of elements that equal the given element.
    ///
    /// The following example prints the character after the third "a".
    ///
    ///     let sample = "Mary had a little lamb"
    ///     if var iterator = sample.skippingOver(count: 3, of: "a") {
    ///         if let nextCharacter = iterator.next() {
    ///             print("The next character is '\(nextCharacter)'.")
    ///         } else {
    ///             print("The third occurence of 'a' is at the last position.")
    ///         }
    ///     } else {
    ///         print("The occurrence count of 'a' is less than 3.")
    ///     }
    ///
    ///     // Prints: "The next character is ' '."
    ///
    /// - Precondition:
    ///   - `count >= 0`.
    ///   - Either the sequence is finite or at least `count` elements in the
    ///     sequence equal `element`.
    ///
    /// - Parameters:
    ///   - count: The number of occurrences of matching elements to find before
    ///     stopping.
    ///   - element: An element to search for in the sequence.
    /// - Returns: If the sequence has fewer than `count` elements that equal
    ///   `element`, then `nil`.  Otherwise, an iterator that starts with the
    ///   element after the last match.  If that last match was also the last
    ///   element, then the iterator's virtual sequence will be empty.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the sequence.
    @inlinable
    public func skippingOver(count: Int, of element: Element) -> Iterator? {
        return skippingOver(count: count, where: { $0 == element })
    }

}

Maybe this could form a building block to a better API.

@mdznr mdznr force-pushed the contains_count_where branch from a0c29a2 to d37b94b Compare January 29, 2021 23:48
@CTMacUser
Copy link
Contributor

If you're naming one of the auxiliary methods contains(atLeast:where:), you should rename contains(lessThanOrEqualTo:where:) to contains(atMost:where:) instead.

@mdznr mdznr force-pushed the contains_count_where branch from d37b94b to a1d95d5 Compare February 16, 2021 18:14
@natecook1000
Copy link
Member

Sorry for not reviewing this sooner! I'm wondering what this looks like if we approach it from the perspective of count(where:) rather than contains(where:). I think a count(upTo:where:) method would both preserve the work done and should act as a building block for answering the rest of the calculations here.

looking for contains count
at least n contains(atLeast: n) { ... } count(upTo: n) { ... } == n
more than n contains(moreThan: n) { ... } count(upTo: n + 1) { ... } > n
exactly n contains(exactly: n) { ... } count(where: { ... }) == n
at most n contains(lessThan: n) { ... } count(upTo: n) { ... } < n

The range-based contains method could be a nice convenience, but it looks like count(upTo:) is more of the primitive method here. My hunch is that relying on a lazy.prefix approach is going to require more optimization than we can depend on to eliminate multiple traversals.

@mdznr mdznr force-pushed the contains_count_where branch from a1d95d5 to 65e90ea Compare March 13, 2021 01:09
@mdznr mdznr force-pushed the contains_count_where branch from 65e90ea to 4b68aa8 Compare April 20, 2021 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants