
Corrected cubical cover computation #242

Open · wants to merge 1 commit into master
Conversation

@erooke (Contributor) commented Oct 11, 2021

The cubical cover computation is implemented incorrectly. It starts with the tests having the wrong expected behavior here:

```python
def test_perc_overlap(self, CoverClass):
    """
    2 cubes with 50% overlap and a range of [0,1] should lead to two cubes with intervals:
        [0, .75]
        [.25, 1]
    """
```

If you cover the interval [0, 1] with 2 cubes with 50% overlap, the cover you should get is [0, 2/3] and [1/3, 1]. The cover [0, 0.75], [0.25, 1] has a percent overlap of (0.75 - 0.25)/0.75 = 0.5/0.75 = 2/3. It is worth noting that this test actually generates the cover [-0.25, 0.75], [0.25, 1.25], which does have the correct overlap but has increased the range of the cover.
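As a quick sanity check (assuming "percent overlap" means the length of the intersection of two adjacent intervals divided by the interval length, which is how the corrected formula below treats it):

```python
def perc_overlap(a, b):
    """Fraction of interval `a` covered by its intersection with `b`."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (hi - lo) / (a[1] - a[0])

# Cover claimed by the docstring: the overlap is 2/3, not 1/2.
print(perc_overlap((0, 0.75), (0.25, 1)))
# Proposed cover of [0, 1]: the overlap is exactly 1/2.
print(perc_overlap((0, 2/3), (1/3, 1)))
```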

The source of the error is this line:

```python
# |range| / (2n ( 1 - p))
with np.errstate(divide='ignore'):
    radius = ranges / (2 * (n_cubes) * (1 - perc_overlap))
```

There is an off-by-one error causing kmapper to count an intersection which doesn't exist. The computation should be `ranges / (2 * (n_cubes - (n_cubes - 1) * perc_overlap))`.
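A sketch of the corrected computation on the [0, 1] example (the even placement of centers between the two endpoints is my assumption about how the cubes get laid out, not code from kmapper):

```python
import numpy as np

def corrected_radius(ranges, n_cubes, perc_overlap):
    # n cubes of width w with pairwise intersections of length p*w tile a
    # range of length |range| when n*w - (n - 1)*p*w = |range|, hence:
    return ranges / (2 * (n_cubes - (n_cubes - 1) * perc_overlap))

# Two cubes, 50% overlap, range [0, 1]:
r = corrected_radius(1.0, 2, 0.5)        # 1/3
centers = np.linspace(0 + r, 1 - r, 2)   # first/last centers one radius in
cubes = [(c - r, c + r) for c in centers]
print(cubes)  # approximately [(0, 2/3), (1/3, 1)]
```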

There are other tests which enforce this incorrect behavior as well.

```python
def test_radius_dist(self):
    test_cases = [
        {"cubes": 1, "range": [0, 4], "overlap": 0.4, "radius": 10.0 / 3},
        {"cubes": 1, "range": [0, 4], "overlap": 0.9, "radius": 20.0},
        {"cubes": 2, "range": [-4, 4], "overlap": 0.5, "radius": 4.0},
        {"cubes": 3, "range": [-4, 4], "overlap": 0.5, "radius": 2.666666666},
        {"cubes": 10, "range": [-4, 4], "overlap": 0.5, "radius": 0.8},
        {"cubes": 10, "range": [-4, 4], "overlap": 1.0, "radius": np.inf},
    ]
```

This test in particular asserts that the cover should behave strangely. The first two test cases imply that a single-element cover should be affected by the overlap percentage chosen, and the last test is essentially codifying the fact that the current implementation divides by zero when perc_overlap is 1.
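The single-cube anomaly is easy to see by plugging the first two test cases into the current formula (reproduced here as a standalone function for illustration):

```python
def current_radius(ranges, n_cubes, perc_overlap):
    # The formula currently in kmapper: |range| / (2n(1 - p))
    return ranges / (2 * n_cubes * (1 - perc_overlap))

# One cube covering [0, 4] should always have radius 2, yet:
print(current_radius(4, 1, 0.4))  # ~10/3, as the test expects
print(current_radius(4, 1, 0.9))  # ~20
```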

This pull request updates the tests to check for the correct behavior and updates the implementation to pass those tests. It also skips this test

```python
def test_equal_entries(self):
    settings = {"cubes": 10, "overlap": 0.5}
    # uniform data:
    data = np.arange(0, 100)
    data = data[:, np.newaxis]
    lens = data
    cov = Cover(settings["cubes"], settings["overlap"])
    # Prefix'ing the data with an ID column
    ids = np.array([x for x in range(lens.shape[0])])
    lens = np.c_[ids, lens]
    bins = cov.fit(lens)
    bins = list(bins)  # extract list from generator
    assert len(bins) == settings["cubes"]
    cube_entries = [cov.transform_single(lens, cube) for cube in bins]
    for c1, c2 in list(zip(cube_entries, cube_entries[1:]))[2:]:
        c1, c2 = c1[:, 0], c2[:, 0]  # indices only
        calced_overlap = len(set(list(c1)).intersection(set(list(c2)))) / max(
            len(c1), len(c2)
        )
        assert calced_overlap == pytest.approx(0.5)
```

as what it is expecting doesn't always happen with a correct implementation. I skip it instead of deleting the test outright as the core idea behind the test may be salvageable.

codecov bot commented Oct 11, 2021

Codecov Report

Merging #242 (cd0fda3) into master (ece5d47) will decrease coverage by 0.13%.
The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##           master     #242      +/-   ##
==========================================
- Coverage   80.32%   80.18%   -0.14%     
==========================================
  Files          11       11              
  Lines         864      858       -6     
  Branches      189      195       +6     
==========================================
- Hits          694      688       -6     
  Misses        138      138              
  Partials       32       32              
```
Impacted Files Coverage Δ
kmapper/cover.py 87.77% <100.00%> (-0.77%) ⬇️


erooke commented Oct 11, 2021

On a side note, I'm not sure what the correct procedure is for sending in a fix like this. Should I be opening issues before sending the pull request?

deargle commented Oct 11, 2021

@sauln can you review this?

deargle commented Oct 12, 2021

> If you cover the interval [0,1] with 2 cubes with 50% overlap, the cover you should get is [0, 2/3] and [1/3, 1].

I am not a smart man, let alone an expert with TDA. So this question comes from ignorance -- is your definition/expected outcome literature-based? As far as I can tell, kmapper's implementation does work, given its approach to calculating perc_overlap and the like. Whether its approach is contrary to the field's definition is worth discussing though.

The kmapper approach -- roughly bc I'm on mobile but I worked this out again on a napkin just now -- is to first lay out centers without regard to any cube overlap. The centers initially have radii from them that make them completely cover the range, without overlap, and up to the range bounds. After the radius is calculated, perc_overlap is applied to the diameter to see how much further to extend each radius. The idea being, "a perc_overlap of .50 means that the radius of cube_1 extends halfway into the next (unextended) cube." That may be the crux of the problem you are calling out -- that the approach should consider that the other cubes are also being extended. Is that right?

But that's why the cube in that test extends into -.25 -- because the cube starts at center .25 with an unextended radius of .25, and 50% overlap applied to its diameter means it gets another .25 on each end. That it goes beyond the bound is ignored, since centers never move w.r.t. overlap, as kmapper does it.

But @sauln will have better insight.

erooke commented Oct 12, 2021

> As far as I can tell, kmapper's implementation does work, given its approach to calculating perc_overlap and the like.

Sorry saying "implemented incorrectly" might have been a bit strong on my part. The implementation is correct in the sense that it gives you a cover of your range with n-cubes which do overlap by the specified percentage. I would argue the current behavior is confusing for a handful of reasons.

First, it seems to disagree with the documentation: the Cover constructor takes a limits parameter which defines the upper and lower bounds of the cover, yet with the current implementation the cover will exceed these limits.

```python
from kmapper import Cover
import numpy as np

limits = np.array([[0, 1]])
data = np.array([[0, 0]])
cover = Cover(n_cubes=2, perc_overlap=0.5, limits=limits)
cover.fit(data)
```

This will again create the cover [-0.25, 0.75] and [0.25, 1.25], which seems to go against the idea of upper and lower bounds. Moreover, the documentation states that if you don't provide limits it simply computes them from the maximum and minimum values of the projected data, which implies to me that the cover should not exceed the range of the projected data if no other limit is provided.
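A back-of-the-envelope reproduction of that cover from the radius formula currently in kmapper (the even spacing of centers is my assumption about how the cubes get laid out):

```python
import numpy as np

ranges, n_cubes, perc_overlap = 1.0, 2, 0.5
radius = ranges / (2 * n_cubes * (1 - perc_overlap))  # 0.5
# Centers sit where they would for non-overlapping cubes: 0.25 and 0.75.
centers = np.linspace(ranges / (2 * n_cubes),
                      ranges - ranges / (2 * n_cubes), n_cubes)
cubes = [(c - radius, c + radius) for c in centers]
print(cubes)  # both cubes exceed the limits [0, 1]
```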

Second, it disagrees with every other mapper software I've ever used:

  • python mapper (if you really want to chase this down, the computation happens in cover.py on line 274)
  • TDAmapper (here is where they compute intervals)
  • giotto-tda (giotto-tda uses open covers and wants to cover the whole real line, so the endpoints of their first and last intervals get shifted to -∞ and +∞ respectively, but beyond that it agrees with the behavior I was expecting)

Third, it half disagrees with the original mapper paper, "Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition". Although the paper makes no explicit mention of constructing a cover in this manner, example 3.1 constructs a cover of the range [0, 2] with length-one intervals and 2/3 overlap. The cover produced is {[0, 1], [1/3, 4/3], [2/3, 5/3], [1, 2]}. This is the closest I have to a literature-based reason for expecting the outcome I do. If you were to try to reproduce this cover in kmapper with n_cubes = 4 and perc_overlap = 2/3, you would get the cover {(-0.5, 1.0), (0.0, 1.5), (0.5, 2.0), (1.0, 2.5)}.
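With the corrected formula the paper's example comes out exactly (again assuming evenly spaced centers, one radius in from each end of the range):

```python
import numpy as np

R, n, p = 2.0, 4, 2/3
r = R / (2 * (n - (n - 1) * p))     # corrected radius: 0.5, i.e. width-1 intervals
centers = np.linspace(r, R - r, n)  # 1/2, 5/6, 7/6, 3/2
cubes = [(c - r, c + r) for c in centers]
print(cubes)  # approximately [(0, 1), (1/3, 4/3), (2/3, 5/3), (1, 2)]
```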

Fourth, as a user I would not expect the percent overlap to impact a single-interval cover. I would expect Cover(n_cubes=1, perc_overlap=0.1) and Cover(n_cubes=1, perc_overlap=0.9) to be identical, while here they are not.
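Under the corrected formula a single-cube cover is indeed independent of the overlap, since the perc_overlap term is multiplied by n_cubes - 1 = 0:

```python
def corrected_radius(ranges, n_cubes, perc_overlap):
    return ranges / (2 * (n_cubes - (n_cubes - 1) * perc_overlap))

# With n_cubes = 1 the overlap term vanishes entirely:
print(corrected_radius(4, 1, 0.1))  # 2.0
print(corrected_radius(4, 1, 0.9))  # 2.0
```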

So at the very least, I believe that if this behavior is to be kept it needs to be explicitly outlined in the documentation, as it is not what I assumed KeplerMapper was doing, and I suspect others familiar with the space would be equally confused.

deargle commented Oct 12, 2021

Thank you for the very detailed, well-reasoned, and well-sourced answer! I'll let @sauln respond next.

And I don't think we have a formal policy yet about "open an issue first, and then a PR."

Commit message: Previously the cubical cover overcounted the number of intersections. This commit corrects the overcounting and updates tests to check for the correct behavior.