
Corrected cubical cover computation #242

Open · wants to merge 1 commit into master
Conversation

@erooke (Contributor) commented Oct 11, 2021

The cubical cover computation is implemented incorrectly. It starts with the tests having the wrong expected behavior here:

```python
def test_perc_overlap(self, CoverClass):
    """
    2 cubes with 50% overlap and a range of [0,1] should lead to two cubes with intervals:
        [0, .75]
        [.25, 1]
    """
```

If you cover the interval [0, 1] with 2 cubes with 50% overlap, the cover you should get is [0, 2/3] and [1/3, 1]. The cover [0, 0.75], [0.25, 1] has a percent overlap of (0.75 - 0.25)/0.75 = 0.5/0.75 = 2/3. It is worth noting that this test actually generates the cover [-0.25, 0.75], [0.25, 1.25], which does have the correct overlap but has increased the range of the cover.
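As a quick sanity check (assuming "percent overlap" means the length of the intersection of two adjacent intervals divided by the interval length, which is how the corrected formula below treats it):

```python
def perc_overlap(a, b):
    """Fraction of interval `a` covered by its intersection with `b`."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (hi - lo) / (a[1] - a[0])

# Cover claimed by the docstring: the overlap is 2/3, not 1/2.
print(perc_overlap((0, 0.75), (0.25, 1)))
# Proposed cover of [0, 1]: the overlap is exactly 1/2.
print(perc_overlap((0, 2/3), (1/3, 1)))
```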

The source of the error is this line:

```python
# |range| / (2n ( 1 - p))
with np.errstate(divide='ignore'):
    radius = ranges / (2 * (n_cubes) * (1 - perc_overlap))
```

There is an off-by-one error causing kmapper to count an intersection which doesn't exist. The computation should be `ranges / (2 * (n_cubes - (n_cubes - 1) * perc_overlap))`.
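A sketch of the corrected computation on the [0, 1] example (the even placement of centers between the two endpoints is my assumption about how the cubes get laid out, not code from kmapper):

```python
import numpy as np

def corrected_radius(ranges, n_cubes, perc_overlap):
    # n cubes of width w with pairwise intersections of length p*w tile a
    # range of length |range| when n*w - (n - 1)*p*w = |range|, hence:
    return ranges / (2 * (n_cubes - (n_cubes - 1) * perc_overlap))

# Two cubes, 50% overlap, range [0, 1]:
r = corrected_radius(1.0, 2, 0.5)        # 1/3
centers = np.linspace(0 + r, 1 - r, 2)   # first/last centers one radius in
cubes = [(c - r, c + r) for c in centers]
print(cubes)  # approximately [(0, 2/3), (1/3, 1)]
```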

There are other tests which enforce this incorrect behavior as well.

```python
def test_radius_dist(self):
    test_cases = [
        {"cubes": 1, "range": [0, 4], "overlap": 0.4, "radius": 10.0 / 3},
        {"cubes": 1, "range": [0, 4], "overlap": 0.9, "radius": 20.0},
        {"cubes": 2, "range": [-4, 4], "overlap": 0.5, "radius": 4.0},
        {"cubes": 3, "range": [-4, 4], "overlap": 0.5, "radius": 2.666666666},
        {"cubes": 10, "range": [-4, 4], "overlap": 0.5, "radius": 0.8},
        {"cubes": 10, "range": [-4, 4], "overlap": 1.0, "radius": np.inf},
    ]
```

This test in particular asserts that the cover should behave strangely. The first two test cases imply that a single-element cover should be affected by the overlap percentage chosen, and the last test is essentially codifying the fact that the current implementation divides by zero when perc_overlap is 1.
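The single-cube anomaly is easy to see by plugging the first two test cases into the current formula (reproduced here as a standalone function for illustration):

```python
def current_radius(ranges, n_cubes, perc_overlap):
    # The formula currently in kmapper: |range| / (2n(1 - p))
    return ranges / (2 * n_cubes * (1 - perc_overlap))

# One cube covering [0, 4] should always have radius 2, yet:
print(current_radius(4, 1, 0.4))  # ~10/3, as the test expects
print(current_radius(4, 1, 0.9))  # ~20
```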

This pull request updates the tests to check for the correct behavior and updates the implementation to pass those tests. It also skips this test

```python
def test_equal_entries(self):
    settings = {"cubes": 10, "overlap": 0.5}
    # uniform data:
    data = np.arange(0, 100)
    data = data[:, np.newaxis]
    lens = data
    cov = Cover(settings["cubes"], settings["overlap"])
    # Prefix'ing the data with an ID column
    ids = np.array([x for x in range(lens.shape[0])])
    lens = np.c_[ids, lens]
    bins = cov.fit(lens)
    bins = list(bins)  # extract list from generator
    assert len(bins) == settings["cubes"]
    cube_entries = [cov.transform_single(lens, cube) for cube in bins]
    for c1, c2 in list(zip(cube_entries, cube_entries[1:]))[2:]:
        c1, c2 = c1[:, 0], c2[:, 0]  # indices only
        calced_overlap = len(set(list(c1)).intersection(set(list(c2)))) / max(
            len(c1), len(c2)
        )
        assert calced_overlap == pytest.approx(0.5)
```

as what it is expecting doesn't always happen with a correct implementation. I skip it instead of deleting the test outright as the core idea behind the test may be salvageable.

codecov bot commented Oct 11, 2021

Codecov Report

Merging #242 (cd0fda3) into master (ece5d47) will decrease coverage by 0.13%.
The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##           master     #242      +/-   ##
==========================================
- Coverage   80.32%   80.18%   -0.14%     
==========================================
  Files          11       11              
  Lines         864      858       -6     
  Branches      189      195       +6     
==========================================
- Hits          694      688       -6     
  Misses        138      138              
  Partials       32       32              
```
Impacted Files Coverage Δ
kmapper/cover.py 87.77% <100.00%> (-0.77%) ⬇️


erooke commented Oct 11, 2021

On a side note, I'm not sure what the correct procedure is for sending in a fix like this. Should I be opening issues before sending the pull request?

deargle commented Oct 11, 2021

@sauln can you review this?

deargle commented Oct 12, 2021

> If you cover the interval [0,1] with 2 cubes with 50% overlap, the cover you should get is [0, 2/3] and [1/3, 1].

I am not a smart man, let alone an expert with TDA. So this question comes from ignorance -- is your definition/expected outcome literature-based? As far as I can tell, kmapper's implementation does work, given its approach to calculating perc_overlap and the like. Whether its approach is contrary to the field's definition is worth discussing though.

The kmapper approach -- roughly bc I'm on mobile but I worked this out again on a napkin just now -- is to first lay out centers without regard to any cube overlap. The centers initially have radii from them that make them completely cover the range, without overlap, and up to the range bounds. After the radius is calculated, perc_overlap is applied to the diameter to see how much further to extend each radius. The idea being, "a perc_overlap of .50 means that the radius of cube_1 extends halfway into the next (unextended) cube." That may be the crux of the problem you are calling out -- that the approach should consider that the other cubes are also being extended. Is that right?

But that's why the cube in that test extends into -.25 -- because the cube starts at center .25 with an unextended radius of .25, and 50% overlap applied to its diameter means it gets another .25 on each end. That it goes beyond the bound is ignored, since centers never move w.r.t. overlap, as kmapper does it.

But @sauln will have better insight.

erooke commented Oct 12, 2021

> As far as I can tell, kmapper's implementation does work, given its approach to calculating perc_overlap and the like.

Sorry saying "implemented incorrectly" might have been a bit strong on my part. The implementation is correct in the sense that it gives you a cover of your range with n-cubes which do overlap by the specified percentage. I would argue the current behavior is confusing for a handful of reasons.

First, it seems to disagree with the documentation: the Cover constructor takes a limits parameter which defines the upper and lower bounds of the cover, yet with the current implementation the cover will exceed these limits.

```python
from kmapper import Cover
import numpy as np

limits = np.array([[0, 1]])
data = np.array([[0, 0]])
cover = Cover(n_cubes=2, perc_overlap=0.5, limits=limits)
cover.fit(data)
```

This will again create the cover [-0.25, 0.75] and [0.25, 1.25], which seems to go against the idea of upper and lower bounds. Moreover, the documentation states that if you don't provide limits it simply computes them from the maximum and minimum values of the projected data, which implies to me that the cover should not exceed the range of the projected data if no other limit is provided.
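A back-of-the-envelope reproduction of that cover from the radius formula currently in kmapper (the even spacing of centers is my assumption about how the cubes get laid out):

```python
import numpy as np

ranges, n_cubes, perc_overlap = 1.0, 2, 0.5
radius = ranges / (2 * n_cubes * (1 - perc_overlap))  # 0.5
# Centers sit where they would for non-overlapping cubes: 0.25 and 0.75.
centers = np.linspace(ranges / (2 * n_cubes),
                      ranges - ranges / (2 * n_cubes), n_cubes)
cubes = [(c - radius, c + radius) for c in centers]
print(cubes)  # both cubes exceed the limits [0, 1]
```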

Second, it disagrees with every other mapper software I've ever used:

  • python mapper (if you really want to chase this down, the computation happens in cover.py on line 274)
  • TDAmapper (here is where they compute intervals)
  • giotto-tda (giotto-tda uses open covers and wants to cover the whole real line, so the endpoints of their first and last intervals get shifted to -∞ and +∞ respectively, but beyond that it agrees with the behavior I was expecting)

Third, it half disagrees with the original mapper paper, "Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition". Although the paper makes no explicit mention of constructing a cover in this manner, example 3.1 constructs a cover of the range [0, 2] with length-one intervals and 2/3 overlap. The cover produced is {[0, 1], [1/3, 4/3], [2/3, 5/3], [1, 2]}. This is the closest I have to a literature-based reason for expecting the outcome I do. If you were to try to reproduce this cover in kmapper with n_cubes = 4 and perc_overlap = 2/3, you would get the cover {(-0.5, 1.0), (0.0, 1.5), (0.5, 2.0), (1.0, 2.5)}.
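With the corrected formula the paper's example comes out exactly (again assuming evenly spaced centers, one radius in from each end of the range):

```python
import numpy as np

R, n, p = 2.0, 4, 2/3
r = R / (2 * (n - (n - 1) * p))     # corrected radius: 0.5, i.e. width-1 intervals
centers = np.linspace(r, R - r, n)  # 1/2, 5/6, 7/6, 3/2
cubes = [(c - r, c + r) for c in centers]
print(cubes)  # approximately [(0, 1), (1/3, 4/3), (2/3, 5/3), (1, 2)]
```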

Fourth, as a user I would not expect the percent overlap to impact a single-interval cover. I would expect Cover(n_cubes=1, perc_overlap=0.1) and Cover(n_cubes=1, perc_overlap=0.9) to be identical, while here they are not.
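Under the corrected formula a single-cube cover is indeed independent of the overlap, since the perc_overlap term is multiplied by n_cubes - 1 = 0:

```python
def corrected_radius(ranges, n_cubes, perc_overlap):
    return ranges / (2 * (n_cubes - (n_cubes - 1) * perc_overlap))

# With n_cubes = 1 the overlap term vanishes entirely:
print(corrected_radius(4, 1, 0.1))  # 2.0
print(corrected_radius(4, 1, 0.9))  # 2.0
```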

So at the very least, I believe that if this behavior is to be kept it needs to be explicitly outlined in the documentation, as it is not what I assumed KeplerMapper was doing, and I suspect others familiar with the space would be equally confused.

deargle commented Oct 12, 2021

Thank you for the very detailed, well-reasoned, and well-sourced answer! I'll let @sauln respond next.

And I don't think we have a formal policy yet about "open an issue first, and then a PR."

Commit message: Previously the cubical cover overcounted the number of intersections. This commit corrects the overcounting and updates tests to check for the correct behavior.