⚡️ Speed up _make_forest_dict() by 77% in scanpy/neighbors/__init__.py #2971

@misrasaurabh1 misrasaurabh1 commented Mar 29, 2024

📄 _make_forest_dict() in scanpy/neighbors/__init__.py

📈 Performance improved by 77% (about 1.77x the original speed)

⏱️ Runtime went down from 5670.84μs to 3195.82μs
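
The two headline numbers can be reconciled with a quick calculation. The snippet below is only a sanity check on the timings quoted above; the variable names are illustrative and not part of the PR.

```python
# Timings quoted in the PR description, in microseconds.
before_us = 5670.84
after_us = 3195.82

# Speed ratio: how many times faster the new code runs.
speedup = before_us / after_us

# "Percent faster" is the speed ratio minus one, expressed as a percentage.
pct_faster = (speedup - 1) * 100

# Runtime reduction is a different quantity: the share of the old runtime saved.
runtime_reduction_pct = (1 - after_us / before_us) * 100

print(f"speedup: {speedup:.2f}x")
print(f"percent faster: {pct_faster:.0f}%")
print(f"runtime reduction: {runtime_reduction_pct:.0f}%")
```

So "77% faster" and "runtime reduced by roughly 44%" describe the same measurement from two angles.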

Explanation and details

I used numpy.array to build sizes and numpy.concatenate to build the dat array; these are much faster than numpy.fromiter and assignment respectively, especially on large datasets. The sizes of the arrays in data_list are computed only once and reused where needed, which improves runtime over the previous code, where the sizes were computed multiple times in different places.
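
As a rough illustration of the approach described above, here is a minimal sketch. It is not the actual diff from this PR: the function name `make_forest_dict_sketch`, the `PROPS` tuple, and the use of `SimpleNamespace` as a stand-in tree are assumptions made for the example. Only the four property names and the `start`/`data` keys come from the tests in this PR.

```python
import numpy as np
from types import SimpleNamespace

# Property names taken from the generated tests; the tuple itself is illustrative.
PROPS = ("hyperplanes", "offsets", "children", "indices")


def make_forest_dict_sketch(forest):
    """Hypothetical reconstruction of the described approach, not the merged code."""
    d = {}
    for prop in PROPS:
        # Gather each tree's array once so sizes and data share one pass.
        data_list = [getattr(tree, prop) for tree in forest]
        sizes = np.array([a.shape[0] for a in data_list], dtype=int)

        # Start offsets are the exclusive cumulative sum of the sizes.
        start = np.zeros_like(sizes)
        start[1:] = np.cumsum(sizes[:-1])

        # One concatenate call replaces the per-tree assignment loop.
        d[prop] = {"start": start, "data": np.concatenate(data_list)}
    return d


# Stand-in trees for demonstration only.
t1 = SimpleNamespace(
    hyperplanes=np.array([[1.0, 2.0], [3.0, 4.0]]),
    offsets=np.array([5.0, 6.0]),
    children=np.array([[7, 8], [9, 10]]),
    indices=np.array([[11, 12], [13, 14]]),
)
t2 = SimpleNamespace(
    hyperplanes=np.array([[15.0, 16.0]]),
    offsets=np.array([17.0]),
    children=np.array([[18, 19]]),
    indices=np.array([[20, 21]]),
)
out = make_forest_dict_sketch([t1, t2])
```

One thing to keep in mind with this sketch: `np.concatenate` raises `ValueError` on an empty list, so an empty forest fails with a different exception type than code that indexes `forest[0]` first.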

Correctness verification

The new optimized code was tested for correctness. The results are listed below.

✅ 8 Passed − 🌀 Generated Regression Tests

# imports
import numpy as np
import pytest

# function to test
from scanpy.neighbors import _make_forest_dict

# helper class to create mock trees with properties
class MockTree:
    def __init__(self, hyperplanes, offsets, children, indices):
        self.hyperplanes = np.array(hyperplanes)
        self.offsets = np.array(offsets)
        self.children = np.array(children)
        self.indices = np.array(indices)

# unit tests

# Test with a single tree with one-dimensional properties
def test_single_tree_one_dimensional():
    tree = MockTree(hyperplanes=[1, 2], offsets=[3], children=[4, 5], indices=[6, 7])
    forest = [tree]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["start"][0] == 0
    assert result["offsets"]["start"][0] == 0
    assert np.array_equal(result["hyperplanes"]["data"], tree.hyperplanes)
    assert np.array_equal(result["offsets"]["data"], tree.offsets)

# Test with multiple trees with two-dimensional properties
def test_multiple_trees_two_dimensional():
    tree1 = MockTree(hyperplanes=[[1, 2], [3, 4]], offsets=[5, 6], children=[[7, 8], [9, 10]], indices=[[11, 12], [13, 14]])
    tree2 = MockTree(hyperplanes=[[15, 16], [17, 18]], offsets=[19, 20], children=[[21, 22], [23, 24]], indices=[[25, 26], [27, 28]])
    forest = [tree1, tree2]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["start"][0] == 0
    assert result["hyperplanes"]["start"][1] == 2
    assert np.array_equal(result["hyperplanes"]["data"], np.vstack((tree1.hyperplanes, tree2.hyperplanes)))

# Test with an empty forest
def test_empty_forest():
    forest = []
    with pytest.raises(IndexError):
        _make_forest_dict(forest)

# Test with trees having properties with zero elements
def test_trees_with_zero_elements():
    tree = MockTree(hyperplanes=[], offsets=[], children=[], indices=[])
    forest = [tree]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["data"].size == 0

# Test with trees having properties with different data types
def test_trees_with_different_data_types():
    tree1 = MockTree(hyperplanes=[[1.0, 2.0]], offsets=[3], children=[[4, 5]], indices=[[6, 7]])
    tree2 = MockTree(hyperplanes=[[8, 9]], offsets=[10], children=[[11, 12]], indices=[[13, 14]])
    forest = [tree1, tree2]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["data"].dtype == np.float64

# Test with trees that have properties with NaN or inf values
@pytest.mark.parametrize("value", [np.nan, np.inf, -np.inf])
def test_trees_with_special_values(value):
    tree = MockTree(hyperplanes=[[value, value]], offsets=[value], children=[[value, value]], indices=[[value, value]])
    forest = [tree]
    result = _make_forest_dict(forest)
    assert np.isnan(result["hyperplanes"]["data"]).all() or np.isinf(result["hyperplanes"]["data"]).all()

# Test with a large number of trees
def test_large_number_of_trees():
    trees = [MockTree(hyperplanes=[[i, i+1]], offsets=[i+2], children=[[i+3, i+4]], indices=[[i+5, i+6]]) for i in range(1000)]
    forest = trees
    result = _make_forest_dict(forest)
    assert len(result["hyperplanes"]["start"]) == 1000
    # each of the 1000 trees contributes one row of two values
    assert result["hyperplanes"]["data"].shape == (1000, 2)

# Test with trees missing one of the expected properties
def test_trees_missing_properties():
    class IncompleteTree:
        def __init__(self, hyperplanes, children, indices):
            self.hyperplanes = np.array(hyperplanes)
            self.children = np.array(children)
            self.indices = np.array(indices)
    tree = IncompleteTree(hyperplanes=[[1, 2]], children=[[3, 4]], indices=[[5, 6]])
    forest = [tree]
    with pytest.raises(AttributeError):
        _make_forest_dict(forest)

This optimization was discovered by Codeflash AI ⚡️

  • Tests included

codeflash-ai bot and others added 2 commits March 24, 2024 23:13