⚡️ Speed up `_make_forest_dict()` by 77% in `scanpy/neighbors/__init__.py` #2971

misrasaurabh1 · 2024-03-29T22:55:24Z

📄 `_make_forest_dict()` in `scanpy/neighbors/__init__.py`

📈 Performance improved by 77% (about 1.77x the original speed)

⏱️ Runtime went down from 5670.84μs to 3195.82μs
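The headline percentage follows from the two runtimes reported here. A quick check of the arithmetic, as a sketch (the variable names are illustrative and not part of the PR):

```python
# Runtimes reported in this PR description, in microseconds.
before_us = 5670.84
after_us = 3195.82

# Speedup factor: how many times faster the new code runs.
speedup = before_us / after_us

# Fraction of the original runtime that was eliminated.
reduction = 1 - after_us / before_us

print(f"speedup: {speedup:.2f}x ({(speedup - 1) * 100:.0f}% faster)")
print(f"runtime reduction: {reduction * 100:.1f}%")
```

Note that "77% faster" describes throughput; the wall-clock runtime itself drops by a smaller fraction.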

Explanation and details

I used `numpy.array` and `numpy.concatenate` for the `sizes` and `dat` objects; these are faster than `numpy.fromiter` and slice-by-slice assignment, respectively, especially on large datasets. The sizes of the arrays in `data_list` are computed only once and reused where needed, whereas the previous code computed them multiple times in different places.
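The diff itself is not reproduced in this description, so the following is only a minimal sketch of the kind of change described, not scanpy's actual code. The function names `make_forest_dict_loop` and `make_forest_dict_concat`, the `PROPS` tuple, and the `SimpleNamespace` stand-in trees are all assumptions made for illustration; the output layout (`"start"` and `"data"` per property) is inferred from the generated tests.

```python
from types import SimpleNamespace

import numpy as np

PROPS = ("hyperplanes", "offsets", "children", "indices")


def make_forest_dict_loop(forest):
    """Baseline style: sizes via np.fromiter, data filled by slice assignment."""
    d = {}
    for prop in PROPS:
        arrays = [getattr(tree, prop) for tree in forest]
        sizes = np.fromiter((a.shape[0] for a in arrays), dtype=int)
        starts = np.zeros_like(sizes)
        # Preallocate the output, then copy each tree's block into place.
        dat = np.empty((sizes.sum(),) + arrays[0].shape[1:], dtype=arrays[0].dtype)
        start = 0
        for i, (size, a) in enumerate(zip(sizes, arrays)):
            starts[i] = start
            dat[start:start + size] = a
            start += size
        d[prop] = {"start": starts, "data": dat}
    return d


def make_forest_dict_concat(forest):
    """Described style: sizes via np.array, data via a single np.concatenate."""
    d = {}
    for prop in PROPS:
        arrays = [getattr(tree, prop) for tree in forest]  # gathered once
        sizes = np.array([a.shape[0] for a in arrays], dtype=int)
        starts = np.zeros_like(sizes)
        starts[1:] = np.cumsum(sizes)[:-1]  # each block starts where the last ended
        # Cast to the first tree's dtype to match the baseline's output dtype.
        dat = np.concatenate(arrays).astype(arrays[0].dtype, copy=False)
        d[prop] = {"start": starts, "data": dat}
    return d


# Hypothetical stand-in trees: three trees with two rows per property.
forest = [
    SimpleNamespace(
        hyperplanes=np.full((2, 3), i, dtype=np.float32),
        offsets=np.full(2, i, dtype=np.float32),
        children=np.full((2, 2), i, dtype=np.int32),
        indices=np.full((2, 4), i, dtype=np.int32),
    )
    for i in range(3)
]

a = make_forest_dict_loop(forest)
b = make_forest_dict_concat(forest)
for prop in PROPS:
    assert np.array_equal(a[prop]["start"], b[prop]["start"])
    assert np.array_equal(a[prop]["data"], b[prop]["data"])
```

Both versions should produce identical dictionaries on a non-empty forest; whether the second is actually faster depends on forest size and array shapes, so it is worth timing on representative data.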

Correctness verification

The new optimized code was tested for correctness. The results are listed below.

✅ 8 Passed − 🌀 Generated Regression Tests


# imports
import numpy as np
import pytest

# function under test (located in scanpy/neighbors/__init__.py, per the PR title)
from scanpy.neighbors import _make_forest_dict

# helper class to create mock trees with properties
class MockTree:
    def __init__(self, hyperplanes, offsets, children, indices):
        self.hyperplanes = np.array(hyperplanes)
        self.offsets = np.array(offsets)
        self.children = np.array(children)
        self.indices = np.array(indices)

# unit tests

# Test with a single tree with one-dimensional properties
def test_single_tree_one_dimensional():
    tree = MockTree(hyperplanes=[1, 2], offsets=[3], children=[4, 5], indices=[6, 7])
    forest = [tree]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["start"][0] == 0
    assert result["offsets"]["start"][0] == 0
    assert np.array_equal(result["hyperplanes"]["data"], tree.hyperplanes)
    assert np.array_equal(result["offsets"]["data"], tree.offsets)

# Test with multiple trees with two-dimensional properties
def test_multiple_trees_two_dimensional():
    tree1 = MockTree(hyperplanes=[[1, 2], [3, 4]], offsets=[5, 6], children=[[7, 8], [9, 10]], indices=[[11, 12], [13, 14]])
    tree2 = MockTree(hyperplanes=[[15, 16], [17, 18]], offsets=[19, 20], children=[[21, 22], [23, 24]], indices=[[25, 26], [27, 28]])
    forest = [tree1, tree2]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["start"][0] == 0
    assert result["hyperplanes"]["start"][1] == 2
    assert np.array_equal(result["hyperplanes"]["data"], np.vstack((tree1.hyperplanes, tree2.hyperplanes)))

# Test with an empty forest
def test_empty_forest():
    forest = []
    with pytest.raises(IndexError):
        _make_forest_dict(forest)

# Test with trees having properties with zero elements
def test_trees_with_zero_elements():
    tree = MockTree(hyperplanes=[], offsets=[], children=[], indices=[])
    forest = [tree]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["data"].size == 0

# Test with trees having properties with different data types
def test_trees_with_different_data_types():
    tree1 = MockTree(hyperplanes=[[1.0, 2.0]], offsets=[3], children=[[4, 5]], indices=[[6, 7]])
    tree2 = MockTree(hyperplanes=[[8, 9]], offsets=[10], children=[[11, 12]], indices=[[13, 14]])
    forest = [tree1, tree2]
    result = _make_forest_dict(forest)
    assert result["hyperplanes"]["data"].dtype == np.float64

# Test with trees that have properties with NaN or inf values
@pytest.mark.parametrize("value", [np.nan, np.inf, -np.inf])
def test_trees_with_special_values(value):
    tree = MockTree(hyperplanes=[[value, value]], offsets=[value], children=[[value, value]], indices=[[value, value]])
    forest = [tree]
    result = _make_forest_dict(forest)
    assert np.isnan(result["hyperplanes"]["data"]).all() or np.isinf(result["hyperplanes"]["data"]).all()

# Test with a large number of trees
def test_large_number_of_trees():
    trees = [MockTree(hyperplanes=[[i, i+1]], offsets=[i+2], children=[[i+3, i+4]], indices=[[i+5, i+6]]) for i in range(1000)]
    forest = trees
    result = _make_forest_dict(forest)
    assert len(result["hyperplanes"]["start"]) == 1000
    # 1000 trees with one (1, 2) hyperplane row each stack to 1000 rows
    assert result["hyperplanes"]["data"].shape == (1000, 2)

# Test with trees missing one of the expected properties
def test_trees_missing_properties():
    class IncompleteTree:
        def __init__(self, hyperplanes, children, indices):
            self.hyperplanes = np.array(hyperplanes)
            self.children = np.array(children)
            self.indices = np.array(indices)
    tree = IncompleteTree(hyperplanes=[[1, 2]], children=[[3, 4]], indices=[[5, 6]])
    forest = [tree]
    with pytest.raises(AttributeError):
        _make_forest_dict(forest)

This optimization was discovered by Codeflash AI ⚡️

Tests included


codeflash-ai bot and others added 2 commits March 24, 2024 23:13

Merge branch 'main' into codeflash/optimize-_make_forest_dict-2024-03…

457861d

…-24T23.13.47

misrasaurabh1 closed this Mar 29, 2024
