
Bugfix for quadratic blow-up of batch validation errors #871

Merged
merged 4 commits into main from bugfix/blow-up-of-validation-errors on Jan 14, 2025

Conversation

mgovers
Member

@mgovers mgovers commented Jan 14, 2025

Batch validation errors were appended to the same list for every scenario.

This meant that if there were 2 scenarios, then the same error would be reported 2x2=4 times instead of only 2 times (once for each scenario). For 4 scenarios, that would become 16 times, etc.

Note that this may result in a high cost when reporting errors for large batches.
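
For illustration, here is a minimal sketch of the aliasing pattern behind this (hypothetical names, not the actual validator code):

    input_errors = []                 # shared list, reused across scenarios
    errors_per_scenario = {}          # scenario index -> list of validation errors

    for scenario in range(2):
        batch_errors = input_errors   # aliases the shared list instead of creating a new one
        batch_errors.append(f"error in scenario {scenario}")
        errors_per_scenario[scenario] = batch_errors

    # Both scenarios now reference the same, ever-growing list, so each of the
    # 2 scenarios reports 2 errors: 2 x 2 = 4 reports instead of 2.
    print(sum(len(errors) for errors in errors_per_scenario.values()))  # 4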

This was introduced in #793 (release https://github.com/PowerGridModel/power-grid-model/releases/tag/v1.9.87) in commit 3b0c6b6

Credits to @BartSchuurmans for finding this

@mgovers mgovers added the bug Something isn't working label Jan 14, 2025
@mgovers mgovers self-assigned this Jan 14, 2025
@mgovers mgovers enabled auto-merge January 14, 2025 14:09
figueroa1395 previously approved these changes Jan 14, 2025
Contributor

@Jerry-Jinfeng-Guo Jerry-Jinfeng-Guo left a comment


How does this change prevent the appending? The original code was:

        batch_errors = input_errors + id_errors

        if not id_errors:
            batch_errors = input_errors
            # ...

The new one is:

        batch_errors = input_errors + id_errors

        if not id_errors:
            # ...

@mgovers
Member Author

mgovers commented Jan 14, 2025

> How does this change prevent the appending? The original code was:
>
>         batch_errors = input_errors + id_errors
>
>         if not id_errors:
>             batch_errors = input_errors
>             # ...
>
> The new one is:
>
>         batch_errors = input_errors + id_errors
>
>         if not id_errors:
>             # ...

In Python, everything is a reference, except when it's not:

  • `batch_errors = input_errors` makes `batch_errors` a reference to `input_errors`
  • `batch_errors = input_errors + id_errors` makes a new list with the contents of `input_errors`, followed by the contents of `id_errors`

That means that if you append to `batch_errors`, then in the former case you actually append to `input_errors`.

If you then create a reference when adding the current scenario errors to the full list, then in the former case every scenario will point to `input_errors`. In the latter, you basically move the contents of `batch_errors` (a new list) into the full error list (technically, it creates a reference and increments the reference count; when `batch_errors` goes out of scope, the count is decremented again, which is effectively a move of the object).
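
A quick standalone illustration of the difference (plain Python, not the validator code itself):

    input_errors = ["input error"]
    id_errors = ["id error"]

    alias = input_errors              # same object: appending here also grows input_errors
    alias.append("scenario error")
    assert input_errors == ["input error", "scenario error"]

    fresh = input_errors + id_errors  # new list: appending here leaves input_errors untouched
    fresh.append("another scenario error")
    assert len(fresh) == 4
    assert len(input_errors) == 2     # unchanged by the append to fresh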

@Jerry-Jinfeng-Guo
Contributor

> How does this change prevent the appending?

OK, I got it now: the `batch_errors = input_errors` line duplicated the input errors for every batch entry.

@mgovers mgovers added this pull request to the merge queue Jan 14, 2025
Merged via the queue into main with commit b428bda Jan 14, 2025
28 checks passed
@mgovers mgovers deleted the bugfix/blow-up-of-validation-errors branch January 14, 2025 15:37
@Jerry-Jinfeng-Guo
Contributor

Jerry-Jinfeng-Guo commented Jan 14, 2025

When you do complexity analysis, it's important not to confuse the N's. In this case the complexity is N log(N) if you suppose one of them is at log scale of the other, n = log N, i.e., number_of_batches * sum_of_errors.

@mgovers
Member Author

mgovers commented Jan 14, 2025

> When you do complexity analysis, it's important not to confuse the N's. In this case the complexity is N log(N) if you suppose one of them is at log scale of the other, n = log N, i.e., number_of_batches * sum_of_errors.

Actually, no, because each scenario updates the same object it references, and a reference to the original object is also used. That means that at scenario k, you also still update the errors referenced by scenario 0, and they all grow identically. (Interestingly, as a side effect, there is no additional memory overhead beyond linear in the number of scenarios, but each error is referenced N times, resulting in the quadratic overhead.)

Even if that were not the case, and each scenario only contained the sum of the former ones, you would end up with triangle numbers, which are also quadratic: the 1st scenario references 1 error, the 2nd scenario references 2 errors, ..., the k-th references k errors. Total = $$\sum_{k=1}^n k = \frac{n(n+1)}{2}$$
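
A quick numerical check of that triangle-number argument (a sketch, not the validator code):

    # If scenario k ends up referencing the errors of scenarios 1..k,
    # the total number of referenced errors is the n-th triangle number.
    def total_referenced(n_scenarios: int) -> int:
        return sum(range(1, n_scenarios + 1))

    assert total_referenced(4) == 4 * 5 // 2   # 10 = n(n+1)/2, i.e. quadratic growth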

@mgovers
Member Author

mgovers commented Jan 14, 2025

Note that this was also verified experimentally by @BartSchuurmans.

@Jerry-Jinfeng-Guo
Contributor

When looking at such problems, the total amount of (non-duplicate) errors is dominant; you don't (need to) look at the batch size. I am not saying N log N is any better than quadratic; on the contrary, it might even lead to close to cubic complexity.

@Jerry-Jinfeng-Guo
Contributor

Like I said, anything between n log n (lower case) and cubic is to be expected.

@mgovers
Member Author

mgovers commented Jan 15, 2025

How did you arrive at n = log N? I expect two types of data validation errors to happen in production: either all scenarios suffer from the same issue (erroneous input data), or some scenarios suffer from their own issue (erroneous update data). The former has a constant contribution to each scenario, and the latter is statistically distributed across the scenarios, which also contributes as a constant, but with a different prefactor.

That said: the amount of errors per scenario only contributes to the prefactor, not to the scaling in terms of the number of scenarios, which is the issue described here.

@Jerry-Jinfeng-Guo
Contributor

> How did you arrive at n = log N?

That's the only assumption we could make. I see no need to further this discussion as it does not impact the fix you did here.
