
Bugfix for quadratic blow-up of batch validation errors #871

Merged
merged 4 commits into main from bugfix/blow-up-of-validation-errors on Jan 14, 2025

Conversation

mgovers
Member

@mgovers mgovers commented Jan 14, 2025

Batch validation errors were appended to the same list for every scenario.

This meant that if there were 2 scenarios, then the same error would be reported 2x2=4 times instead of only 2 times (once for each scenario). For 4 scenarios, that would become 16 times, etc.

Note that this may result in a high cost when reporting errors for large batches.
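
For illustration, here is a minimal sketch of the aliasing pattern behind this (hypothetical names, not the actual validator code):

    input_errors = []                 # shared list, reused across scenarios
    errors_per_scenario = {}          # scenario index -> list of validation errors

    for scenario in range(2):
        batch_errors = input_errors   # aliases the shared list instead of creating a new one
        batch_errors.append(f"error in scenario {scenario}")
        errors_per_scenario[scenario] = batch_errors

    # Both scenarios now reference the same, ever-growing list, so each of the
    # 2 scenarios reports 2 errors: 2 x 2 = 4 reports instead of 2.
    print(sum(len(errors) for errors in errors_per_scenario.values()))  # 4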

This was introduced in #793 (release https://github.com/PowerGridModel/power-grid-model/releases/tag/v1.9.87) in commit 3b0c6b6

Credits to @BartSchuurmans for finding this

@mgovers mgovers added the bug Something isn't working label Jan 14, 2025
@mgovers mgovers self-assigned this Jan 14, 2025
@mgovers mgovers enabled auto-merge January 14, 2025 14:09
figueroa1395 previously approved these changes Jan 14, 2025
Contributor

@Jerry-Jinfeng-Guo Jerry-Jinfeng-Guo left a comment


How does this change prevent the appending? The original code was:

        batch_errors = input_errors + id_errors

        if not id_errors:
            batch_errors = input_errors
            # ...

The new one is:

        batch_errors = input_errors + id_errors

        if not id_errors:
            # ...

@mgovers
Member Author

mgovers commented Jan 14, 2025

> How does this change prevent the appending? The original code was:
>
>         batch_errors = input_errors + id_errors
>
>         if not id_errors:
>             batch_errors = input_errors
>             # ...
>
> The new one is:
>
>         batch_errors = input_errors + id_errors
>
>         if not id_errors:
>             # ...

In Python, everything is a reference, except when it's not:

  • `batch_errors = input_errors` makes `batch_errors` a reference to `input_errors`
  • `batch_errors = input_errors + id_errors` makes a new list with the contents of `input_errors`, followed by the contents of `id_errors`

That means that if you append to `batch_errors`, then in the former case you actually append to `input_errors`.

If you then create a reference when adding the current scenario errors to the full list, then in the former case every scenario will point to `input_errors`. In the latter, you basically move the contents of `batch_errors` (a new list) into the full error list (technically, it creates a reference and increments the reference count; when `batch_errors` goes out of scope, the count is decremented again, which is effectively a move of the object).
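
A quick standalone illustration of the difference (plain Python, not the validator code itself):

    input_errors = ["input error"]
    id_errors = ["id error"]

    alias = input_errors              # same object: appending here also grows input_errors
    alias.append("scenario error")
    assert input_errors == ["input error", "scenario error"]

    fresh = input_errors + id_errors  # new list: appending here leaves input_errors untouched
    fresh.append("another scenario error")
    assert len(fresh) == 4
    assert len(input_errors) == 2     # unchanged by the append to fresh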

@Jerry-Jinfeng-Guo
Contributor

> How does this change prevent the appending?

OK, I got it now: the `batch_errors = input_errors` line duplicated the input errors for every batch entry.

@mgovers mgovers added this pull request to the merge queue Jan 14, 2025
Merged via the queue into main with commit b428bda Jan 14, 2025
28 checks passed
@mgovers mgovers deleted the bugfix/blow-up-of-validation-errors branch January 14, 2025 15:37
@Jerry-Jinfeng-Guo
Contributor

Jerry-Jinfeng-Guo commented Jan 14, 2025

When you do complexity analysis, it's important not to confuse the N's. In this case the complexity is N log(N) if you suppose one of them is at log scale of the other, n = log N, i.e., number_of_batches * sum_of_errors.

@mgovers
Member Author

mgovers commented Jan 14, 2025

> When you do complexity analysis, it's important not to confuse the N's. In this case the complexity is N log(N) if you suppose one of them is at log scale of the other, n = log N, i.e., number_of_batches * sum_of_errors.

Actually, no, because each scenario updates the same object it references, and a reference to the original object is also used. That means that at scenario k, you also still update the errors referenced by scenario 0, and they all grow identically. (Interestingly, as a side effect, there is no additional memory overhead beyond linear in the number of scenarios, but each error is referenced N times, resulting in the quadratic overhead.)

Even if that were not the case, and each scenario only contained the sum of the former ones, you would end up with triangle numbers, which are also quadratic: the 1st scenario references 1 error, the 2nd scenario references 2 errors, ..., the k-th references k errors. Total = $$\sum_{k=1}^n k = \frac{n(n+1)}{2}$$
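
A quick numerical check of that triangle-number argument (a sketch, not the validator code):

    # If scenario k ends up referencing the errors of scenarios 1..k,
    # the total number of referenced errors is the n-th triangle number.
    def total_referenced(n_scenarios: int) -> int:
        return sum(range(1, n_scenarios + 1))

    assert total_referenced(4) == 4 * 5 // 2   # 10 = n(n+1)/2, i.e. quadratic growth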

@mgovers
Member Author

mgovers commented Jan 14, 2025

Note that this was also verified experimentally by @BartSchuurmans.

@Jerry-Jinfeng-Guo
Contributor

When looking at such problems, the total amount of (non-duplicate) errors is dominant; you don't (need to) look at the batch size. I am not saying N log N is any better than quadratic; on the contrary, it might even lead to close to cubic complexity.

@Jerry-Jinfeng-Guo
Contributor

Like I said, anything between n log n (lower case) and cubic is to be expected.

@mgovers
Member Author

mgovers commented Jan 15, 2025

How did you arrive at n = log N? I expect two types of data validation errors to happen in production: either all scenarios suffer from the same issue (erroneous input data), or some scenarios suffer from their own issue (erroneous update data). The former has a constant contribution to each scenario, and the latter is statistically distributed across the scenarios, which also contributes as a constant, but with a different prefactor.

That said: the amount of errors per scenario only contributes to the prefactor, not to the scaling in terms of the number of scenarios, which is the issue described here.

@Jerry-Jinfeng-Guo
Contributor

> How did you arrive at n = log N?

That's the only assumption we could make. I see no need to further this discussion as it does not impact the fix you did here.
