[FEATURE] Add support columnar data buffer to save memory usage #548
Comments
This will also make the PyArrow conversions easier (https://github.com/orgs/PowerGridModel/discussions/2). Relates to PowerGridModel/power-grid-model-io#190.
An alternative idea around Python API steps 2 and 3: additional arguments to the deserialization functions.
Old proposal: 2 arguments:
New: Only one additional argument:
Note: The difference in both option is the
Example
# All row based
data_filter = None
# All column based
data_filter = ...
# specific components row based
data_filter = ["node", "line"]
data_filter = {"node", "line"}
# specific components, mixed row-based and columnar data
data_filter = {
"node": None, # row-based
"line": ["id", "p_from"], # columnar with these attributes
"transformer": ... # columnar with all attributes # <--- (proposed) Ellipsis is a python built-in constant
}
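To make the intended semantics of the single `data_filter` argument easier to discuss, here is a minimal sketch of how it could be resolved per component. The helper name `resolve_data_filter` and the `available` mapping are hypothetical and only serve the illustration; silently skipping unknown components and attributes is also an assumption, not something decided in this thread.

```python
def resolve_data_filter(data_filter, available):
    """Resolve a data_filter into {component: None | [attributes]}.

    None means "row-based"; a list means "columnar with these attributes".
    `available` maps each component present in the data to its attribute names.
    Hypothetical helper, not part of the library API.
    """
    if data_filter is None:
        # everything row-based
        return {component: None for component in available}
    if data_filter is Ellipsis:
        # everything columnar, all attributes
        return {component: list(attrs) for component, attrs in available.items()}
    if not isinstance(data_filter, dict):
        # list or set of component names: only those components, row-based
        return {component: None for component in available if component in data_filter}

    resolved = {}
    for component, attributes in data_filter.items():
        if component not in available:
            continue  # assumption: unknown components are skipped
        if attributes is None:
            resolved[component] = None  # row-based
        elif attributes is Ellipsis:
            resolved[component] = list(available[component])  # all attributes
        else:
            # keep only the requested attributes that exist in the data
            resolved[component] = [a for a in attributes if a in available[component]]
    return resolved


# Made-up attribute lists, only for the example
available = {
    "node": ["id", "u_rated"],
    "line": ["id", "from_node", "p_from"],
    "transformer": ["id", "tap_pos"],
}
mixed = resolve_data_filter(
    {"node": None, "line": ["id", "p_from"], "transformer": ...}, available
)
```

With this reading, a component that is present in the data but missing from the filter is dropped, which is the "loss of data" that should be documented.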
Implementation proposal step 7.ii:
Loosening the requirement that
Good point. I have adjusted the step proposal.
Continuing this thread and this thread from #799 for potential follow-up improvements to columnar data here because this issue is still alive and kicking. We can convert this to one or more new issues if necessary:
Not sure yet, it probably makes sense to directly comment something about it under
No action items, but since there was a miscommunication within the team, I will add the full customer journey illustrated in #548 (comment) here:
After #814, the only thing remaining is:
I propose to make a separate issue for that so that this issue can be closed before the release.
Background
Power Grid Model currently uses the row-based buffer to share the data across the C-API. For example, the memory layout of a node input buffer looks like:
Where `X` is a meaningful byte and `O` is an empty byte to align the memory. In this way we can match the input/update/output data structs exactly as we do in the calculation core. This can deliver performance benefits, as we avoid any copies and the memory layout matches exactly.
When we need to leave some attributes unspecified in input/update, we set the pre-defined `null` value in their place so the core knows the value should be ignored.
Problem
While this design is CPU-wise very efficient, it could be memory-wise inefficient due to several reasons:
`null`.
There is a strong case to support columnar data buffers. We give two real-world examples of this issue.
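To make the row-based layout from the Background section concrete, here is a minimal sketch using only Python's `struct` module. It assumes that a node input record consists of an `int32` `id`, 4 alignment bytes, and a `float64` `u_rated`, and that the null value of a floating-point attribute is NaN. The names `NODE_ROW` and `pack_node` are made up for this illustration.

```python
import math
import struct

# Assumed layout of one node input record:
# int32 id (4 bytes) + 4 padding bytes + float64 u_rated (8 bytes)
NODE_ROW = struct.Struct("<i4xd")


def pack_node(node_id, u_rated=None):
    """Pack one row; an unspecified u_rated is stored as NaN (assumed null value)."""
    return NODE_ROW.pack(node_id, math.nan if u_rated is None else u_rated)


# Bytes that carry information vs. bytes that only align the record
meaningful = struct.calcsize("<i") + struct.calcsize("<d")
padding = NODE_ROW.size - meaningful

row = pack_node(1, 150e3)
print(NODE_ROW.size, meaningful, padding)
```

Even a fully specified row spends part of its size on alignment, and a row with unspecified attributes still occupies the full record size.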
Example of update buffer
If we have an update buffer of 1000 scenarios of 1000 `sym_load`, the buffer size is `24 * 1000 * 1000 = 24,000,000` bytes. However, we might only need to specify `id` and `p_specified`. If we could provide these two arrays separately, the total buffer size is `(8 + 4) * 1000 * 1000 = 12,000,000` bytes. The reduction in memory footprint is 50%!
Example of output buffer
If we get a result buffer of 1000 scenarios of 1000 `line`, the buffer size is `80 * 1000 * 1000 = 80,000,000` bytes. However, we might only need to know the `loading` output, not even `id`, since we already know the `id` order in the input. The buffer size is then `8 * 1000 * 1000 = 8,000,000` bytes. We can save 90% of the memory footprint!
Proposal
We propose to support columnar data buffers across the C-API (and further in the Python API). Both the PGM core and serialization need to support that.
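As a sanity check of the two examples above, the numbers can be reproduced with a few lines of Python. This is only a sketch: the record formats are assumptions that reproduce the 24-byte `sym_load` update record and the 80-byte `line` output record, and the helper names are hypothetical.

```python
import struct


def row_based_bytes(record_format, n_scenarios, n_elements):
    """Total size when every element carries the full record."""
    return struct.calcsize(record_format) * n_scenarios * n_elements


def columnar_bytes(attribute_formats, n_scenarios, n_elements):
    """Total size when only the selected attributes are stored, one array each."""
    per_element = sum(struct.calcsize(fmt) for fmt in attribute_formats)
    return per_element * n_scenarios * n_elements


def saving_percent(rows, cols):
    return 100 * (rows - cols) // rows


# Assumed sym_load update record: int32 id, int8 status, 3 pad bytes, 2 x float64
SYM_LOAD_UPDATE = "<ib3xdd"
# Assumed line output record: int32 id, int8 energized, 3 pad bytes, 9 x float64
LINE_OUTPUT = "<ib3x9d"

update_rows = row_based_bytes(SYM_LOAD_UPDATE, 1000, 1000)
update_cols = columnar_bytes(["<i", "<d"], 1000, 1000)  # id + p_specified
output_rows = row_based_bytes(LINE_OUTPUT, 1000, 1000)
output_cols = columnar_bytes(["<d"], 1000, 1000)  # loading only

print(update_rows, update_cols, saving_percent(update_rows, update_cols))
print(output_rows, output_cols, saving_percent(output_rows, output_cols))
```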
C-API
We already have the `dataset` concept in the C-API boundary. Therefore, this feature should not introduce a breaking change in the C-API. Concretely, we add additional functions such as `PGM_dataset_*_add_attribute_buffer` to allow the user to add columnar attribute buffers to the dataset. The user can call the dataset as below:
Python API
In the Python API, four non-breaking changes are expected.
`initialize_array`) or a dictionary of numpy homogeneous arrays (e.g. `{"id": [1, 2], "u_rated": [150e3, 10e3]}`).
`output_component_types`. The Python wrapper needs to decide whether to create a structured array or a dictionary of homogeneous arrays per component. We need to figure out how to maintain backwards compatibility.
Decision made on step 3 (deserialization):
For deserialization we support either row-based or column-based deserialization (function argument: Enum). If a user wants to deserialize to columnar data, the default is to deserialize all data present. A user can give an Optional function argument to specify the desired components and attributes. In that case, deserialization is followed by a filter for the specific components and attributes. Let's call this Optional function argument `filter`. Make sure this behavior is documented well, and document that providing a filter might result in loss of data.
Make id optional for batch update dataset in columnar format
From the user's perspective, the user would definitely like to provide a columnar batch dataset in which the `id` is not provided for a certain component. In that case, it should be inferred that the elements whose attributes are updated via the columnar buffer are in exactly the same sequence as in the input data. This is a realistic use case and will be appreciated by the user, since it saves the additional step of assigning exactly the same `id` as in the input data. The following Python code should work:
Implementation Proposal
To make this feature possible, the following implementation suggestions are proposed in the C++ core:
1. `DatasetHandler` to `Dataset`.
2. `Dataset`.
3. `Dataset`, add buffer control and iteration functionality. It can detect if a component buffer is row or column based and, in the column-based case, generate a temporary object holding the full struct for `MainModel` to consume.
4. `MainModel` to use the new `Dataset`. This also relates to "Remove the Dataset logic from PGM core, use DatasetHandler for MainModel" #431.
5. `Serializer`, it should directly read the row- and column-based buffers and serialize them to `msgpack` and `json`.
6. `Deserializer`, it should write the attributes in either row- or column-based form, depending on which buffers are set in the `WritableDataset`.
7. `is_update_independent` to make `id` an optional attribute in the batch update dataset.
   i. `is_update_independent` should be per component instead of for the whole dataset, so we can allow an individual `sequence` for each component.
   ii. If `id` of the row-based buffer is not all `NaN`, we use the current logic to determine if the component is independent.
   iii. If `id` of the row-based buffer is all `NaN`:
      - `elements_per_scenario` is not the same as the number of elements in the input data (in the model). An error should be raised.
      - `sequence` from `0` to `n_comp` for this component. This will be consumed by the update function so the update function does not do `id` lookup.
   iv. If the `id` attribute buffer is provided and it is not all `NaN`, we look at `id` to judge if the component is independent or not. We do not need to create a proxy, which would be a waste of time. Just look directly at the `id` buffer.
   v. If the `id` attribute buffer is not provided, or if the `id` is provided but its values are all `NaN`:
      - `elements_per_scenario` is not the same as the number of elements in the input data (in the model). An error should be raised.
      - `sequence` from `0` to `n_comp` for this component. This will be consumed by the update function so the update function does not do `id` lookup.
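The optional-`id` rule for a single component could be sketched roughly as follows. This is a Python illustration of the decision logic only, not the C++ core code. The function name `infer_sequence` and the constant `NA_INT_ID` are hypothetical, and the constant assumes that the null marker of an `int32` `id` (called `NaN` above) is the smallest 32-bit integer.

```python
NA_INT_ID = -(2**31)  # assumed null marker for an int32 id


def infer_sequence(ids, elements_per_scenario, n_comp):
    """Decide how one component of a batch update is matched to the model.

    ids: the id values of this component, or None if no id buffer is provided.
    Returns None when real ids are present (the regular id lookup applies),
    otherwise the implied sequence 0 .. n_comp - 1 in input order.
    """
    if ids is not None and any(i != NA_INT_ID for i in ids):
        # ids are provided and not all null: keep the current id-based logic
        return None
    if elements_per_scenario != n_comp:
        # without ids, the update must cover every element of the component
        raise ValueError(
            f"expected {n_comp} elements per scenario, got {elements_per_scenario}"
        )
    # no usable ids: elements are taken in exactly the input order
    return list(range(n_comp))


print(infer_sequence([3, 1, 2], 3, 3))  # ids given, id lookup is used
print(infer_sequence(None, 3, 3))  # no id buffer, sequence is implied
```

Because the function is called per component, each component gets its own `sequence`, in line with making `is_update_independent` a per-component property.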