
[FEATURE] Add support columnar data buffer to save memory usage #548

Closed
TonyXiang8787 opened this issue Mar 26, 2024 · 8 comments · Fixed by #814
Labels
feature New feature or request

Comments

TonyXiang8787 (Member) commented Mar 26, 2024

Background

Power Grid Model currently uses row-based buffers to share data across the C-API. For example, the memory layout of a node input buffer looks like:

| XXXX OOOO  XXXXXXXX  | XXXX OOOO  XXXXXXXX  | ... |
| id_0       u_rated_0 | id_1       u_rated_1 | ... |

Where X is a meaningful byte and O is a padding byte inserted to align the memory.
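A minimal numpy sketch of the same layout, using an aligned structured dtype (illustrative only; the core defines these structs in C++):

import numpy as np

# align=True reproduces the compiler's padding behavior shown above
node_dtype = np.dtype([("id", "i4"), ("u_rated", "f8")], align=True)
print(node_dtype.itemsize)  # 16: 4 bytes id + 4 padding bytes + 8 bytes u_rated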

In this way, the input/update/output data structs match exactly what we use in the calculation core. This delivers performance benefits: we avoid any copies because the memory layouts match exactly.

When we need to leave some attributes unspecified in input/update data, we set the pre-defined null value in that place so the core knows the value should be ignored.
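In the Python API this is what initialize_array already does: it pre-fills every attribute with the null values, so only the attributes the user assigns take effect. A small sketch (the exact integer null value is an implementation detail of the core):

from power_grid_model import initialize_array

sym_load_update = initialize_array("update", "sym_load", 2)
print(sym_load_update["id"])           # pre-filled with the integer null value
print(sym_load_update["p_specified"])  # pre-filled with NaN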

Problem

While this design is CPU-wise very efficient, it can be memory-wise inefficient for several reasons:

  1. If you have many unspecified attributes, you waste memory storing many null values.
  2. To align the data structure, the compiler adds padding, as shown above. That memory is never used.
  3. If we are only interested in certain attributes in the output, we still need to create the memory buffer for all the attributes. See the related issue: [FEATURE] Output results of only certain attributes #542.

There is a strong case for supporting columnar data buffers. Below are two real-world examples of this issue.

Example of update buffer

If we have an update buffer of 1000 scenarios of 1000 sym_load, the buffer size is 24 * 1000 * 1000 = 24,000,000 bytes. However, we might only need to specify id and p_specified. If we could provide these two arrays separately, the total buffer size would be (8 + 4) * 1000 * 1000 = 12,000,000 bytes: a 50% reduction in memory footprint!

Example of output buffer

If we get a result buffer of 1000 scenarios of 1000 line, the buffer size is 80 * 1000 * 1000 = 80,000,000 bytes. However, we might only need the loading output, not even id, since we already know the id order from the input. That buffer would be 8 * 1000 * 1000 = 8,000,000 bytes: a 90% saving in memory footprint!
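A minimal numpy sketch of the arithmetic behind both examples (the per-element sizes are the estimates quoted above):

import numpy as np

n_scenario, n_obj = 1000, 1000

# update example: columnar buffers for only id (int32) and p_specified (float64)
ids = np.empty((n_scenario, n_obj), dtype=np.int32)
p_specified = np.empty((n_scenario, n_obj), dtype=np.float64)
print(ids.nbytes + p_specified.nbytes)  # 12,000,000 bytes vs 24,000,000 row-based

# output example: a single columnar loading (float64) buffer
loading = np.empty((n_scenario, n_obj), dtype=np.float64)
print(loading.nbytes)  # 8,000,000 bytes vs 80,000,000 row-based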

Proposal

We propose to support columnar data buffers across the C-API (and further up in the Python API). Both the PGM core and the serialization need to support this.

C-API

We already have the dataset concept at the C-API boundary, so this feature should not introduce breaking changes in the C-API. Concretely, we add functions of the form PGM_dataset_*_add_attribute_buffer that allow the user to add columnar attribute buffers to a dataset. The user can use the dataset as below:

// create dataset
PGM_ConstDataset* dataset = PGM_create_dataset_const(handle, "input", 0, 0);

// add row-based buffer for node
PGM_dataset_const_add_buffer(handle, dataset, "node", 5, 5, nullptr, node_buffer);

// add an empty buffer for line by passing nullptr as data, but with the size definition.
// Decision made: do it this way, because nullptr is a common way in a C API to
// communicate that the buffer does not exist yet.
PGM_dataset_const_add_buffer(handle, dataset, "line", 5, 5, nullptr, nullptr);
// then add individual line attribute buffers
PGM_dataset_const_add_attribute_buffer(handle, dataset, "line", "id", line_id_buffer);
PGM_dataset_const_add_attribute_buffer(handle, dataset, "line", "r1", line_r1_buffer);

// add row-based buffer for sym_load
PGM_dataset_const_add_buffer(handle, dataset, "sym_load", 5, 5, nullptr, sym_load_buffer);
// the following call should return an error: it is not allowed to add an
// attribute buffer once the row-based buffer is set
PGM_dataset_const_add_attribute_buffer(handle, dataset, "sym_load", "p_specified", p_buffer);

// use the dataset for further actions: initialize a model, feed serialization, etc.

Python API

In the Python API, four non-breaking changes are expected.

  1. Everywhere a PGM dataset is expected, the user should be able to supply, per component, either a numpy structured array (returned by initialize_array) or a dictionary of homogeneous numpy arrays (e.g. {"id": [1, 2], "u_rated": [150e3, 10e3]}); see the sketch after this list.
  2. In the calculate functions, the user can put desired components and/or attributes in output_component_types. The Python wrapper then needs to decide whether to create a structured array or a dictionary of homogeneous arrays per component. We need to figure out how to maintain backwards compatibility.
  3. In deserialization, the user can specify whether they want row- or column-based arrays in the returned dataset. See the decision below for more information.
  4. In serialization, if the user provides a dataset that is entirely column-based, they have to provide the dataset type because there is no way to deduce it. As long as there is at least one row-based array in the user-provided dataset, we can still deduce the dataset type.
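A sketch of change 1, assuming the proposal lands as described (the dict form is the proposed extension, not yet an accepted input type):

import numpy as np
from power_grid_model import initialize_array

# existing row-based form: a numpy structured array
node_row = initialize_array("input", "node", 2)
node_row["id"] = [1, 2]
node_row["u_rated"] = [150e3, 10e3]

# proposed columnar form: a dict of homogeneous numpy arrays
node_col = {
    "id": np.array([1, 2], dtype=np.int32),
    "u_rated": np.array([150e3, 10e3], dtype=np.float64),
}

# both should be accepted wherever a PGM dataset is expected
input_data = {"node": node_col}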

Decision made on step 3 (deserialization):
For deserialization, we support either row-based or column-based deserialization (function argument: an Enum). If the user wants to deserialize to columnar data, the default is to deserialize all data present. The user can provide an Optional function argument to specify the desired components and attributes; in that case, deserialization plus a filter (for the specified components and attributes) takes place. Let's call this Optional function argument filter. Make sure this behavior is well documented, including the fact that providing a filter may result in loss of data.
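A hypothetical sketch of such a call (json_deserialize_from_file exists in power_grid_model.utils, but the dataset_format and filter arguments shown here are illustrative, not a settled interface):

from power_grid_model.utils import json_deserialize_from_file

# hypothetical arguments; see the decision above
update_data = json_deserialize_from_file(
    "update.json",
    dataset_format="columnar",                   # proposed Enum argument
    filter={"sym_load": ["id", "p_specified"]},  # proposed Optional filter
)
# note: the filter drops all other components/attributes (possible loss of data)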

Make id optional for batch update dataset in columnar format

From the user's perspective, it is clearly desirable to provide a columnar batch dataset in which id is not provided for a certain component. In that case, the core should infer that the elements whose attributes are updated via the columnar buffer are in exactly the same order as the input data. This is a realistic use case and will be appreciated by users, as it saves the extra step of assigning exactly the same ids as in the input data. The following Python code should work:

import numpy as np
from power_grid_model import PowerGridModel

n_step, n_sym_load = 1000, 1000  # number of scenarios and loads, for illustration

model = PowerGridModel(input_data=input_data)
result = model.calculate_power_flow(
    update_data={
        "sym_load": {"p_specified": np.random.randn(n_step, n_sym_load)}
    }
)

Implementation Proposal

To make this feature possible, the following implementation steps are proposed for the C++ core:

  1. (done) Rename DatasetHandler to Dataset.
  2. (done) Remove the old Dataset.
  3. In the new Dataset, add buffer control and iteration functionality. It can detect whether a component buffer is row- or column-based and, in the columnar case, generate a temporary object holding the full struct for MainModel to consume.
  4. (done) Rewrite MainModel to use the new Dataset. This also relates to Remove the Dataset logic from PGM core, use DatasetHandler for MainModel #431.
  5. The Serializer should directly read the row- and column-based buffers and serialize them to msgpack and json.
  6. The Deserializer should write the attributes either row- or column-based, depending on which buffers are set in the WritableDataset.
  7. Make id optional in the update dataset: in the main core, we need special treatment in is_update_independent to make id an optional attribute in the batch update dataset. A Python sketch of this check follows the list.
    1. is_update_independent should be per component instead of per whole dataset, so we can allow an individual sequence for each component.
    2. For a certain component, if the buffer is row-based:
      1. If the ids in the row-based buffer are not all null, we use the current logic to determine whether the component is independent.
      2. If the ids in the row-based buffer are all null:
        1. If the buffer is not uniform, or it is uniform but elements_per_scenario differs from the number of elements of that component in the input data (in the model), an error should be raised.
        2. If the above check passes, we assume the component buffer is independent and generate a sequence from 0 to n_comp for this component. The update function consumes this sequence, so it does not do id lookup.
    3. For a certain component, if the buffer is columnar:
      1. If an id attribute buffer is provided and it is not all null, we look at the ids to judge whether the component is independent. We do not need to create a proxy object, which would be a waste of time; just look directly at the id buffer.
      2. If no id attribute buffer is provided, or ids are provided but all null:
        1. If the buffer is not uniform, or it is uniform but elements_per_scenario differs from the number of elements of that component in the input data (in the model), an error should be raised.
        2. If the above check passes, we assume the component buffer is independent and generate a sequence from 0 to n_comp for this component. The update function consumes this sequence, so it does not do id lookup.
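A minimal Python sketch of the per-component check in step 7 (the function name, the NA_ID constant, and the simplified independence test are illustrative; the real implementation lives in the C++ core):

import numpy as np

NA_ID = np.iinfo(np.int32).min  # assumed null value for integer ids

def is_component_independent(ids, uniform, elements_per_scenario, n_comp):
    # ids is None for a columnar buffer without an id attribute buffer;
    # for a row-based buffer it is the id column, possibly all null
    has_ids = ids is not None and not np.all(ids == NA_ID)
    if has_ids:
        # simplified stand-in for the current logic (uniform buffers only):
        # independent if every scenario updates the same ids
        per_scenario = np.asarray(ids).reshape(-1, elements_per_scenario)
        return bool(np.all(per_scenario == per_scenario[0]))
    # id-less buffer: must be uniform and cover all elements of the component
    if not uniform or elements_per_scenario != n_comp:
        raise ValueError("id-less update buffer must be uniform and dense")
    # independent: consumed with the generated sequence 0..n_comp-1,
    # so the update function skips id lookup
    return True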
@TonyXiang8787 TonyXiang8787 added the feature New feature or request label Mar 26, 2024
@TonyXiang8787 TonyXiang8787 changed the title [FEATURE] Add support columnar data buffer to save memory usage (WIP) [FEATURE] Add support columnar data buffer to save memory usage Mar 26, 2024
mgovers (Member) commented Apr 18, 2024

This will also make the PyArrow conversions easier ( https://github.com/orgs/PowerGridModel/discussions/2 )

Relates to PowerGridModel/power-grid-model-io#190

nitbharambe (Member) commented Aug 12, 2024

An alternative idea for Python API steps 2 and 3:

Additional arguments to the deserialization functions

Old proposal: two arguments:

  • data_filter as set[ComponentType] | list[ComponentType] | None | dict[ComponentType, set[str] | list[str] | None]
  • dataset_format as Enum: row_based or columnar

New proposal: only one additional argument:

  • data_filter as set[ComponentType] | list[ComponentType] | None | ... | dict[ComponentType, set[str] | list[str] | None | ...]
    PGM deduces the dataset format from that data_filter.

Note: the difference between the two options is the ... (Ellipsis) option.

Example

# All row based
data_filter = None

# All column based
data_filter = ...

# specific components row based
data_filter = ["node", "line"]
data_filter = {"node", "line"}

# specific components, mixed row-based and columnar data
data_filter = {
  "node": None,       # row-based
  "line": ["id", "p_from"],  # columnar with these attributes
  "transformer": ...         # columnar with all attributes # <--- (proposed) Ellipsis is a python built-in constant
} 
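A hypothetical sketch of how the wrapper could deduce the per-component format from data_filter (the helper is illustrative, not a settled API):

def is_columnar_request(component_filter) -> bool:
    # Ellipsis or an attribute collection means columnar; None means row-based
    return component_filter is ... or isinstance(component_filter, (list, set))

data_filter = {"node": None, "line": ["id", "p_from"], "transformer": ...}
formats = {comp: ("columnar" if is_columnar_request(flt) else "row-based")
           for comp, flt in data_filter.items()}
# {'node': 'row-based', 'line': 'columnar', 'transformer': 'columnar'}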

mgovers (Member) commented Sep 6, 2024

Implementation proposal step 7.ii:

  1. Make id optional in update dataset: in the main core, we need to have special treatment in is_update_independent to make id as optional attribute in the batch update dataset.
    ii. For a certain component, if the buffer is row-based, we use the same logic as is now.

Loosening the requirement that id be present in columnar data means that the serialized data may not contain id anymore. If a columnar dataset without ids is serialized and then deserialized as row-based dataset, then the row-based dataset does not contain ids anymore either (all na_ID). According to your proposed logic, this is not allowed to be used as update data. This breaks the symmetry between row-based datasets and columnar datasets. I don't think we should do that, which means that the id-less lookup logic should also be updated for row-based datasets

TonyXiang8787 (Member, Author) commented:

Implementation proposal step 7.ii:

  1. Make id optional in update dataset: in the main core, we need to have special treatment in is_update_independent to make id as optional attribute in the batch update dataset.
    ii. For a certain component, if the buffer is row-based, we use the same logic as is now.

Loosening the requirement that id be present in columnar data means that the serialized data may not contain id anymore. If a columnar dataset without ids is serialized and then deserialized as row-based dataset, then the row-based dataset does not contain ids anymore either (all na_ID). According to your proposed logic, this is not allowed to be used as update data. This breaks the symmetry between row-based datasets and columnar datasets. I don't think we should do that, which means that the id-less lookup logic should also be updated for row-based datasets

Good point. I have adjusted the step proposal.

mgovers (Member) commented Oct 22, 2024

Continuing two threads from #799 here for potential follow-up improvements to columnar data, because this issue is still alive and kicking. We can convert this to one or more new issues if necessary:

  • Since we don't have a create-buffer function for columnar buffers (probably because there is no value in one, given the nature of columnar data), we should document this somewhere, as there is an "inconsistency" in how row-based and columnar data are managed at the C API level. Or at least those are my thoughts; it may lead to confusion from the user perspective.

I've also been thinking about that, but I don't know where to put it. Do you have a good proposal?

Not sure yet; it probably makes sense to comment something about it directly under PGM_create_buffer, but it should probably also go standalone somewhere else.

  • Should we (at some point) make something like is_columnar available through the C API? It may not be needed, but given that we offer functions to check whether something is a batch and what its size is, it would probably go along with those (and may even provide value on the Python side).

I think that makes sense, but maybe out of scope of v1.10.0. @TonyXiang8787 do you have any thoughts on this?

mgovers (Member) commented Oct 29, 2024

Implementation proposal step 7.ii:

  1. Make id optional in update dataset: in the main core, we need to have special treatment in is_update_independent to make id as optional attribute in the batch update dataset.
    ii. For a certain component, if the buffer is row-based, we use the same logic as is now.

Loosening the requirement that id be present in columnar data means that the serialized data may not contain id anymore. If a columnar dataset without ids is serialized and then deserialized as row-based dataset, then the row-based dataset does not contain ids anymore either (all na_ID). According to your proposed logic, this is not allowed to be used as update data. This breaks the symmetry between row-based datasets and columnar datasets. I don't think we should do that, which means that the id-less lookup logic should also be updated for row-based datasets

Good point. I have adjusted the step proposal.

No action items, but since there was a miscommunication within the team, I will add the full customer journey illustrated in #548 (comment) here:

  1. a production environment that uses columnar data.
    a. to save on data, they do not use IDs in their update data

  2. an error occurs in the production environment
    a. they catch this and the POD automatically creates debug input and update data using json_serialize_to_file (without any additional options) so that they have a repro case
    b. the update data now contains no IDs

  3. the dev starts debugging.
    a. they create a model from the input data
    b. they use json_deserialize_from_file (without any additional options)
    c. they obtain a row-based dataset (not columnar)
    d. their loaded update data does not contain IDs

  4. debugging in progress
    a. assuming row-based and columnar behave differently:

    1. they cannot reproduce the same error and dive into a rabbit hole that is not what they want to debug
    2. we get a support request. we easily find the error, cause it's pretty obvious what's up
    3. they complain because they did not expect the two datasets to behave differently

    b. assuming row-based and columnar do behave the same:

    1. they can reproduce the error
    2. they find the problem and solve it by themselves

mgovers (Member) commented Nov 6, 2024

After #814 , the only thing remaining is:

  1. Make id optional in update dataset: in the main core, we need to have special treatment in is_update_independent to make id as optional attribute in the batch update dataset.

    i. is_update_independent should be per component instead of the whole dataset. So we can allow individual sequence for each component.

I propose to make a separate issue for that so that this issue can be closed before the release.

figueroa1395 (Contributor) commented Nov 7, 2024

List of PRs related to this issue: #676, #678, #663, #708, #687, #716, #712, #711, #732, #734, #746, #757, #699, #759, #758, #760, #768, #783, #785, #788, #799, #801, #807, #814, #817, #819, #821
