Feature/columnar data serialization #708
Signed-off-by: Martijn Govers <[email protected]>
The deserialization part is looking good now.
Quality Gate passed
@mgovers The PR is now in the merge queue. After the merge, could you run a serialization benchmark (using your old dataset) to compare the row-based (de-)serialization performance of the old and new versions? There should be no significant change.
Bad news: the parsing step in the deserializer takes significantly longer. Will update.
dataset shape
v1.9.35
v1.9.34
file sizes
I will follow up on this.
@mgovers please format the table of 1.9.35
benchmark script

# SPDX-FileCopyrightText: Contributors to the Power Grid Model project <[email protected]>
#
# SPDX-License-Identifier: MPL-2.0

from pathlib import Path
from timeit import timeit
from typing import Callable

import numpy as np

from power_grid_model import initialize_array
from power_grid_model.core.dataset_definitions import ComponentType, DatasetType
from power_grid_model.utils import (
    json_deserialize_from_file,
    json_serialize_to_file,
    msgpack_deserialize_from_file,
    msgpack_serialize_to_file,
)


# def generate(shape):
#     data = {}
#     for comp in ComponentType:
#         arr = initialize_array(DatasetType.update, comp, shape=shape, empty=True)
#         for attribute in arr.dtype.names:
#             arr[attribute] = np.random.rand(*arr[attribute].shape)
#         data[comp] = arr
#     print("done generating")
#     return data


def serialize(func: Callable, file_path: Path, data, **kwargs):
    def execute():
        print(f"serializing: {file_path}")
        func(file_path, data, **kwargs)
        print(f"done serializing: {file_path}")

    print(f"timed {func.__name__}({file_path}): {timeit(execute, number=1)} s")


def deserialize(func: Callable, file_path: Path):
    def execute():
        print(f"deserializing: {file_path}")
        func(file_path)
        print(f"done deserializing: {file_path}")

    print(f"timed {func.__name__}({file_path}): {timeit(execute, number=1)} s")


if __name__ == "__main__":
    shape = (800, 800)
    # update_data = generate(shape)
    update_data = msgpack_deserialize_from_file(Path("tmp/data/base_data.msgpack"))

    serialize(msgpack_serialize_to_file, Path("tmp/data/all_random_compact.msgpack"), update_data, use_compact_list=True)
    deserialize(msgpack_deserialize_from_file, Path("tmp/data/all_random_compact.msgpack"))

    serialize(msgpack_serialize_to_file, Path("tmp/data/all_random.msgpack"), update_data)
    deserialize(msgpack_deserialize_from_file, Path("tmp/data/all_random.msgpack"))

    serialize(json_serialize_to_file, Path("tmp/data/all_random_compact.json"), update_data, use_compact_list=True)
    deserialize(json_deserialize_from_file, Path("tmp/data/all_random_compact.json"))

    serialize(json_serialize_to_file, Path("tmp/data/all_random.json"), update_data)
    deserialize(json_deserialize_from_file, Path("tmp/data/all_random.json"))

test script

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <[email protected]>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    // read the whole JSON file into a string
    std::string str = [] {
        std::ifstream f{"<some_file>.json"};
        std::stringstream sstr;
        sstr << f.rdbuf();
        return sstr.str();
    }();

    auto deserializer = Deserializer{from_json, str, meta_data_gen::meta_data};
    // NOTE: as posted. Later in this thread the author identifies that this should be `auto&`;
    // with plain `auto` the buffers below are set on a copy of the dataset.
    auto dataset = deserializer.get_dataset_info();
    auto info = dataset.get_description();

    // allocate one raw (row-based) buffer per component and register it with the dataset
    std::vector<std::vector<char>> buffers{};
    for (Idx idx = 0; idx < meta_data_gen::meta_data.get_dataset("update").n_components(); ++idx) {
        buffers.push_back(
            std::vector<char>(info.component_info[idx].total_elements * info.component_info[idx].component->size));
        dataset.set_buffer(info.component_info[idx].component->name, nullptr, buffers[idx].data());
    }

    deserializer.parse();
    return 0;
}
@mgovers there is a major overhead (>90%) in the new de-serializer. We need to dump the data into a file and write a C++ executable to run the data. Then we can profile the overhead.
I found out what's wrong. Attempting to fix now.
same test script but using msgpack:

// SPDX-FileCopyrightText: Contributors to the Power Grid Model project <[email protected]>
//
// SPDX-License-Identifier: MPL-2.0

#include <power_grid_model/auxiliary/input.hpp>
#include <power_grid_model/auxiliary/meta_data_gen.hpp>
#include <power_grid_model/auxiliary/serialization/deserializer.hpp>
#include <power_grid_model/auxiliary/update.hpp>

#include <filesystem>
#include <fstream>
#include <ios>
#include <vector>

namespace {
using namespace power_grid_model;
using namespace power_grid_model::meta_data;
} // namespace

int main() {
    // read the whole msgpack file into a byte buffer
    std::vector<char> result = [] {
        constexpr auto file_path = "<msgpack_file>";
        std::vector<char> result(std::filesystem::file_size(file_path));
        std::ifstream f{file_path, std::ios::binary};
        f.read(result.data(), static_cast<std::streamsize>(result.size()));
        return result;
    }();

    auto deserializer = Deserializer{from_msgpack, result, meta_data_gen::meta_data};
    // NOTE: as posted. Later in this thread the author identifies that this should be `auto&`;
    // with plain `auto` the buffers below are set on a copy of the dataset.
    auto dataset = deserializer.get_dataset_info();
    auto info = dataset.get_description();

    // allocate one raw (row-based) buffer per component and register it with the dataset
    std::vector<std::vector<char>> buffers{};
    for (Idx idx = 0; idx < meta_data_gen::meta_data.get_dataset("update").n_components(); ++idx) {
        buffers.push_back(
            std::vector<char>(info.component_info[idx].total_elements * info.component_info[idx].component->size));
        dataset.set_buffer(info.component_info[idx].component->name, nullptr, buffers[idx].data());
    }

    deserializer.parse();
    return 0;
}
    } else {
        parse_component(columnar, component_idx);
    }
I found out that this call will cause major overhead when no attributes are present. That means that the logic concerning empty attribute buffers is incorrect
-    } else {
-        parse_component(columnar, component_idx);
-    }
+    } else if (!dataset_handler_.get_buffer(component_idx).attributes.empty()) {
+        parse_component(columnar, component_idx);
+    }
This is not a sustainable fix (it points to some other underlying problem), but it will resolve the issue in the short term.
data | serializer creation | serialization | serialization total | deserializer creation + pre-parse | deserializer parse | deserializer total |
---|---|---|---|---|---|---|
msgpack [compact] | 0.0004923000233247876 s | 2.3669713999843225 s | 3.0552195999771357 s | 0.28873490006662905 s | 2.858401000034064 s | 3.285264000063762 s |
msgpack [regular] | 0.0007754000835120678 s | 2.9432286999654025 s | 3.9057689999463037 s | 0.5173835000023246 s | 5.383461399935186 s | 6.166236900025979 s |
json [compact] | 0.00043050001841038465 s | 29.926429499988444 s | 63.13568380009383 s | 14.445208700024523 s | 3.2074557000305504 s | 19.59107560000848 s |
json [regular] | 0.0008066999725997448 s | 37.490393099957146 s | 78.83358660002705 s | 18.796793300076388 s | 5.285955099971034 s | 26.85202120000031 s |
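
To make the totals in the table easier to compare, the ratio between the regular and compact layouts can be computed directly from the reported deserializer totals. The following is only a minimal sketch: the dictionary and the helper `regular_over_compact` are hypothetical names introduced for illustration, and the numbers are copied from the "deserializer total" column above.

```python
# Deserializer totals in seconds, copied from the "deserializer total" column of the table.
# The dictionary layout and the helper name are assumptions made for this sketch.
deserializer_total_s = {
    ("msgpack", "compact"): 3.285264000063762,
    ("msgpack", "regular"): 6.166236900025979,
    ("json", "compact"): 19.59107560000848,
    ("json", "regular"): 26.85202120000031,
}


def regular_over_compact(fmt: str) -> float:
    """Return how many times longer the regular layout takes than the compact one."""
    return deserializer_total_s[(fmt, "regular")] / deserializer_total_s[(fmt, "compact")]


for fmt in ("msgpack", "json"):
    print(f"{fmt}: regular / compact = {regular_over_compact(fmt):.2f}")
```

For this single run, the regular layout takes a bit under twice as long as the compact one for msgpack and roughly 1.4 times as long for JSON. Since the benchmark script times each call only once, these ratios are indicative only.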
What has changed in your suggestion to avoid the overhead?
Your test scripts are testing row-based de-serialization, right? So the `else` or `else if` in the segment is never reached.
Turns out it is... there's definitely an issue. I have very little time left before I need to go, though. Should we roll back?
OK, found the problem in my test script (`auto` where there should've been `auto&` when declaring the dataset).
I see. So, the buffers are never set in the de-serializer. And all the components are treated column-based with no attributes.
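
The copy-versus-reference pitfall described in this thread can be illustrated with a small Python analogue. This is only a sketch under stated assumptions: `DatasetInfo`, `Deserializer` and `n_buffers_seen_by_parse` are hypothetical stand-ins and not the power-grid-model API, and `copy.deepcopy` is used to mimic what C++ `auto dataset = ...` does when the getter returns a reference.

```python
import copy


class DatasetInfo:
    """Hypothetical stand-in for the dataset object owned by the deserializer."""

    def __init__(self):
        self.buffers = {}

    def set_buffer(self, component, data):
        self.buffers[component] = data


class Deserializer:
    """Hypothetical stand-in; it only models who owns the dataset."""

    def __init__(self):
        self._dataset = DatasetInfo()

    def get_dataset_info(self):
        # Returns the internally owned dataset (a reference, as in the C++ getter).
        return self._dataset

    def n_buffers_seen_by_parse(self):
        # A real parse step would read from the internally owned dataset.
        return len(self._dataset.buffers)


# Mimics C++ `auto dataset = deserializer.get_dataset_info();`: the buffer is set on a copy.
with_copy = Deserializer()
dataset_copy = copy.deepcopy(with_copy.get_dataset_info())
dataset_copy.set_buffer("node", bytearray(8))

# Mimics C++ `auto& dataset = deserializer.get_dataset_info();`: the buffer reaches the owner.
with_ref = Deserializer()
dataset_ref = with_ref.get_dataset_info()
dataset_ref.set_buffer("node", bytearray(8))

print(with_copy.n_buffers_seen_by_parse())  # 0: the deserializer never sees the buffer
print(with_ref.n_buffers_seen_by_parse())   # 1: the buffer is registered as intended
```

In the copy case the deserializer's own dataset never receives any buffers, which matches the observation above that the buffers were never set in the de-serializer.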
Step 5 & 6 of Implementation #548