Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DF] Add initial implementation for snapshotting to RNTuple #15750

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Jun 4, 2024

This PR adds a first iteration of snapshotting to RNTuple from an RDataFrame. It uses the existing Snapshot interface, with an addition to RSnapshotOptions, kOutputFormat. This option can be set to write to either TTree, RNTuple, or take the default choice. The table below describes how Snapshot behaves accoring to the output format option:

From TTree From RNTuple From other DS
To TTree ESnapshotOutputFormat::kDefault ESnapshotOutputFormat::kTTree ESnapshotOutputFormat::kDefault
To RNTuple Not yet possible, will be added in a follow-up, using functionality from RNTupleImporter ESnapshotOutputFormat::kDefault ESnapshotOutputFormat::kRNTuple

Implementation

As mentioned, the existing Snapshot interface is used. A new SnapshotRNTupleHelper has been created to handle the creation and writing of the RNTuple, akin to the existing SnapshotHelper (which has been renamed to SnapshotTTreeHelper for consistency).

RLoopManager data source initialization (rev bbf221f)

The snapshot action creates a new loop manager which manages the snapshotted data set. The loop manager gets initialized before the actual snapshotting takes place. Originally, the pointer to the data source owned by the loop manager was marked as const. Because the RNTuple's data source has to be created after the loop manager, for this PR the const qualifier has been dropped.

Move ROOT::RDF::Experimental::FromRNTuple (rev 0a29b02)

For snapshotting RNTuples, we need to include the header file for RNTupleDS in ActionHelpers.hxx. To avoid dependency conflicts related to including ROOT/RDataFrame.hxx, the free FromRNTuple functions have been moved to a separate header.

Current limitations and follow-ups

This PR adds the minimal functionality for (single-threaded) snapshotting to RNTuple. A number of follow-ups are foreseen:

RNTuple write options

Currently no RNTuple-specific write options have been added to RSnapshotOptions yet, except for compression settings which were already present as an option. Adding (a subset) of the other RNTupleWriteOptions is trivial.

Default compression settings

RSnapshotOptions' default compression setting is 101 (Zlib). However, RNTuple's default compression setting is 505 (zstd). We could change the default compression setting to kInherit and decide which settings to use according to the target data format (unless explicitly set by the user, of course).

Multithreaded snapshotting

This PR only adds single-threaded RNTuple snapshotting. Multithreaded (and parallel) snapshotting will be addressed in a follow-up PR.

Tests

Corresponding roottest PR: root-project/roottest#1178

Tests for Windows have been disabled, due to permission denied-errors related to trying to recreate currently open TFiles. The regular snapshot tests have also been disabled for Windows, presumably for the same reason.

Copy link

github-actions bot commented Aug 16, 2024

Test Results

    13 files      13 suites   2d 18h 49m 0s ⏱️
 3 026 tests  3 026 ✅ 0 💤 0 ❌
33 817 runs  33 817 ✅ 0 💤 0 ❌

Results for commit fbc8f74.

♻️ This comment has been updated with latest results.

@enirolf enirolf force-pushed the rdf-ntuple-snapshot branch 3 times, most recently from c086453 to 3c28455 Compare August 19, 2024 12:04
@enirolf enirolf added the clean build Ask CI to do non-incremental build on PR label Aug 19, 2024
To be able to snapshot to RNTuple, we need to be able to initialize the
data source *after* the loop manager has been created. As an effect,
`RInterface` cannot statically store a (raw) pointer to this data source
anymore, so instead we introduce a getter that returns the loop
manager's current data source.
For snapshotting RNTuples, we need to include the header file for
RNTupleDS in `ActionHelpers.hxx`. To avoid dependency conflicts related
to including `ROOT/RDataFrame.hxx`, we move the free `FromRNTuple`
functions to a separate header.
@enirolf enirolf force-pushed the rdf-ntuple-snapshot branch from 3c28455 to c86bee3 Compare August 26, 2024 08:51
@enirolf enirolf removed the clean build Ask CI to do non-incremental build on PR label Aug 26, 2024
@enirolf enirolf force-pushed the rdf-ntuple-snapshot branch from c86bee3 to 932771a Compare August 26, 2024 11:57
This initial implemenation only supports single-threaded snapshotting.
When the original datasource is an RNTuple, the resulting snapshot will
be an RNTuple by default. This can be changed to TTree through the
RSnapshotOptions. Snapshotting from TTree to RNTuple is not yet
supported in this version, but will be added in the future. Snapshotting
from other data sources or dataframes created from scratch to RNTuple is
supported.
@enirolf enirolf force-pushed the rdf-ntuple-snapshot branch from 932771a to fbc8f74 Compare August 26, 2024 13:57
enirolf added a commit to enirolf/roottest that referenced this pull request Aug 26, 2024
Necessary for Snapshot-related tests, now that snapshotting to RNTuple
is being added (root-project/root#15750).
@enirolf enirolf changed the title [WIP][DF] Add snapshotting to RNTuple [DF] Add initial implementation for snapshotting to RNTuple Aug 27, 2024
@enirolf enirolf marked this pull request as ready for review August 27, 2024 07:13
@enirolf enirolf requested review from jblomer and dpiparo August 27, 2024 07:13
Copy link
Member

@vepadulano vepadulano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this effort! It will be a great addition to RDF's capabilities.


/// Helper function to update the value of an RNTuple's field in the provided entry.
template <typename T>
void SetFieldsHelper(T value, std::string_view fieldName, ROOT::Experimental::REntry *entry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void SetFieldsHelper(T value, std::string_view fieldName, ROOT::Experimental::REntry *entry)
void SetFieldsHelper(const T &value, std::string_view fieldName, ROOT::Experimental::REntry *entry)

Comment on lines +1888 to +1889
SnapshotRNTupleHelper(const SnapshotRNTupleHelper &) = delete;
SnapshotRNTupleHelper(SnapshotRNTupleHelper &&) = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the assignment operators definition for rule of 5

fOutputEntry = &model->GetDefaultEntry();

fOutFile = std::unique_ptr<TFile>(
TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/fFileName.c_str() /* , cs */));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the comment about /* , cs */?

std::string fFileName;
std::string fNTupleName;

std::unique_ptr<TFile> fOutFile = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the = is used whereas the fWriter data member is uniform-initialised, I suggest we choose one of the two syntaxes for both.

Comment on lines +1932 to +1939
if (checkupdate == "update") {
fWriter = ROOT::Experimental::RNTupleWriter::Append(std::move(model), fNTupleName, *fOutFile, writeOptions);
} else {
// If the output file is not closed before the RNTupleWriter is created, no actual data will be written and
// the file will be empty.
fOutFile->Close();
fWriter = ROOT::Experimental::RNTupleWriter::Recreate(std::move(model), fNTupleName, fFileName, writeOptions);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the logic here, we are opening a TFile and then checking if the mode is "update". Only if this is true then it makes sense to have the TFile around, otherwise it is the responsibility of the RNTupleWriter to deal with the file opening. In this sense, I'd rather have all the logic related to fOutFile in the if scope, so that we don't need to remember to close it in the else scope.

{
ColumnNames_t parentFields;

std::copy_if(columnNames.cbegin(), columnNames.cend(), std::back_inserter(parentFields),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, here we know that the "." uniquely identifies the fields available in RDataFrame but that are not actual fields of the RNTuple, because in any case the RNTuple field names cannot have a "." character, right?


TEST(RDFSnapshotRNTuple, FromScratchTemplated)
{
const auto filename = "RDFSnapshotRNTuple_from_scratch_templated.root";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in other tests, we should make sure that the output file is properly cleaned at the end of the test, some RAII file guard should help

{
ROOT_EXPECT_WARNING(BookLazySnapshot(), "Snapshot", "A lazy Snapshot action was booked but never triggered.");
// This returns FALSE if the file IS there
EXPECT_TRUE(gSystem->AccessPathName("lazysnapshotnottriggered_shouldnotbecreated.root"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


auto sdf2 = df.Snapshot("ntuple", "RDFSnapshotRNTuple_inner_fields.root", "jets.electrons");

expected = {"jets_electrons", "jets_electrons.pt"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was not possible to get an output field name with a "." in it.

EXPECT_EQ(expected, sdf->GetColumnNames());
}

TEST_F(RDFSnapshotRNTupleTest, ComplexFields)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests with complex fields are very nice to have! Can we also include checks for the actual values in the output RNTuple?

@hahnjo
Copy link
Member

hahnjo commented Sep 6, 2024

Can you expand on why we cannot snapshot from TTree to RNTuple? In general, I'm not sure if it's a good idea to change the default behavior depending on the type of the datasource. Ideally, kDefault would for now always mean kTTree but be kind of deprecated from the beginning to get users to choose which snapshot format they want. Then at a later point, we could remove it entirely and force users to choose the wanted format, to eventually re-introduce kDefault in ROOT7 to mean kRNTuple...

{
fWriter.reset();
}
fLoopManager->SetDataSource(std::make_unique<ROOT::Experimental::RNTupleDS>(fNTupleName, fFileName));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's been a while, but why are we replacing the data source? I don't see this done for the existing SnapshotTTreeHelper, but maybe it is hidden somewhere because of how TTree is special in RDF...

I'm asking because I guess this one / the reason for the compatibility matrix in the PR summary. Also I'm not sure if this change of the data source is actually sound, I think with some Redefines I can construct cases where filters are not idempotent and thus a snapshot + replacing the data source might change results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer myself: Snapshot directly calls CreateLMFromTTree with a non-existing file name that is then later written to during the action. See also Vincenzo's comment https://github.com/root-project/root/pull/15750/files#r1741566386

if (outObj->InheritsFrom("TTree")) {
static_cast<TTree *>(outObj)->Delete("all");
} else {
outFile->Delete(ntupleName.c_str());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here and other outFile->Delete calls you might mean:

Suggested change
outFile->Delete(ntupleName.c_str());
outFile->Delete((ntupleName + ";*").c_str());

otherwise the TTree case is the only one deleting older cycles.


public:
using ColumnTypes_t = TypeList<ColTypes...>;
SnapshotRNTupleHelper(std::string_view filename, std::string_view ntuplename, const ColumnNames_t &vfnames,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intentionally not support storing the snapshot into a sub-directory?

};

#ifdef R__HAS_ROOT7
void ValidateSnapshotRNTupleOutput(const RSnapshotOptions &opts, const std::string &ntupleName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is missing a doc string and the name seems at odd with the functionality. Most often a validate routine would return a 'bool' and leave its input unchanged (eg. IsValid)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants