-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DF] Add initial implementation for snapshotting to RNTuple #15750
base: master
Are you sure you want to change the base?
Conversation
9097b0a
to
6fc819e
Compare
Test Results 13 files 13 suites 2d 18h 49m 0s ⏱️ Results for commit fbc8f74. ♻️ This comment has been updated with latest results. |
c086453
to
3c28455
Compare
To be able to snapshot to RNTuple, we need to be able to initialize the data source *after* the loop manager has been created. As an effect, `RInterface` cannot statically store a (raw) pointer to this data source anymore, so instead we introduce a getter that returns the loop manager's current data source.
For snapshotting RNTuples, we need to include the header file for RNTupleDS in `ActionHelpers.hxx`. To avoid dependency conflicts related to including `ROOT/RDataFrame.hxx`, we move the free `FromRNTuple` functions to a separate header.
3c28455
to
c86bee3
Compare
c86bee3
to
932771a
Compare
This initial implemenation only supports single-threaded snapshotting. When the original datasource is an RNTuple, the resulting snapshot will be an RNTuple by default. This can be changed to TTree through the RSnapshotOptions. Snapshotting from TTree to RNTuple is not yet supported in this version, but will be added in the future. Snapshotting from other data sources or dataframes created from scratch to RNTuple is supported.
932771a
to
fbc8f74
Compare
Necessary for Snapshot-related tests, now that snapshotting to RNTuple is being added (root-project/root#15750).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this effort! It will be a great addition to RDF's capabilities.
|
||
/// Helper function to update the value of an RNTuple's field in the provided entry. | ||
template <typename T> | ||
void SetFieldsHelper(T value, std::string_view fieldName, ROOT::Experimental::REntry *entry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void SetFieldsHelper(T value, std::string_view fieldName, ROOT::Experimental::REntry *entry) | |
void SetFieldsHelper(const T &value, std::string_view fieldName, ROOT::Experimental::REntry *entry) |
SnapshotRNTupleHelper(const SnapshotRNTupleHelper &) = delete; | ||
SnapshotRNTupleHelper(SnapshotRNTupleHelper &&) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing the assignment operators definition for rule of 5
fOutputEntry = &model->GetDefaultEntry(); | ||
|
||
fOutFile = std::unique_ptr<TFile>( | ||
TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/fFileName.c_str() /* , cs */)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the comment about /* , cs */
?
std::string fFileName; | ||
std::string fNTupleName; | ||
|
||
std::unique_ptr<TFile> fOutFile = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the =
is used whereas the fWriter
data member is uniform-initialised, I suggest we choose one of the two syntaxes for both.
if (checkupdate == "update") { | ||
fWriter = ROOT::Experimental::RNTupleWriter::Append(std::move(model), fNTupleName, *fOutFile, writeOptions); | ||
} else { | ||
// If the output file is not closed before the RNTupleWriter is created, no actual data will be written and | ||
// the file will be empty. | ||
fOutFile->Close(); | ||
fWriter = ROOT::Experimental::RNTupleWriter::Recreate(std::move(model), fNTupleName, fFileName, writeOptions); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the logic here, we are opening a TFile and then checking if the mode is "update". Only if this is true then it makes sense to have the TFile around, otherwise it is the responsibility of the RNTupleWriter to deal with the file opening. In this sense, I'd rather have all the logic related to fOutFile
in the if
scope, so that we don't need to remember to close it in the else
scope.
{ | ||
ColumnNames_t parentFields; | ||
|
||
std::copy_if(columnNames.cbegin(), columnNames.cend(), std::back_inserter(parentFields), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, here we know that the "."
uniquely identifies the fields available in RDataFrame but that are not actual fields of the RNTuple, because in any case the RNTuple field names cannot have a "."
character, right?
|
||
TEST(RDFSnapshotRNTuple, FromScratchTemplated) | ||
{ | ||
const auto filename = "RDFSnapshotRNTuple_from_scratch_templated.root"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and in other tests, we should make sure that the output file is properly cleaned at the end of the test, some RAII file guard should help
{ | ||
ROOT_EXPECT_WARNING(BookLazySnapshot(), "Snapshot", "A lazy Snapshot action was booked but never triggered."); | ||
// This returns FALSE if the file IS there | ||
EXPECT_TRUE(gSystem->AccessPathName("lazysnapshotnottriggered_shouldnotbecreated.root")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably nicer to use https://en.cppreference.com/w/cpp/filesystem/exists
|
||
auto sdf2 = df.Snapshot("ntuple", "RDFSnapshotRNTuple_inner_fields.root", "jets.electrons"); | ||
|
||
expected = {"jets_electrons", "jets_electrons.pt"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was not possible to get an output field name with a "."
in it.
EXPECT_EQ(expected, sdf->GetColumnNames()); | ||
} | ||
|
||
TEST_F(RDFSnapshotRNTupleTest, ComplexFields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests with complex fields are very nice to have! Can we also include checks for the actual values in the output RNTuple?
Can you expand on why we cannot snapshot from TTree to RNTuple? In general, I'm not sure if it's a good idea to change the default behavior depending on the type of the datasource. Ideally, |
{ | ||
fWriter.reset(); | ||
} | ||
fLoopManager->SetDataSource(std::make_unique<ROOT::Experimental::RNTupleDS>(fNTupleName, fFileName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's been a while, but why are we replacing the data source? I don't see this done for the existing SnapshotTTreeHelper
, but maybe it is hidden somewhere because of how TTree is special in RDF...
I'm asking because I guess this one / the reason for the compatibility matrix in the PR summary. Also I'm not sure if this change of the data source is actually sound, I think with some Redefine
s I can construct cases where filters are not idempotent and thus a snapshot + replacing the data source might change results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To answer myself: Snapshot
directly calls CreateLMFromTTree
with a non-existing file name that is then later written to during the action. See also Vincenzo's comment https://github.com/root-project/root/pull/15750/files#r1741566386
if (outObj->InheritsFrom("TTree")) { | ||
static_cast<TTree *>(outObj)->Delete("all"); | ||
} else { | ||
outFile->Delete(ntupleName.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here and other outFile->Delete
calls you might mean:
outFile->Delete(ntupleName.c_str()); | |
outFile->Delete((ntupleName + ";*").c_str()); |
otherwise the TTree
case is the only one deleting older cycles.
|
||
public: | ||
using ColumnTypes_t = TypeList<ColTypes...>; | ||
SnapshotRNTupleHelper(std::string_view filename, std::string_view ntuplename, const ColumnNames_t &vfnames, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intentionally not support storing the snapshot into a sub-directory?
}; | ||
|
||
#ifdef R__HAS_ROOT7 | ||
void ValidateSnapshotRNTupleOutput(const RSnapshotOptions &opts, const std::string &ntupleName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is missing a doc string and the name seems at odd with the functionality. Most often a validate
routine would return a 'bool' and leave its input unchanged (eg. IsValid
)
This PR adds a first iteration of snapshotting to RNTuple from an RDataFrame. It uses the existing
Snapshot
interface, with an addition toRSnapshotOptions
,kOutputFormat
. This option can be set to write to either TTree, RNTuple, or take the default choice. The table below describes howSnapshot
behaves accoring to the output format option:ESnapshotOutputFormat::kDefault
ESnapshotOutputFormat::kTTree
ESnapshotOutputFormat::kDefault
RNTupleImporter
ESnapshotOutputFormat::kDefault
ESnapshotOutputFormat::kRNTuple
Implementation
As mentioned, the existing
Snapshot
interface is used. A newSnapshotRNTupleHelper
has been created to handle the creation and writing of the RNTuple, akin to the existingSnapshotHelper
(which has been renamed toSnapshotTTreeHelper
for consistency).RLoopManager data source initialization (rev bbf221f)
The snapshot action creates a new loop manager which manages the snapshotted data set. The loop manager gets initialized before the actual snapshotting takes place. Originally, the pointer to the data source owned by the loop manager was marked as
const
. Because the RNTuple's data source has to be created after the loop manager, for this PR theconst
qualifier has been dropped.Move
ROOT::RDF::Experimental::FromRNTuple
(rev 0a29b02)For snapshotting RNTuples, we need to include the header file for RNTupleDS in
ActionHelpers.hxx
. To avoid dependency conflicts related to includingROOT/RDataFrame.hxx
, the freeFromRNTuple
functions have been moved to a separate header.Current limitations and follow-ups
This PR adds the minimal functionality for (single-threaded) snapshotting to RNTuple. A number of follow-ups are foreseen:
RNTuple write options
Currently no RNTuple-specific write options have been added to
RSnapshotOptions
yet, except for compression settings which were already present as an option. Adding (a subset) of the otherRNTupleWriteOptions
is trivial.Default compression settings
RSnapshotOptions
' default compression setting is 101 (Zlib). However, RNTuple's default compression setting is 505 (zstd). We could change the default compression setting tokInherit
and decide which settings to use according to the target data format (unless explicitly set by the user, of course).Multithreaded snapshotting
This PR only adds single-threaded RNTuple snapshotting. Multithreaded (and parallel) snapshotting will be addressed in a follow-up PR.
Tests
Corresponding
roottest
PR: root-project/roottest#1178Tests for Windows have been disabled, due to permission denied-errors related to trying to recreate currently open TFiles. The regular snapshot tests have also been disabled for Windows, presumably for the same reason.