-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using .
as a delimiter in field names of ak.Records
leads to wrong report of necessary columns
#552
Comments
Yep, it's a problem. The fields should probably always be tuples of strings everywhere. @jpivarski suggested using The complexity comes in how we can pass these to the backend column selection, and if/where "*" is allowed. For example, pyarrow does accept and expect "." delimited strings. All of this complexity needs to be encoded in the encoding of whatever the backend gives to form-keys and back again (at load selection time). We care abot parquet, root and JSON, which each have quirks. |
(it's worth mentioning that most parquet frameworks prevent field names with various special characters, but this is not actually part of the spec) |
A good convention could be to set form keys to a tuple representing a path from root to each node (easy to set up in a recursive algorithm), with strings representing RecordArray field names, integers for tuple and UnionArray slots, and None for all other nesting (lists, option-types, indexed arrays). The path wouldn't encode the specifics of which type of nodes are at each level—you have the form for that—but it would indicate which branch you took. Form keys have to be strings, so you could take the repr, which can be reconstituted with ast.literal_eval. >>> tuple_of_things = ("fieldname", 2, None, "another")
>>> form_key = repr(tuple_of_things)
>>> form_key
"('fieldname', 2, None, 'another')"
>>> tuple_again = ast.literal_eval(form_key)
>>> tuple_again
('fieldname', 2, None, 'another') If you only cared about uniqueness, there's already a utility for that in Awkward: Content.form_with_key. But this came up because @pfackeldey wants to go from a specific node to the nested field names that lead to that node. If the form keys were only unique and the form is available, it would be possible to find that path, but only by doing a recursive tree-search every time. It's nice to have this path encoded in the form key itself so that you don't have to do a search. Speaking of which, that might be better as an Awkward utility, alongside Content.form_with_key. |
Maybe this would be useful? scikit-hep/awkward#3311 If so, it needs some tests. |
Using
.
within field names ofak.Records
is a valid thing to do..
is also commonly used in parquet's schemas or ROOT TBranch names, which are translated directly into field names of RecordArrays.Currently, dask-awkward reports the same necessary columns for the following two cases:
Maybe alternatively the
frozenset
should contain("pt", "foo")
in the first case ("nested record") and("pt.foo",)
in the second case ("field name with.
"), so these cases are ambiguous?The text was updated successfully, but these errors were encountered: