Change dot notation in add column documentation to tuple #1433

Open: wants to merge 1 commit into base: main

jeppe-dos

A tuple must be used to add columns inside structs, as described in add_column:
"Because "." may be interpreted as a column path separator or may be used in field names, it is not allowed to add nested column by passing in a string. To add to nested structures or to add fields with names that contain "." use a tuple instead to indicate the path."
This PR corrects the documentation to use tuples instead of dot notation.

From issue 1407
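
To illustrate the ambiguity the quoted text describes, here is a minimal sketch in plain Python. The helper `to_path` and its error message are hypothetical and are not part of pyiceberg; the sketch only shows why a dotted string cannot be resolved safely while a tuple can.

```python
from typing import Tuple, Union

ColumnPath = Union[str, Tuple[str, ...]]


def to_path(column: ColumnPath) -> Tuple[str, ...]:
    """Turn a column reference into an explicit tuple path.

    Hypothetical helper for illustration only; not part of pyiceberg.
    """
    if isinstance(column, str):
        if "." in column:
            # "a.b" could mean field "b" inside struct "a", or a single
            # field literally named "a.b", so refuse to guess.
            raise ValueError(f"Ambiguous column name {column!r}; pass a tuple instead")
        return (column,)
    return tuple(column)


# A plain name is unambiguous and becomes a one-element path.
print(to_path("retries"))
# A tuple states the nesting explicitly.
print(to_path(("details", "confirmed_by")))
# With a tuple, a field name may itself contain a dot.
print(to_path(("details", "confirmed.by")))
```

With a tuple, each element is one field name, so no separator has to be interpreted.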

Contributor

@Fokko Fokko left a comment

Thanks @jeppe-dos for fixing this 🙌

Contributor

@kevinjqliu kevinjqliu left a comment

Looks like there might be a bug with this change. I tried to follow the docs:

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except Exception:  # nothing to drop if the table does not exist yet
    pass

table = catalog.create_table("default.locations", schema)

# with table.update_schema() as update:
#     # In a struct
#     update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

with table.update_schema() as update:
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

This raises:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 192, in add_column
    parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/schema.py", line 215, in find_field
    raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
ValueError: Could not find field with name details, case_sensitive=True

@kevinjqliu
Contributor

Here's where the error happens:

name = path[-1]
parent = path[:-1]
full_name = ".".join(path)
parent_full_path = ".".join(parent)
parent_id: int = TABLE_ROOT_ID
if len(parent) > 0:
    parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)

And some debugging statements:

(Pdb) path
('details', 'confirmed_by')
(Pdb) name
'confirmed_by'
(Pdb) parent
('details',)
(Pdb) parent_full_path
'details'
(Pdb) parent_id
-1
(Pdb) len(parent) > 0
True
parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)

is where it errors.

Seems like we're missing the case where no "parent" is present

@jeppe-dos
Author

Yes, the struct has to exist before you can add a field to it. The code could be changed to create the parent automatically. For now, this is described in the documentation changes. Should I state it more explicitly?
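
For illustration, the parent lookup from the snippet above can be sketched in plain Python. This is a toy model, not the pyiceberg implementation: `resolve_parent`, the `field_ids` dictionary, and the field id 4 are assumptions made up for the example. Only the path splitting and `TABLE_ROOT_ID = -1` mirror what the debugger output shows.

```python
from typing import Dict, Tuple

TABLE_ROOT_ID = -1  # the value pdb printed for parent_id above


def resolve_parent(path: Tuple[str, ...], field_ids: Dict[str, int]) -> Tuple[str, int]:
    """Return (new field name, parent id) for a tuple path.

    Toy stand-in for the parent lookup; field_ids maps existing
    full field names to ids and is hypothetical.
    """
    name = path[-1]
    parent = path[:-1]
    parent_full_path = ".".join(parent)
    if len(parent) == 0:
        # Top-level column: it hangs directly off the table root.
        return name, TABLE_ROOT_ID
    if parent_full_path not in field_ids:
        # Same failure mode as in the traceback: the parent struct is unknown.
        raise ValueError(f"Could not find field with name {parent_full_path}")
    return name, field_ids[parent_full_path]


field_ids = {"city": 1, "lat": 2, "long": 3}

# The parent struct "details" does not exist yet, so the lookup fails.
try:
    resolve_parent(("details", "confirmed_by"), field_ids)
except ValueError as exc:
    print(exc)

# Stand-in for adding the "details" struct column first (id 4 is arbitrary).
field_ids["details"] = 4
print(resolve_parent(("details", "confirmed_by"), field_ids))
```

In this model the nested add only succeeds once the parent is present, which matches the ordering requirement described here.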

@kevinjqliu
Contributor

kevinjqliu commented Dec 20, 2024

Yes, the struct has to exist before you can insert anything into it.

Ah, I see, that makes sense. In that case, can we edit the example so that it works out of the box?

Also, I think it's valuable to move the comment to the top-level docs of "Add Column". We can include both the details about dot notation and the requirement that the parent struct exists.

@kevinjqliu
Contributor

I found another use of dot notation in Move column. Do we need to change this too?
https://py.iceberg.apache.org/api/#move-column
