Count rows as a metadata-only operation #1388
```diff
@@ -1227,6 +1227,9 @@ def filter(self: S, expr: Union[str, BooleanExpression]) -> S:
     def with_case_sensitive(self: S, case_sensitive: bool = True) -> S:
         return self.update(case_sensitive=case_sensitive)
 
+    @abstractmethod
+    def count(self) -> int: ...
+
 
 class ScanTask(ABC):
     pass
```
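For context, a hedged usage sketch of the API this hunk introduces: the catalog and table names below are placeholders, and `count()` is the new method the PR adds to the scan.

```python
from pyiceberg.catalog import load_catalog

# Illustrative usage; the catalog and table names are placeholders.
catalog = load_catalog("default")
table = catalog.load_table("examples.events")

# With this PR, count() is answered from manifest metadata
# (per-file record counts) instead of reading data files.
num_rows = table.scan().count()
print(num_rows)
```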
```diff
@@ -1493,6 +1496,13 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
         return ray.data.from_arrow(self.to_arrow())
 
+    def count(self) -> int:
+        res = 0
+        tasks = self.plan_files()
+        for task in tasks:
+            res += task.file.record_count
+        return res
+
 
 @dataclass(frozen=True)
 class WriteTask:
```

Reviewer: I love this approach! My only concern is about loading too much data into memory at once; although this is loading just one file at a time, in the worst case some file could potentially be very large. Shall we define a threshold and check against it, for example along the lines of https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1541-L1564?
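The reviewer's suggestion appears to be a size guard with a batched fallback. Below is a minimal sketch of that idea; `SIZE_THRESHOLD`, the file-size check, and falling back to `to_arrow_batch_reader()` are all assumptions for illustration, not code from the PR.

```python
# Hypothetical sketch of the reviewer's threshold idea; SIZE_THRESHOLD
# and the fallback policy are assumptions, not part of the PR.
SIZE_THRESHOLD = 512 * 1024 * 1024  # 512 MiB, arbitrary

def count(self) -> int:
    tasks = list(self.plan_files())
    if any(task.file.file_size_in_bytes > SIZE_THRESHOLD for task in tasks):
        # Stream record batches so no single large file is fully
        # materialized in memory while counting.
        return sum(batch.num_rows for batch in self.to_arrow_batch_reader())
    # Otherwise answer purely from manifest metadata.
    return sum(task.file.record_count for task in tasks)
```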
@kevinjqliu: This count will not be accurate when there are delete files.
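One conservative way to handle this (an illustration, not what the PR does): `FileScanTask.delete_files` is non-empty when a file has merge-on-read deletes, so the metadata count can refuse to answer rather than return a wrong number.

```python
# Hypothetical fail-fast variant: metadata-only counting is only exact
# when no planned file carries positional/equality delete files.
def count(self) -> int:
    res = 0
    for task in self.plan_files():
        if task.delete_files:
            raise ValueError(
                f"Cannot count from metadata: {task.file.file_path} "
                "has merge-on-read deletes"
            )
        res += task.file.record_count
    return res
```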
Author: Hi @kevinjqliu, thank you for the review. I am trying to account for positional deletes; do you have a suggestion on how that can be achieved?
@Fokko: Yes, this can be wildly off, not just because of the merge-on-read deletes, but because `plan_files` returns all the files that (might) contain relevant rows. For example, if it cannot be determined whether a file has relevant data, it will still be returned by `plan_files`. I think there are two ways forward:

1. Keep the current sum over `task.file.record_count`, but extend it to also check whether there are merge-on-read deletes, as Kevin already mentioned, or just fail when there are positional deletes.
2. Implement the residual-predicate in the `FileScanTask`. When we run a query like `day_added = 2024-12-01 and user_id = 10`, the `day_added = 2024-12-01` part might already be satisfied by the partitioning. This is the case when the table is partitioned by day: we know that all the data in the file evaluates to `true` for `day_added = 2024-12-01`, so we only need to open the file and filter for `user_id = 10`. If the query left out `user_id = 10`, the residual would be `ALWAYS_TRUE`, and then we know that we can just use `task.file.record_count`. This way we could very easily loop over `.plan_files()`; a sketch of such a loop follows below.

To get to the second step, we first have to port the `ResidualEvaluator`. The Java code can be found here, including some excellent tests.
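A sketch of the loop Fokko describes, assuming a ported `ResidualEvaluator` attaches a `residual` expression to each `FileScanTask`; that attribute and the fallback behavior are assumptions about a future API, not code in this PR.

```python
from pyiceberg.expressions import AlwaysTrue

# Hypothetical residual-based count: exact from metadata when the
# residual is ALWAYS_TRUE and there are no merge-on-read deletes.
def count(self) -> int:
    res = 0
    for task in self.plan_files():
        if task.residual == AlwaysTrue() and not task.delete_files:
            # Every row in the file matches the filter; the manifest
            # record count is exact.
            res += task.file.record_count
        else:
            # An exact answer would require scanning this file with the
            # residual filter applied; a real implementation would read
            # the file here instead of raising.
            raise NotImplementedError("data scan required for exact count")
    return res
```

With this shape, `task.file.record_count` is only trusted when the residual proves that every row in the file matches the query.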
Author: Hi @Fokko, I have added the ResidualEvaluator with tests. Now I am trying to create the breaking tests for count, where deletes have occurred and the counts should differ.