Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hash join and more support for strings value types in kernel functions #926

Conversation

saminbassiri
Copy link
Contributor

Add Hash Join and enhance kernel functions FilterRow and EwBinary to handle String Value Types.

Description:

This pull request introduces enhancements to the system by adding a new hashJoin operation and extending support for string value types in kernel functions.

Key Changes:

  1. Hash Join Implementation:

    • Added hashJoin as a kernel function.
    • Integrated hashJoin as a built-in function in DaphneDSL.
    • Created HashJoinOp in DaphneIR.
  2. String Value Support:

    • Enhanced kernel functions to handle string values.
    • Updated FilterRow kernel function for compatibility with string values.
    • Introduced EwBinary operations to compare char* strings with std::string .
    • Updated kernel.json to reflect the new string value support.

Next Steps:

Add tests to validate correct behavior across various scenarios.

@pdamme pdamme self-requested a review December 3, 2024 21:32
pdamme pushed a commit that referenced this pull request Dec 12, 2024
- EwBinaryOpCode: Updated specification which elementwise binary ops are supported on which string value types (std::string, FixedStr16, const char *).
  - Needed to generalize some macros from one common value type for both operands to individual value types, e.g., to support "matrix of std::string op string scalar".
- Fixed a small type bug in EwBinaryObjSca-kernel.
- ExtractRow- and FilterRow-kernels on frames work with frames with std::string columns.
- Added instantiations of EwBinaryObjSca-kernel for comparisons of string matrices with string scalar (integer result).
- Order- and Group-kernels can handle frames with std::string columns.
- These changes were originally included in PRs #918, #921, and #926 by @saminbassiri.
- @pdamme is committing them in the name of @saminbassiri (for correct attribution) in a separate commit, since they don't really fit the topics of those PRs.
@pdamme
Copy link
Collaborator

pdamme commented Dec 12, 2024

I committed the changes not directly related to the topic of this PR in commit a374814.

saminbassiri and others added 3 commits December 12, 2024 20:39
- Add hash join kernel function.
- Add hashJoin built-in function to DaphneDSL.
- Create HashJoinOp in DaphneIR.
- Instead of adding the hash-join throughout the DAPHNE stack (DaphneDSL built-in function, DaphneIR op plus inference etc., runtime kernel), we should use the hash-join code as the implementation of the innerJoin-kernel.
- So far, the innerJoin-kernel was anyway extremely inefficient and had numerous memory leaks related to Frame::getColumn().
- Consequently, I removed the hashJoin built-in/op/kernel and move the kernel code to the innerJoin-kernel.
- Also reverted changes not directly related to this PR, they have already been contributed to main in commit a374814.
…-kernel.

- Fixed some memory leaks: matrices obtained through Frame::getColumn() must be destroyed explicitly.
- Ensured that not only n:1-joins but also n:m-joins are supported (modified the existing unit test case for the innerJoin-kernel to trigger such a case).
- Added license header (was already missing before this PR).
@pdamme pdamme force-pushed the Add-hashJoin-and-more-support-for-strings-value-types branch from c2f9cec to 39ee4d5 Compare December 12, 2024 20:08
Copy link
Collaborator

@pdamme pdamme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution @saminbassiri. A hash-based join implementation was long overdue in DAPHNE. You added the hash-join throughout the DAPHNE stack (DaphneDSL built-in function, DaphneIR op plus inference, runtime kernel).

While this integration looks good from a technical point of view, I think we should not expose the physical join variant to users. Instead, it's much more useful to replace the existing innerJoin-kernel (which was very inefficient and had memory leaks) by your hash-based implementation. Consequently, I removed the new hashJoin() built-in and HashJoinOp.

I fixed a few little things in the hash-join code: memory leaks related to Frame::getColumn() and made the kernel support n:m-joins (by using a std::unordered_map<VT, std::vector<size_t>> for the hash table). See my commits for details.

With that, this PR is ready to be merged.

@pdamme pdamme merged commit bd1f04b into daphne-eu:main Dec 12, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants