Hdfs support #767

Closed
wants to merge 3 commits into from

aristotelis96
Collaborator

This PR introduces support for Hadoop Distributed File System (HDFS).

  • Added the dependency libhdfs3 (a native C/C++ HDFS client).
  • Updated the container.
  • Added kernels for reading from and writing to HDFS in the Daphne binary format and in CSV format.
  • Added distributed kernels to support parallel read/write for the distributed runtime (only synchronous gRPC for now).
  • Added CLI arguments (and updated the config file).
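
For readers unfamiliar with parallel reads: one common way to let several workers read a single file concurrently is to hand each worker a contiguous, non-overlapping range of rows. The sketch below only illustrates that idea and is not taken from this PR. `RowRange` and `partitionRows` are hypothetical names, and the PR's distributed kernels may split the work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open row range [begin, end) assigned to one worker.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper (not from the PR): split numRows rows into numWorkers
// contiguous ranges whose sizes differ by at most one row.
std::vector<RowRange> partitionRows(std::size_t numRows, std::size_t numWorkers) {
    assert(numWorkers > 0 && "need at least one worker");
    std::vector<RowRange> ranges;
    ranges.reserve(numWorkers);
    const std::size_t base = numRows / numWorkers;
    const std::size_t extra = numRows % numWorkers; // the first `extra` workers get one more row
    std::size_t begin = 0;
    for (std::size_t w = 0; w < numWorkers; ++w) {
        const std::size_t len = base + (w < extra ? 1 : 0);
        ranges.push_back({begin, begin + len});
        begin += len;
    }
    return ranges;
}
```

With 10 rows and 3 workers, this assigns rows [0, 4), [4, 7) and [7, 10), so each worker could read its own slice independently.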

CMakeLists.txt (outdated, resolved)
Collaborator

@corepointer corepointer left a comment

This feature is looking good in general. A few things could be improved (see code comments). I did not test-run the code but tried compilation of the dependencies to check support in the Docker images.
Please make the integration optional like CUDA or MPI.

build.sh (resolved)
src/runtime/local/io/HDFS/ReadHDFS.h (resolved)
@aristotelis96
Collaborator Author

Thank you for the valuable feedback @corepointer. I will address the comments and make the HDFS build optional. 👍

corepointer added a commit to corepointer/daphne that referenced this pull request Sep 12, 2024
This commit adds the necessary packages for the container scripts and code in the build script to build the dependencies for HDFS support.
corepointer added a commit to corepointer/daphne that referenced this pull request Sep 12, 2024
This commit adds the necessary packages for the container scripts and code in the build script to build the dependencies for HDFS support.
This commit caters to the ongoing discussion in GH issue daphne-eu#825 and changes the Docker container scripts to build upon Ubuntu 24.
corepointer added a commit to corepointer/daphne that referenced this pull request Sep 13, 2024
This commit adds the necessary packages for the container scripts and code in the build script to build the dependencies for HDFS support.

Co-authored-by: Mark Dokter <[email protected]>
corepointer added a commit to corepointer/daphne that referenced this pull request Sep 14, 2024
This commit adds the necessary packages for the container scripts and code in the build script to build the dependencies for HDFS support.

Co-authored-by: Mark Dokter <[email protected]>
@aristotelis96
Collaborator Author

Thank you @corepointer again for your review. I've added an optional --hdfs build argument similar to CUDA and MPI.
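
As an illustration of what an optional integration can look like on the C++ side, here is a minimal sketch that is not taken from this PR. It assumes the build flag ends up defining a preprocessor macro, here called `USE_HDFS`, and that HDFS paths are recognized by an `hdfs://` prefix. Both are assumptions, as are the names `Backend` and `selectBackend`.

```cpp
#include <stdexcept>
#include <string>

// USE_HDFS is an assumed macro name; the idea is that an optional build
// flag defines it only when the HDFS dependencies were compiled in.
#ifdef USE_HDFS
constexpr bool kHdfsEnabled = true;
#else
constexpr bool kHdfsEnabled = false;
#endif

enum class Backend { LocalFile, Hdfs };

// Hypothetical dispatcher: choose an I/O backend from the path and fail with
// a clear message if HDFS is requested in a build without HDFS support.
Backend selectBackend(const std::string &path) {
    const std::string scheme = "hdfs://"; // assumed convention for HDFS paths
    const bool isHdfsPath = path.compare(0, scheme.size(), scheme) == 0;
    if (!isHdfsPath)
        return Backend::LocalFile;
    if (!kHdfsEnabled)
        throw std::runtime_error("HDFS path given, but this build has no HDFS support");
    return Backend::Hdfs;
}
```

Guarding the feature at compile time keeps the extra dependencies out of default builds, while the runtime check gives users an explicit error instead of a missing-symbol failure.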

corepointer added a commit to corepointer/daphne that referenced this pull request Sep 18, 2024
This commit adds the necessary packages for the container scripts and code in the build script to build the dependencies for HDFS support.

Co-authored-by: Mark Dokter <[email protected]>
psomas and others added 2 commits September 18, 2024 20:20
This commit adds the necessary packages for the container scripts and code in the build script to build the dependencies for HDFS support.

Co-authored-by: Mark Dokter <[email protected]>
This commit adds initial support to read and write files from Hadoop Filesystems in distributed mode. Besides the read, write and distributed functionality, this also contains new configuration options and a new context object to manage the connection information to the distributed filesystem. Finally, this feature requires the installation of more external dependencies. The compilation is therefore optional and can be activated with the --hdfs flag to build.sh.

Closes daphne-eu#767

Co-authored-by: KostasBitsakos <[email protected]>
Co-authored-by: Mark Dokter <[email protected]>
@corepointer
Collaborator

corepointer commented Sep 18, 2024

LGTM - thx for your contribution @aristotelis96 @psomas @KostasBitsakos

  • I had to fix a few things to make everything compile and squashed the changes into two commits, one for the dependencies and one for the feature.
  • The changes for the Docker scripts should have gone to main before merging the PR - this is why these changes now show up as part of the PR. The final result is the same, just the presented changes here might be confusing.
  • The PR should be marked as "merged" and not "closed" - don't know why that sometimes goes wrong.

@corepointer
Collaborator

Test cases would be nice ;-)
I could imagine an automated setup of a mini single-node cluster that reads from localhost [1].
Might be a nice student project :D

[1] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

@corepointer corepointer added the labels feature (missing/requested features) and Distributed (Issues and PRs related to distributed computation) Sep 18, 2024
@corepointer corepointer added this to the v0.4 milestone Sep 18, 2024
This was referenced Oct 1, 2024

3 participants