In many dataspaces we need to analyse IoT data at the edge. We also call this approach compute2data or zero-copy.
For regulatory reasons we don't want to send big data sets or raw streaming data to the cloud. The EDC connector is already able to do this, as this Kafka example shows.
The aim is to process data where it resides rather than moving large datasets to separate compute resources. This improves cost efficiency and performance by reducing data movement, and it enhances data security and privacy by keeping sensitive data in its original location.
The Manufacturing-X family in particular aims to implement a federated, decentralized and collaborative data ecosystem for smart manufacturing. It recognizes that deployments need customization across an infrastructure continuum from cloud to edge, depending on the application. Manufacturing-X aims to address cross-industry use cases based on the collaborative use of data, which would likely involve integrating cloud and edge computing capabilities. Edge computing is seen as a key trend reshaping manufacturing, allowing data processing closer to the source and enabling real-time insights. It enables manufacturers to filter data, reduce server overload, and perform local data analysis in real time.
So why do we need Apache Iceberg table support?
Apache Iceberg is emerging as a quasi-standard for many popular data platforms in cloud environments, offering features that enhance data management and analytics at scale.
It has become a quasi-standard across most major data platforms:
- Snowflake: specializes in cloud-based data warehousing
- Databricks: excels in real-time data processing and machine learning; built on Apache Spark, enabling large-scale data processing
- Google Cloud has announced Iceberg support for BigLake
- Amazon Web Services (AWS) mentions Iceberg as a solution for transactional data lakes
- Cloudera offers Iceberg as part of its open data lakehouse solution
- Microsoft has announced plans to support Iceberg in OneLake in Fabric
- Iceberg is designed to work with various data processing engines and storage systems, including Spark, Trino, Flink, Presto, Hive and Impala
Large technology companies like Netflix and Apple were involved in Iceberg's creation, and it is being deployed by some of the largest technology companies. Here you can find an article on what Apache Iceberg means for the data community. An Iceberg extension would help to increase the acceptance of the EDC connector and of data spaces in general.
Apache Iceberg provides features such as time travel and incremental queries that enable efficient data processing where the data is stored, which can be seen as a way of bringing compute closer to the big data.
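As a minimal sketch of those two features, using the Iceberg Java API: the table name `iot.sensor_readings` and the warehouse path are illustrative placeholders, and the table is assumed to already have at least one snapshot.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;

public class IcebergTimeTravelDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative warehouse path and table name.
        try (HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://warehouse")) {
            Table table = catalog.loadTable(TableIdentifier.of("iot", "sensor_readings"));

            // Time travel: read the table as it was at its first snapshot.
            long firstSnapshot = table.history().get(0).snapshotId();
            try (CloseableIterable<Record> rows =
                     IcebergGenerics.read(table).useSnapshot(firstSnapshot).build()) {
                rows.forEach(System.out::println);
            }

            // Incremental query: only records appended after that snapshot.
            try (CloseableIterable<Record> delta =
                     IcebergGenerics.read(table).appendsAfter(firstSnapshot).build()) {
                delta.forEach(System.out::println);
            }
        }
    }
}
```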
Integrating Apache Iceberg as an extension into EDC would require custom development, as there is no out-of-the-box integration between the two technologies. Such an integration would allow EDC to use Iceberg tables as data sources and sinks, leveraging Iceberg features like schema evolution and time travel within the context of data spaces.
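Schema evolution in particular is cheap in Iceberg because it is a metadata-only operation. A hedged sketch, reusing the illustrative catalog and table names from above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class AddColumnDemo {
    public static void main(String[] args) throws Exception {
        try (HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://warehouse")) {
            Table table = catalog.loadTable(TableIdentifier.of("iot", "sensor_readings"));
            // Metadata-only change: no data files are rewritten, and older
            // snapshots stay readable under the schema they were written with.
            table.updateSchema()
                 .addColumn("firmware_version", Types.StringType.get())
                 .commit();
        }
    }
}
```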
This workshop shows how to set up a local S3-compatible data lake with MinIO to store the IoT data. It then installs a single-node Apache Iceberg processing engine and lays the groundwork for supporting our Apache Iceberg tables and catalog.
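One hedged way to wire an Iceberg catalog to such a MinIO-backed lake is shown below. The endpoint, credentials and warehouse path are local-development placeholders (MinIO's default root credentials on port 9000), and an Iceberg REST catalog service on port 8181 is an assumption of this sketch, not something the workshop prescribes:

```java
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.rest.RESTCatalog;

public class LakeCatalogConfig {
    public static RESTCatalog connect() {
        var catalog = new RESTCatalog();
        catalog.initialize("lake", Map.of(
                CatalogProperties.URI, "http://localhost:8181",          // assumed REST catalog service
                CatalogProperties.WAREHOUSE_LOCATION, "s3://warehouse/",
                CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO",
                "s3.endpoint", "http://localhost:9000",                   // local MinIO
                "s3.access-key-id", "minioadmin",                         // MinIO defaults
                "s3.secret-access-key", "minioadmin",
                "s3.path-style-access", "true"));
        return catalog;
    }
}
```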
Implementation steps in EDC with Iceberg (the source and sink are sketched in code after the list):
- Create a custom DataSource that exposes Iceberg table data through EDC's API
- Implement query capabilities to allow consumers to request specific data
- Use EDC's contract negotiation process to manage access rights
- On the consumer side, implement a DataSink that either pulls data and writes it to Iceberg tables (#PullTransfer) or writes received data to Iceberg tables (#PushTransfer), depending on the transfer methods supported by the dataspace protocol
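A minimal, hypothetical sketch of the provider-side source, assuming the data-plane SPI types `DataSource`, `DataSource.Part` and `StreamResult` from `org.eclipse.edc.connector.dataplane.spi.pipeline`. Exact signatures differ between EDC releases, so treat this as an outline rather than drop-in code:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.stream.Stream;

import org.apache.iceberg.Table;
import org.apache.iceberg.data.IcebergGenerics;
import org.eclipse.edc.connector.dataplane.spi.pipeline.DataSource;
import org.eclipse.edc.connector.dataplane.spi.pipeline.StreamResult;

// Hypothetical provider-side source: scans an Iceberg table in place
// (compute2data) and exposes the result as a single EDC Part.
public class IcebergDataSource implements DataSource {

    private final Table table; // resolved from the asset's DataAddress

    public IcebergDataSource(Table table) {
        this.table = table;
    }

    @Override
    public StreamResult<Stream<Part>> openPartStream() {
        return StreamResult.success(Stream.of(new IcebergPart(table)));
    }

    @Override
    public void close() {
        // nothing to release here; the scan is opened per part
    }

    private record IcebergPart(Table table) implements Part {
        @Override
        public String name() {
            return table.name();
        }

        @Override
        public InputStream openStream() {
            // Naive serialization of the scan as text lines; a real
            // extension would stream Parquet/Arrow instead of buffering.
            var sb = new StringBuilder();
            try (var rows = IcebergGenerics.read(table).build()) {
                rows.forEach(r -> sb.append(r).append('\n'));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            return new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

A matching DataSink on the consumer side would mirror this shape: collect the received parts, convert them into Iceberg data files, and commit them with the table's `newAppend()...commit()` API.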
We had already discussed compute2data in an upstream discussion.
More information about Iceberg can be found here: https://iomete.com/the-ultimate-guide-to-apache-iceberg