From 01d83fa76a6686f17223e05ba02e1bd232799903 Mon Sep 17 00:00:00 2001
From: Daniel Kavan
Date: Mon, 2 Aug 2021 11:10:49 +0200
Subject: [PATCH] #99 atum sdk s3 extension examples README.md, typos for core examples README.md

---
 examples/atum-examples/README.md             |  6 +-
 examples/s3-sdk-extension-examples/README.md | 60 ++++++++++++++++++++
 2 files changed, 63 insertions(+), 3 deletions(-)
 create mode 100644 examples/s3-sdk-extension-examples/README.md

diff --git a/examples/atum-examples/README.md b/examples/atum-examples/README.md
index 7c794338..75f65a62 100644
--- a/examples/atum-examples/README.md
+++ b/examples/atum-examples/README.md
@@ -1,7 +1,7 @@
 # Atum Spark Job Application Example
 
 This is a set of Atum Apache Spark Applications that can be used as inspiration for creating other
-Spark projects. It includes all dependencies in a 'fat' jar to run the job locally and on cluster.
+Spark projects. It includes all dependencies in a 'fat' jar to run the job locally and on a cluster.
 
 Here is the list of examples (all from `za.co.absa.atum.examples` space):
 
@@ -27,7 +27,7 @@ mvn package -DskipTests=true
 ```
 
 ## Scala and Spark version switching
-Same as Atum itself, the example project also support switching to build with different Scala and Spark version:
+Same as Atum itself, the example project also supports switching to build with different Scala and Spark versions:
 
 Switching Scala version (2.11 or 2.12) can be done via
 ```shell script
@@ -45,7 +45,7 @@ mvn clean install -Pspark-3.1
 ## Running via spark-submit
 
 After the project is packaged you can copy `target/2.11/atum-examples_2.11-0.0.1-SNAPSHOT.jar`
-to an edge node of a cluster and use `spark-submit` to run the job. Here us an example when running on Yarn:
+to an edge node of a cluster and use `spark-submit` to run the job. Here is an example when running on Yarn:
 
 ```shell script
 spark-submit --master yarn --deploy-mode client --class za.co.absa.atum.examples.SampleMeasurements1 atum-examples_2.11-0.0.1-SNAPSHOT.jar
diff --git a/examples/s3-sdk-extension-examples/README.md b/examples/s3-sdk-extension-examples/README.md
new file mode 100644
index 00000000..dc50cd43
--- /dev/null
+++ b/examples/s3-sdk-extension-examples/README.md
@@ -0,0 +1,60 @@
+# SDK-S3 Atum Spark Job Application Example
+
+This is a set of Atum Apache Spark Applications (using the SDK S3 Atum Extension) that can be used as inspiration for creating other
+Spark projects. It includes all dependencies in a 'fat' jar to run the job locally and on a cluster.
+
+- `SampleSdkS3Measurements{1|2}` - Example apps using the Atum SDK S3 Extension to show the Atum initialization,
+checkpoint setup and the resulting control measure handling (in the form of an `_INFO` file produced by the job and landing in AWS S3); see the sketch below
+
+## Usage
+
+The example applications are in the `za.co.absa.atum.examples` package. The project contains build files for `Maven`.
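+
+A rough sketch of the shape of such a job is shown below. It follows the control-measurement calls from the core Atum README
+(`enableControlMeasuresTracking`, `setCheckpoint`); the SDK S3 extension swaps the initialization for an S3-aware variant,
+exact signatures may differ between Atum versions, and all paths and names here are illustrative rather than taken from the example sources:
+
+```scala
+// Illustrative sketch only -- not the actual SampleSdkS3Measurements source.
+import org.apache.hadoop.fs.FileSystem
+import org.apache.spark.sql.{SaveMode, SparkSession}
+import za.co.absa.atum.AtumImplicits._
+
+object SampleMeasurementsSketch {
+  def main(args: Array[String]): Unit = {
+    val spark = SparkSession.builder().appName("Sample Measurements sketch").getOrCreate()
+
+    // Some Atum versions resolve the Hadoop filesystem for reading/writing _INFO files via an implicit.
+    implicit val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
+
+    // Hook Atum into the Spark session and point it at the input _INFO file (illustrative path).
+    spark.enableControlMeasuresTracking(sourceInfoFile = "data/input/wikidata.csv.info")
+      .setControlMeasuresWorkflow("Job 1")
+
+    spark.read
+      .option("header", "true")
+      .csv("data/input/wikidata.csv")        // illustrative input
+      .setCheckpoint("Source")               // control measurements are recorded at this point
+      .write.mode(SaveMode.Overwrite)
+      .parquet("data/output/stage1_results") // an _INFO file is written alongside the output
+
+    spark.disableControlMeasuresTracking()
+    spark.stop()
+  }
+}
+```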
+
+## Maven
+**To build an uber jar to run on a cluster**
+```shell script
+mvn package -DskipTests=true
+```
+
+## Scala and Spark version switching
+Same as Atum itself, the example project also supports switching to build with different Scala and Spark versions:
+
+Switching the Scala version (2.11 or 2.12) can be done via
+```shell script
+mvn scala-cross-build:change-version -Pscala-2.11 # this is default
+# or
+mvn scala-cross-build:change-version -Pscala-2.12
+```
+
+To choose the Spark version to build with, there are `spark-2.4` and `spark-3.1` profiles:
+```shell script
+mvn clean install -Pspark-2.4 # this is default
+mvn clean install -Pspark-3.1
+```
+
+## Running Requirements
+Since these example apps work with S3 resources, a number of environment prerequisites must be met
+for the code to be runnable (a combined setup sketch is shown at the end of this README). Namely:
+ - having an AWS profile named `saml` in `~/.aws/credentials`
+ - having your bucket defined in `TOOLING_BUCKET_NAME` and your KMS Key ID in `TOOLING_KMS_KEY_ID`
+ (the examples are written to enforce AWS-KMS server-side encryption)
+
+## Running via spark-submit
+
+After the project is packaged you can copy `target/2.11/atum-examples-s3-sdk-extension_2.11-0.0.1-SNAPSHOT.jar`
+to an edge node of a cluster and use `spark-submit` to run the job. Here is an example when running on Yarn:
+
+```shell script
+spark-submit --master yarn --deploy-mode client --class za.co.absa.atum.examples.SampleSdkS3Measurements1 atum-examples-s3-sdk-extension_2.11-0.0.1-SNAPSHOT.jar
+```
+
+### Running Spark Applications in local mode from an IDE
+If you try to run the example from an IDE, you'll likely get the following exception:
+```Exception in thread "main" java.lang.NoClassDefFoundError: scala/Option```
+
+This is because the jar is created with all Scala and Spark dependencies removed (using the shade plugin). This is done so that the uber jar for `spark-submit` is not too big.
+
+There are multiple options to deal with this, namely:
+ - use the test runner class; for the `SampleSdkS3Measurements` apps it is `SampleMeasurementsS3RunnerExampleSpec` (provided dependencies will be loaded for tests)
+ - use the _Include dependencies with "Provided" scope_ option in the Run Configuration in IntelliJ IDEA, or the equivalent in your IDE.
+ - change the scope of `provided` dependencies to `compile` in the POM file and run the Spark applications as normal JVM apps.
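+
+### Putting it together
+To tie the Running Requirements and `spark-submit` sections together, here is a rough end-to-end sketch of the shell setup;
+the profile name and the two variable names come from this README, while every value shown is a placeholder to replace with your own:
+
+```shell script
+# ~/.aws/credentials needs a profile named "saml" (placeholder values):
+# [saml]
+# aws_access_key_id     = <access key>
+# aws_secret_access_key = <secret key>
+# aws_session_token     = <session token, if one is issued by your login tooling>
+
+# Bucket and KMS key used by the examples (placeholders):
+export TOOLING_BUCKET_NAME="<your bucket name>"
+export TOOLING_KMS_KEY_ID="<your KMS key ID>"
+
+spark-submit --master yarn --deploy-mode client \
+  --class za.co.absa.atum.examples.SampleSdkS3Measurements1 \
+  atum-examples-s3-sdk-extension_2.11-0.0.1-SNAPSHOT.jar
+```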