idbroker-demo

datalake, role maping, personas and activity

Data Lake Bucket Structure

CLDR-CDP-DL-1
/
|-INPUT
	|- tweets
		|- summary
			|- 2019
				|- 05
					|- 01...
					|- 10
						389457483kkjkljk
|- OUTPUT
	|- shared
		|- tweets
			|- summary
				|- 2019
					|- 05
						|- 01...
						|- 10
							389457483kkjkljk
		|- sentiments
			|- data
			|- logs
		|- financial-trends
			|- data
			|- logs
		|- insurance
			|- FL
		|- teams
			|- team-A
				|- data
	|- hive
		|- tables
		|- logs
	|- spark ## needs access to hive data
		|- ???
|- BACKUPS

Roles to Protect Data Lake Structure

INGEST ROLE - read-write INPUT only
SCIENTEST ROLE - read only INPUT - read-write OUTPUT/shared/sentiments&&financial-trends
TEAM-A SCIENTEST ROLE - read only INPUT - read-write OUTPUT/shared/sentiments&&financial-trends+teams/team-A
TABULAR-APP-ROLE (hive) - read-write OUTPUT/shared/** + read-write OUTPUT/hive/**
BACKUP-SOURCE-ROLE
BACKUP-DEST-ROLE

HDF ingest of twitter data

nifi-ingest user: 
Nifi-Ingest
xxxxxxx
XXXXxxxxxXXXXxxxxxXXXXX
given the LJM-CDP-INGEST-ROLE

shell based users

nifi:cloudera with mapping of ingest group -> LJM-CDP-INGEST-ROLE

lmccay:cloudera with mapping of lmccay user -> LJM-CDP-DATA-SCIENTEST

systest:cloudera with mapping of systest user -> LJM-CDP-QE-DATA-SCIENTEST

bdr:cloudera with mapping of backup-source group -> LJM-CDP-BACKUP-SOURCE-ROLE, LJM-CDP-BACKUP-DEST-ROLE

Usecases:

No access without kinit as a mapped identity
Ingest user - Read/Write access to INPUT paths of data lake bucket - no access to OUTPUT paths
- external user with INGESTER Role
- cluster user that maps to INGESTER Role
Data Scientest user - Read access to INPUT paths and Read/Write access to OUTPUT paths
- copy tweets to OUTPUT
- fail to write to INPUT
Backup-source user - Read only access to INPUT and OUTPUT
- distcp specifying specific group mapping to use for backup-source to gs:// bucket
Backup-dest user - write only access to backup location
- distcp specifying specific group mapping to use for backup-dest from gs:// bucket to s3a:// backup location
QE DataScientest - a QE role that has Read access to INPUT and Read/Write to OUTPUT
Specific service (hive) user - Read/Write access to service specific OUTPUT paths
- service user mapping to a Role with Read of OUTPUT and Read/Write to Hive
- show policy file
Hive use of IDBroker
- SQL queries against florida insurance data
- SQL queries against HDF flown tweets
Spark use of IDBroker
- simple word count of insurance data

Personas - separate tabs in terminal

systest - Data scientist - ETL (as systest)
BDR - backup source - readonly (as bdr)
lmccay - Data scientist - hive user (as hive)
sparky - Data scientist - spark user (as sparky)
nifi - Ingest - ingest user read/write to INPUT only (as nifi)

ETL Data Worker (distcp)

hadoop distcp s3a://cldr-cdp-dl-1/INPUT/tweets/summary/2019/05/10/2019-05-10T19-23-32-100-summary-tweets-168746aa-2620-4a76-9b96-10612e4799f2.orc s3a://cldr-cdp-dl-1/OUTPUT/shared/tweets.orc

spark-shell data scientest

val sc.hadoopConfiguration.set("fs.s3a.delegation.token.binding", "org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding")

val sc.hadoopConfiguration.set("fs.s3a.ext.cab.address", "https://karthik-sdx-base-1.vpc.cloudera.com:8444/gateway")

val textFile = spark.read.textFile("s3a://cldr-cdp-dl-1/OUTPUT/shared/insurance/FL/FL_insurance_sample.csv")

textFile.count()

scala> val orcfile = "s3a://cldr-cdp-dl-1/INPUT/tweets/summary/2019/05/10/2019-05-10T19-23-32-100-summary-tweets-168746aa-2620-4a76-9b96-10612e4799f2.orc"
orcfile: String = s3a://cldr-cdp-dl-1/INPUT/tweets/summary/2019/05/10/2019-05-10T19-23-32-100-summary-tweets-168746aa-2620-4a76-9b96-10612e4799f2.orc

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4a467f08

scala> val df = sqlContext.read.format("orc").load(orcfile)
19/05/11 09:17:59 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
df: org.apache.spark.sql.DataFrame = [id: bigint, id_str: string ... 4 more fields]

scala> df.show()
+-------------------+-------------------+--------------------+-------------+----+--------------------+
|                 id|             id_str|          created_at| timestamp_ms|lang|                text|
+-------------------+-------------------+--------------------+-------------+----+--------------------+
|1126930588533952512|1126930588533952512|Fri May 10 19:22:...|1557516164321|  en|RT @NSA_QIL: http...|
|1126930588257198081|1126930588257198081|Fri May 10 19:22:...|1557516164255|  en|RT @MauliConty: C...|
|1126930593751687169|1126930593751687169|Fri May 10 19:22:...|1557516165565|  en|RT @DebbieDebbieD...|
|1126930597111160832|1126930597111160832|Fri May 10 19:22:...|1557516166366|  en|[Update: RAW in N...|
|1126930589540593664|1126930589540593664|Fri May 10 19:22:...|1557516164561|  en|@bottomrung @Mich...|
|1126930599145611264|1126930599145611264|Fri May 10 19:22:...|1557516166851|  en|RT @fs0c131y: I t...|
|1126930603151122432|1126930603151122432|Fri May 10 19:22:...|1557516167806|  en|@SeanQuigley87 @P...|
|1126930604010942465|1126930604010942465|Fri May 10 19:22:...|1557516168011|  en|Newcastle-upon-Ty...|
|1126930601662210050|1126930601662210050|Fri May 10 19:22:...|1557516167451|  en|RT @jaffathecake:...|
|1126930612827279360|1126930612827279360|Fri May 10 19:22:...|1557516170113|  en|WHOA! This is hig...|
|1126930614341525504|1126930614341525504|Fri May 10 19:22:...|1557516170474|  en|@photokool6 Hi Sa...|
|1126930616740667393|1126930616740667393|Fri May 10 19:22:...|1557516171046|  en|Catch up on the b...|
|1126930615855611906|1126930615855611906|Fri May 10 19:22:...|1557516170835|  en|RT @coughetycough...|
|1126930615486550016|1126930615486550016|Fri May 10 19:22:...|1557516170747|  en|Gomorrah S.4 comi...|
|1126930614932914176|1126930614932914176|Fri May 10 19:22:...|1557516170615|  en|I see some of my ...|
|1126930617743093760|1126930617743093760|Fri May 10 19:22:...|1557516171285|  en|          Beautiful.|
|1126930622176530433|1126930622176530433|Fri May 10 19:22:...|1557516172342|  en|RT @TrulyTafakari...|
|1126930622797230080|1126930622797230080|Fri May 10 19:22:...|1557516172490|  en|RT @matthewstolle...|
|1126930623275393024|1126930623275393024|Fri May 10 19:22:...|1557516172604|  en|RT @22shtnamas: H...|
|1126930623661248513|1126930623661248513|Fri May 10 19:22:...|1557516172696|  en|RT @angryislander...|
+-------------------+-------------------+--------------------+-------------+----+--------------------+
only showing top 20 rows


scala> df.printSchema()
root
 |-- id: long (nullable = true)
 |-- id_str: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)


scala> df.select("text").show()
+--------------------+                                                          
|                text|
+--------------------+
|RT @NSA_QIL: http...|
|RT @MauliConty: C...|
|RT @DebbieDebbieD...|
|[Update: RAW in N...|
|@bottomrung @Mich...|
|RT @fs0c131y: I t...|
|@SeanQuigley87 @P...|
|Newcastle-upon-Ty...|
|RT @jaffathecake:...|
|WHOA! This is hig...|
|@photokool6 Hi Sa...|
|Catch up on the b...|
|RT @coughetycough...|
|Gomorrah S.4 comi...|
|I see some of my ...|
|          Beautiful.|
|RT @TrulyTafakari...|
|RT @matthewstolle...|
|RT @22shtnamas: H...|
|RT @angryislander...|
+--------------------+
only showing top 20 rows

Hive External Table csv

[root@karthik-sdx-base-1 ~]# beeline
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.0.99-1.cdh6.0.99.p0.129/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.0.99-1.cdh6.0.99.p0.129/jars/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.0.99-1.cdh6.0.99.p0.129/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.0.99-1.cdh6.0.99.p0.129/jars/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Beeline version 3.1.0.6.0.99.0-129 by Apache Hive


beeline> !connect jdbc:hive2://karthik-sdx-base-1.vpc.cloudera.com:10000/default;principal=hive/[email protected]
Connecting to jdbc:hive2://karthik-sdx-base-1.vpc.cloudera.com:10000/default;principal=hive/[email protected]
Connected to: Apache Hive (version 3.1.0.6.0.99.0-129)
Driver: Hive JDBC (version 3.1.0.6.0.99.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Create external table

0: jdbc:hive2://karthik-sdx-base-1.vpc.cloude> CREATE EXTERNAL TABLE FL_INS(`policyID` STRING,`statecode` STRING,`county` STRING,`eq_site_limit` STRING,`hu_site_limit` STRING,`fl_site_limit` STRING,`fr_site_limit` STRING,`tiv_2011` STRING,`tiv_2012` STRING,`eq_site_deductible` STRING,`hu_site_deductible` STRING,`fl_site_deductible` STRING,`fr_site_deductible` STRING,`point_latitude` STRING,`point_longitude` STRING,`line` STRING,`construction` STRING,`point_granularity` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," LOCATION "s3a://cldr-cdp-dl-1/OUTPUT/shared/insurance/FL/" TBLPROPERTIES ("skip.header.line.count"="1");

Execute query

0: jdbc:hive2://karthik-sdx-base-1.vpc.cloude> select policyid, fl_site_deductible, line from fl_ins LIMIT 10;
INFO  : Compiling command(queryId=hive_20190511114553_7e44a7d9-9044-432a-b9b3-94954004e457): select policyid, fl_site_deductible, line from fl_ins LIMIT 10
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:policyid, type:string, comment:null), FieldSchema(name:fl_site_deductible, type:string, comment:null), FieldSchema(name:line, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190511114553_7e44a7d9-9044-432a-b9b3-94954004e457); Time taken: 1.438 seconds
INFO  : Executing command(queryId=hive_20190511114553_7e44a7d9-9044-432a-b9b3-94954004e457): select policyid, fl_site_deductible, line from fl_ins LIMIT 10
INFO  : Completed executing command(queryId=hive_20190511114553_7e44a7d9-9044-432a-b9b3-94954004e457); Time taken: 0.763 seconds
INFO  : OK
+-----------+---------------------+--------------+
| policyid  | fl_site_deductible  |     line     |
+-----------+---------------------+--------------+
| 119736    | 0                   | Residential  |
| 448094    | 0                   | Residential  |
| 206893    | 0                   | Residential  |
| 333743    | 0                   | Residential  |
| 172534    | 0                   | Residential  |
| 785275    | 0                   | Residential  |
| 995932    | 0                   | Commercial   |
| 223488    | 0                   | Residential  |
| 433512    | 0                   | Residential  |
| 142071    | 0                   | Residential  |
+-----------+---------------------+--------------+
10 rows selected (8.23 seconds)

Hive and ORC

create external table

set hive.cli.errors.ignore=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing = true;
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

CREATE DATABASE IF NOT EXISTS twitter;
USE twitter;

CREATE EXTERNAL TABLE IF NOT EXISTS tweetsummary_part 
(id BIGINT, id_str STRING, created_at STRING, 
timestamp_ms STRING, lang STRING, text STRING) 
PARTITIONED BY (year string, month string, day string)
STORED AS ORC
location 's3a://cldr-cdp-dl-1/INPUT/tweets/summary/';

ALTER TABLE tweetsummary_part ADD PARTITION (year='2019',month='05',day='10') 
location 's3a://cldr-cdp-dl-1/INPUT/tweets/summary/2019/05/10';

### query twitter table

0: jdbc:hive2://karthik-sdx-base-1.vpc.cloude> select id, text from tweetsummary_part LIMIT 10;
INFO  : Compiling command(queryId=hive_20190511121745_378db5aa-9b04-4374-9cd5-634525e70e79): select id, text from tweetsummary_part LIMIT 10
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:bigint, comment:null), FieldSchema(name:text, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190511121745_378db5aa-9b04-4374-9cd5-634525e70e79); Time taken: 0.142 seconds
INFO  : Executing command(queryId=hive_20190511121745_378db5aa-9b04-4374-9cd5-634525e70e79): select id, text from tweetsummary_part LIMIT 10
INFO  : Completed executing command(queryId=hive_20190511121745_378db5aa-9b04-4374-9cd5-634525e70e79); Time taken: 0.013 seconds
INFO  : OK
+----------------------+----------------------------------------------------+
|          id          |                        text                        |
+----------------------+----------------------------------------------------+
| 1126928634839732224  | RT @the_JACKTRADO: I asked Google assistant what women want.

I had to shut down my phone cause it won't stop talking. |
| 1126928633220743169  | RT @NHSBT_RD: Today's Google Doodle Honors the life and work of haematologist Dr Lucy Wills, on what would have been her 131st Birthday. He… |
| 1126928636811055105  | .. retarded delusion .. https://t.co/JKcK3H52Kg via @YouTube .. great job and well done and the white psychotic cra… https://t.co/GNUbo5PhZI |
| 1126928641005371392  | RT @MilesTheIcon: Google when you sign in from a new device https://t.co/PBk0mAh0yB |
| 1126928642850869248  | Hey @Shopify are you seriously going to kill off your Google Shopping app and replace it with this new one that has… https://t.co/KVhMb9xAQl |
| 1126928645975678976  | RT @RealCandaceO: WOW!!! The students of the Colorado shooting WALKED OUT of the vigil saying that they would NOT allow their pain to be po… |
| 1126928648232153091  | Google is helping train military spouses for remote IT work https://t.co/EzQUENyLOh |
| 1126928650497134598  | RT @RealSUPREME_G: IT'S OFFICIAL

My New Single "Brand New Drip" Is Now Available On Spotify, Tidal, Amazon And Google Play And It's Coming… |
| 1126928650425831425  | RT @HMLoeschMcK: @AOC perhaps you should rethink your comments on the #HeartBeatLaws it is enough time! I also recommend all you abortion l… |
| 1126928654024495104  | @matthallsy Google support and tell me the definition. I do t understand when football changed and all these spoilt… https://t.co/lAzoNw0JRh |
+----------------------+----------------------------------------------------+
10 rows selected (2.054 seconds)
0: jdbc:hive2://karthik-sdx-base-1.vpc.cloude> Closing: 0: jdbc:hive2://karthik-sdx-base-1.vpc.cloudera.com:10000/default;principal=hive/[email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
policies		policies
topologies		topologies
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

idbroker-demo

Data Lake Bucket Structure

Roles to Protect Data Lake Structure

HDF ingest of twitter data

shell based users

Usecases:

Personas - separate tabs in terminal

ETL Data Worker (distcp)

spark-shell data scientest

Hive External Table csv

Create external table

Execute query

Hive and ORC

create external table

About

Releases

Packages

lmccay/idbroker-demo

Folders and files

Latest commit

History

Repository files navigation

idbroker-demo

Data Lake Bucket Structure

Roles to Protect Data Lake Structure

HDF ingest of twitter data

shell based users

Usecases:

Personas - separate tabs in terminal

ETL Data Worker (distcp)

spark-shell data scientest

Hive External Table csv

Create external table

Execute query

Hive and ORC

create external table

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages