Big Data Service stack

These Terraform scripts allow a user to provision the stack of OCI resources needed for Big Data Service, including the service itself:

  • Compartment where all resources will be provisioned
  • A couple of edge nodes
    • Add these edge nodes to Cloudera Manager and deploy the cluster configuration on them
    • Create a Kerberos principal "opc" and add a keytab file on the edge nodes (the password is the same as the Cloudera Manager one)
    • Create a Load Balancer in front of the edge nodes
  • Big Data Service
  • BDS admin user/password and group (in OCI)
  • Security Policies required by:
    • Big Data Service
    • Oracle Function
    • Data Catalog
    • Data Integration Service
    • Add edge nodes to a Dynamic Group and allow them to manage all resources in the demo compartment
  • Network artifacts:
    • VCN
    • Subnet
    • Service Gateways
    • NAT gateway
    • Internet gateway
    • Security lists with appropriate settings
    • Routing tables
    • DHCP
  • Public IP (and assign it to Cloudera Manager host)
  • Pull the Zeppelin image onto the Cloudera Manager host
  • Create Application and "Hello World" function
  • Create API gateway
  • Create Data Catalog Instance
  • Create Data Integration workspace
  • Scripts to download test data sets

To provision it, the user has to follow these steps:

  1. Provision a client compute instance in OCI (this instance can be removed after the Terraform scripts finish their work). There are multiple ways to do this, for example via the OCI console.
    • As part of provisioning you may need to create an ssh key pair. Detailed information is available in the OCI documentation; to keep it short and to match the env-vars.sh settings, just run:

sudo ssh-keygen -t rsa -N "" -b 2048 -C demoBDSkey -f userdata/demoBDSkey

Note: to keep it simple, I suggest using the same key pair for the BDS cluster and the edge nodes.

    • Create a Dynamic Group and put this host into it. More details on how to do this can be found in the OCI documentation.
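If you have the OCI CLI configured, a minimal sketch of creating such a Dynamic Group looks like this (the group name, description, and instance OCID are placeholders to replace with your own values):

oci iam dynamic-group create \
  --name bds-client-dg \
  --description "Client host used to run the BDS Terraform stack" \
  --matching-rule "ANY {instance.id = 'ocid1.instance.oc1..<your_client_instance_ocid>'}"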

  2. After the host is provisioned, ssh to it, like this:

ssh -i myPrivateKey opc@<ip address>

Note: you can find the host's public IP on the instance details page:

(screenshot: public IP on the instance details page)

  3. Install git and terraform:

$ sudo yum install -y git terraform

  4. Clone the Terraform repository:

$ git clone https://github.com/filanovskiy/terraform-oci-bds.git

  5. Go to the repository directory and initialize the Terraform provider:

$ cd terraform-oci-bds

$ terraform init

  6. After this, the user has to fill in the environment variables in env-vars.sh:

| Name of the variable | Description | Comments |
|---|---|---|
| TF_VAR_tenancy_ocid | Tenancy OCID | Has to be updated |
| TF_VAR_compartment_name | Name of the compartment that will be created | Can be left as is |
| TF_VAR_home_region | Home region | Has to be updated |
| TF_VAR_region | Region where the stack will be provisioned | Can be left as is |
| TF_VAR_bds_instance_cluster_admin_password | Cloudera Manager admin password | Better to update |
| TF_VAR_ssh_keys_prefix | Prefix of the ssh-rsa key pair | Can be left as is (don't forget to generate the keys) |
| TF_VAR_ssh_public_key | Path to the public key | Can be left as is (don't forget to generate the keys) |
| TF_VAR_ssh_private_key | Path to the private key | Can be left as is (don't forget to generate the keys) |
| TF_VAR_bds_cluster_name | Big Data Service cluster name | Can be left as is |

To obtain the tenancy information, go to the OCI web UI, click on the user icon in the upper right corner, and choose Tenancy:

(screenshot: tenancy details page)

On this page you can obtain the "TF_VAR_tenancy_ocid" and "TF_VAR_home_region" values.

Note: you may want to generate an ssh key pair. You can simply run these commands to match the env-vars.sh config:

$ sudo ssh-keygen -t rsa -N "" -b 2048 -C demoBDSkey -f userdata/demoBDSkey

$ sudo chown opc:opc userdata/demoBDSkey*
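For reference, a filled-in env-vars.sh is essentially a list of exports along these lines (a sketch with placeholder values; use the env-vars.sh shipped in the repository as the actual template):

export TF_VAR_tenancy_ocid="ocid1.tenancy.oc1..<your_tenancy_ocid>"
export TF_VAR_compartment_name="bds-tf-demo"
export TF_VAR_home_region="<your-home-region>"
export TF_VAR_region="<region-for-the-stack>"
export TF_VAR_bds_instance_cluster_admin_password="<cloudera-manager-admin-password>"
export TF_VAR_ssh_keys_prefix="demoBDSkey"
export TF_VAR_ssh_public_key="userdata/demoBDSkey.pub"
export TF_VAR_ssh_private_key="userdata/demoBDSkey"
export TF_VAR_bds_cluster_name="<cluster-name>"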

  7. Apply these environment variables:

$ source env-vars.sh
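To double-check that the variables are actually set in the current shell, you can list them (a simple sanity check, not part of the repository scripts):

$ env | grep ^TF_VAR_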

  8. Run the provisioning:

$ terraform apply -auto-approve
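If you prefer to review the changes before applying them, you can run a plan first (standard Terraform, not required by the scripts):

$ terraform plan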

  9. After the script finishes, the user will see output containing:
  • Edge node IPs
  • Compartment OCID
  • BDS Admin username
  • BDS Admin one time password (you have to change it right after login)
  • Load balancer (balancing edge nodes) IP
  • Cloudera Manager Public IP
Example:

bds_admin_usr_one_time_password = r}HZkIp9M
cm_public_ip = 132.145.147.5
compartment_OCID = ocid1.compartment.oc1…aaaaaaa…qfeq
edge_node_ip = [
129.213.133.80,
193.122.136.216,
]
lb_public_ip = 193.122.133.54
resource_compartment_name = bds-tf-demo
user_name = bds_admin_usr
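If you need these values again later, Terraform can re-print them at any time from the repository directory:

$ terraform output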

  10. To ssh to an edge node, run: ssh -i userdata/demoBDSkey opc@<edge ip address>

In case you want to deploy client roles on the edge nodes (recommended), you have to run the following script:

ssh -i userdata/demoBDSkey opc@<edge ip address> /home/opc/add-to-cm.sh

  11. To log in from an edge node to the utility node, run:

ssh -i .ssh/bdsKey opc@$CM_IP

  12. In case you want to generate some test datasets, log in to the Cloudera Manager node: ssh -i .ssh/bdsKey opc@$CM_IP and after this run the script: [opc@bdsdemoun0 ~]$ /home/opc/generate_tpcds_data.sh

After the script is done, you can check the datasets on HDFS:

$ hadoop fs -ls /tmp/tpcds/text
Found 27 items
drwxr-xr-x - opc supergroup 0 2020-09-22 03:03 /tmp/tpcds/text/call_center
drwxr-xr-x - opc supergroup 0 2020-09-22 03:03 /tmp/tpcds/text/catalog_page
drwxr-xr-x - opc supergroup 0 2020-09-22 03:00 /tmp/tpcds/text/catalog_returns

drwxr-xr-x - opc supergroup 0 2020-09-22 03:03 /tmp/tpcds/text/web_site
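To get a feel for the total size of the generated data, you can also run a standard HDFS disk-usage check:

$ hadoop fs -du -s -h /tmp/tpcds/text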

Hive:

$ hive -e "show tables" --database tpcds_csv

customers
date_dim
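As an additional sanity check, you can count the rows of one of the listed tables, for example:

$ hive --database tpcds_csv -e "select count(*) from date_dim"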

This dataset is also copied into the Object Store:

(screenshot: tpcds text data in the Object Store bucket)

  13. Alternatively, you can download NYC bike trip and weather data into a bucket. To accomplish this, go to one of the edge nodes and run:

[opc@bds-demo-egde0 ~]$ ./downloadbikes.sh

After the command is done, you can check that the data has appeared in the Object Store bucket:

(screenshot: bike data in the Object Store bucket)
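If you prefer the CLI over the console and have the OCI CLI configured on the node, you can list the uploaded objects as well (assuming the bucket is named bikes_download, the same name used by the harvesting step below):

$ oci os object list --bucket-name bikes_download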

The second dataset available for download is weather data. To upload it, just run:

[opc@bds-demo-egde0 ~]$ ./downloadweather.sh

After the command is done, you can check that the data has appeared in the Object Store bucket:

(screenshot: weather data in the Object Store bucket)

  14. If you want to harvest data from one of these buckets into Data Catalog, you will need to create some configuration (register an Object Store data asset and create a connection):

[opc@bds-demo-egde0 ~]$ dcat/dcat_stack.sh

After the one-time configuration is done, run a harvesting job against a bucket:

[opc@bds-demo-egde0 ~]$ dcat/dcat_harvest.sh bikes_download

You can verify the results of the job in the OCI Data Catalog console:

(screenshot: harvested data entities in the Data Catalog console)

  15. If you want to run some transformations on your data (for example, convert from CSV to Parquet), you may use the Data Integration Service. First you need to register a data asset (Object Store). A simple way to do so is to run the script:

[opc@bds-demo-egde0 ~]$ dis/dis_crete_da.sh

The next script will create a Data Flow for the weather data, in case you downloaded it in the previous step:

[opc@bds-demo-egde0 dis]$ dis/create_df_weather.sh

After the script is done, you can check the created Data Flow in the OCI UI:

(screenshots: the created Data Flow in the Data Integration console)
