tendulkaramey/Data-Hub

#Steps:

    • Generate data files with Python scripts.
    • Use Pig scripts to process the data, extract the important information in a structured format, store it in txt files, and upload the files to HDFS.
    • Store the processed, structured data in Hive tables.
    • Use Hive queries to extract insights from the data, such as the top 5 products purchased and the distribution of payment methods.
    • Store the Hive query results in a MySQL database.
    • Connect Django to the MySQL database to read the insights and show analytics in the user interface.
    • Repeat the above steps periodically using a cron job or scheduler.

#Step 1:

The scripts that create the data files are stored in the datascripts folder. The Python files are:

  1. advertiselog.py
  2. productlog.py
  3. serverlog.py

You can create new log files by tuning the scripts, or use the already created files from the datascripts folder. The log files are:

  1. advertise.log
  2. server.log
  3. product.log
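
As an illustration of what tuning a generator script can involve, here is a minimal sketch of a seeded log generator. The event names, the comma-separated field layout, and the function names are assumptions made for this example; the actual scripts in datascripts may produce a different format.

```python
import random
from datetime import datetime, timedelta

# Hypothetical event types; the real scripts may log different fields.
EVENTS = ["login", "view", "cart", "purchase"]


def make_log_lines(count, seed=0, start=datetime(2024, 1, 1)):
    """Return `count` comma-separated log lines, one second apart."""
    rng = random.Random(seed)  # seeded so that repeated runs match
    lines = []
    for i in range(count):
        timestamp = start + timedelta(seconds=i)
        user_id = rng.randint(1, 100)
        event = rng.choice(EVENTS)
        lines.append(f"{timestamp.isoformat()},{user_id},{event}")
    return lines


def write_log(path, count, seed=0):
    """Write the generated lines to `path`, one event per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(make_log_lines(count, seed)) + "\n")
```

Changing `count`, `seed`, or the `EVENTS` list is the kind of tuning that yields a new log file, for example `write_log("product.log", 1000, seed=42)`.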

#Step 2:

Set the load path of the data files in the scripts in pigscripts to match your file system, then run the scripts. The output will be stored in the pigoutputs folder. The Pig scripts are:

  1. pigproduct.pig
  2. advertise.pig
  3. serverlogs.pig

Then upload the Pig output files to HDFS, into:

  1. /ecommercedata/logindata/
  2. /ecommercedata/cartdata/
  3. /ecommercedata/purchasedata/
  4. /ecommercedata/pviewdata/
  5. /ecommercedata/advertisedata/
  6. /ecommercedata/serverdata/

Pig may generate several output files; upload every one of them to HDFS.
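
Since the number of Pig output files can vary, the upload can be scripted rather than typed by hand. The sketch below builds one `hdfs dfs -put` command per file in a local directory. The function names and the local directory layout are assumptions, and skipping names that start with `_` or `.` (such as marker or checksum files) is a choice made for this example, not something the project prescribes.

```python
import shlex
import subprocess
from pathlib import Path


def hdfs_put_commands(local_dir, hdfs_dir):
    """Build the hdfs commands that upload every data file in local_dir."""
    # Create the target directory first; -p makes this safe to repeat.
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in sorted(Path(local_dir).iterdir()):
        # Assumption: names starting with "_" or "." are not data files.
        if path.is_file() and not path.name.startswith(("_", ".")):
            # -f overwrites a file of the same name already in HDFS.
            commands.append(["hdfs", "dfs", "-put", "-f", str(path), hdfs_dir])
    return commands


def upload(local_dir, hdfs_dir, dry_run=True):
    """Print the commands (dry run) or execute them one by one."""
    for command in hdfs_put_commands(local_dir, hdfs_dir):
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)
```

For example, `upload("pigoutputs/logindata", "/ecommercedata/logindata/")` prints the commands for one target directory, where the local path is hypothetical. Pass `dry_run=False` to run them.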

#Step 3:

First run initialisehive.hiveql, which creates the database and tables on your system. Then run hiveinsert.hiveql, which inserts the Pig output data into the Hive tables.

#Step 4:

Now run the Hive queries and store their output in a folder called hiveoutputs.

The queries and their output filenames are:

  1. adv1.hiveql --> a1.txt
  2. adv2.hiveql --> a2.txt
  3. adv3.hiveql --> a3.txt
  4. productquery.hiveql --> p1.txt
  5. productquery1.hiveql --> p2.txt
  6. productquery2.hiveql --> p3.txt
  7. productquery3.hiveql --> p4.txt
  8. productquery4.hiveql --> p5.txt
  9. ser1.hiveql --> s1.txt
  10. ser2.hiveql --> s2.txt
  11. ser3.hiveql --> s3.txt
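
The mapping above can be driven from a small script. This is a sketch, assuming the `hive` command is on the PATH and that the query files sit in one directory; `run_queries` and its parameters are names invented for this example. It uses `hive -S -f <file>` (silent mode, run a script file) and redirects standard output to the listed file.

```python
import subprocess
from pathlib import Path

# Query file -> output file, as listed in this step.
QUERY_OUTPUTS = {
    "adv1.hiveql": "a1.txt",
    "adv2.hiveql": "a2.txt",
    "adv3.hiveql": "a3.txt",
    "productquery.hiveql": "p1.txt",
    "productquery1.hiveql": "p2.txt",
    "productquery2.hiveql": "p3.txt",
    "productquery3.hiveql": "p4.txt",
    "productquery4.hiveql": "p5.txt",
    "ser1.hiveql": "s1.txt",
    "ser2.hiveql": "s2.txt",
    "ser3.hiveql": "s3.txt",
}


def run_queries(query_dir=".", output_dir="hiveoutputs", runner=subprocess.run):
    """Run every query and write its standard output to the mapped file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for query, filename in QUERY_OUTPUTS.items():
        target = out / filename
        with open(target, "w", encoding="utf-8") as handle:
            command = ["hive", "-S", "-f", str(Path(query_dir) / query)]
            # check=True stops at the first query that fails.
            runner(command, stdout=handle, check=True)
        written.append(target)
    return written
```

The `runner` parameter exists only so that the function can be tried without a Hive installation by passing a stand-in function.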

#Step 5:

Now run the Python script pythonmysql.py, which collects the Hive output and stores it in the MySQL database. You need to set up the database connection in the script based on the parameters of your system.
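
To show the general idea of this step, here is a minimal sketch of parsing a Hive output file and inserting its rows through a Python DB-API cursor. It is not the code in pythonmysql.py. It assumes the output files are tab-separated (the Hive CLI default), and the function names and the table name in the usage note are hypothetical.

```python
def parse_hive_output(text):
    """Split tab-separated Hive output into a list of row tuples."""
    rows = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            rows.append(tuple(line.split("\t")))
    return rows


def insert_rows(cursor, table, rows):
    """Insert rows with a parameterised query; return the row count.

    `table` is placed in the SQL text directly, so it must come from
    trusted code and never from user input.
    """
    if not rows:
        return 0
    # MySQL drivers such as PyMySQL use %s as the parameter placeholder.
    placeholders = ", ".join(["%s"] * len(rows[0]))
    sql = f"INSERT INTO {table} VALUES ({placeholders})"
    cursor.executemany(sql, rows)
    return len(rows)
```

With a driver such as PyMySQL or mysql-connector-python, the cursor would come from a connection opened with your own host, user, password, and database, for example `insert_rows(conn.cursor(), "top_products", rows)` followed by `conn.commit()`.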

#Step 6:

  • Install Python and Django.
  • The Django project is in djangohive.
  • Run the following commands:
  • python manage.py migrate, which creates the tables in the database.
  • python manage.py runserver, which starts the server.
  • Now go to localhost:8000/dashboard, which is the starting page of the application.

#Step 7:

You can set up cron jobs or schedulers to automate the above steps on your system.
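
One way to automate the steps is a single driver script that cron calls. The sketch below runs a list of commands in order and stops at the first failure. The script names come from this README, but the folder prefixes, the `pig -x local` mode, the file name run_pipeline.py, and the crontab schedule are assumptions to adapt to your system.

```python
import subprocess

# Ordered (name, command) pairs. Paths are assumptions; adjust them to
# where the scripts live on your system, and add the remaining steps.
STEPS = [
    ("generate product log", ["python", "datascripts/productlog.py"]),
    ("process with pig", ["pig", "-x", "local", "pigscripts/pigproduct.pig"]),
    ("load hive tables", ["hive", "-f", "hiveinsert.hiveql"]),
    ("store in mysql", ["python", "pythonmysql.py"]),
]


def run_pipeline(steps, runner=subprocess.run):
    """Run steps in order; return the first failing step name, or None."""
    for name, command in steps:
        result = runner(command)
        if result.returncode != 0:
            return name  # later steps depend on this one, so stop here
    return None


# Hypothetical crontab entry that runs the pipeline every day at 02:00:
# 0 2 * * * cd /path/to/Data-Hub && python run_pipeline.py >> pipeline.log 2>&1
```

Stopping at the first failure keeps a broken earlier step from loading stale or partial data into MySQL.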
