Veeresha M Thotigar
Sai Sampath Kumar Raigiri
Sai Srinivas Vidiyala
- First, we generated Twitter access keys from developer.twitter.com using our Twitter accounts.
- Using the Tweepy package in Python, we downloaded tweets on topics such as yoga, meditation, etc. (a download sketch appears after this list).
- With additional Python code we extracted the URLs and hashtags from the downloaded tweets; this output is our translated input (see the extraction sketch below).
- We loaded the translated input into an HDFS directory using the "hdfs dfs -copyFromLocal <source path> <HDFS destination path>" command in the terminal.
- We ran the example word count program that comes with the Hadoop installation and produced word counts for the large data set.
- Similarly, we executed a Spark word count job on the same input data; its output is in the Spark folder (a PySpark word count sketch is shown below).
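A minimal sketch of the kind of download code in tweetcrawler.py, assuming Tweepy's standard search API; the key values, search query, and output file name below are placeholders, not the exact ones we used:

```python
import csv
import tweepy

# Placeholder credentials generated from the Twitter developer portal.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate with the access keys and build an API client.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search for tweets on the chosen topics and save them to a CSV file.
# (api.search_tweets is the Tweepy 4.x name; older releases call it api.search.)
with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "text"])
    for status in tweepy.Cursor(api.search_tweets, q="yoga OR meditation",
                                tweet_mode="extended").items(500):
        writer.writerow([status.id_str, status.created_at, status.full_text])
```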
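A sketch of the URL and hashtag extraction step along the lines of Extract.py, assuming the tweet text sits in the third column of the CSV above; the file names, column index, and regular expressions are illustrative:

```python
import csv
import re

# Simple patterns for URLs and hashtags in raw tweet text.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")

# Read the downloaded tweets and write the extracted tokens to a text file,
# which becomes the translated input loaded into HDFS.
with open("tweets.csv", newline="", encoding="utf-8") as src, \
        open("translated_input.txt", "w", encoding="utf-8") as out:
    reader = csv.reader(src)
    next(reader)  # skip the header row
    for row in reader:
        text = row[2]  # tweet text column
        for token in URL_RE.findall(text) + HASHTAG_RE.findall(text):
            out.write(token + "\n")
```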
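For reference, a minimal PySpark word count over the translated input on HDFS; the HDFS paths and application name are placeholders, and the job we actually submitted may differ in detail:

```python
from pyspark import SparkContext

# Count word occurrences in the translated input stored on HDFS.
sc = SparkContext(appName="TweetWordCount")

counts = (sc.textFile("hdfs:///user/hadoop/translated_input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Write the (word, count) pairs back to HDFS.
counts.saveAsTextFile("hdfs:///user/hadoop/spark_wordcount_output")
sc.stop()
```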
We pushed our Hadoop log files, the output, and the commands we used in the terminal (recorded in "steps_hadoop_wordcount.txt").
The Hadoop folder contains the output and logs generated by the Hadoop word count program.
The tweetcrawler.py file in the Python_script folder contains the Python code for downloading tweets using the access keys and saving them into a CSV file.
The Extract.py file in the Python_script folder contains the Python code for extracting URLs and hashtags from the downloaded tweets into a text file.
The Spark folder contains the output and logs of the submitted Spark job.