Do you work with large amounts of data? Are you curious how to analyze them effectively? If so, it's time to start using Spark! This workshop will familiarize you with the basic concepts of Spark and the PySpark library. You will learn about RDDs (the basic data containers), how to work with them using the map-reduce concept, how to transform the data, and what the difference between transformations and actions is. You will practice solving concrete problems. Finally, you will learn how to visualize and evaluate your solution.
The aim of this workshop is to familiarize participants with the basic concepts of Spark and the PySpark library. The workshop is divided into two parts (as specified in the outline). The first part is an introduction to the Spark concept, RDDs, and the most important actions, transformations, and aggregation functions. The second part, with additional exercises, will help you practice these concepts on your own.
Part I
- Setup
- What is PySpark and why use it?
- DataFrames and RDDs
- Actions
- Transformations
- Caching DataFrames
- Debugging
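The distinction between transformations and actions comes down to lazy evaluation: transformations only describe a computation, while actions trigger it. As a preview, the idea can be sketched with a pure-Python analogy using generators (this is not real PySpark code; the Spark equivalents are only noted in the comments):

```python
# Pure-Python analogy for Spark's lazy evaluation (not actual PySpark code).
# In Spark, transformations like map() and filter() only build a plan;
# an action like collect() or count() triggers the computation.

data = range(1, 6)  # stand-in for an RDD of the numbers 1..5

# "Transformations": generators are lazy, nothing is computed yet.
mapped = (x * x for x in data)           # like rdd.map(lambda x: x * x)
filtered = (x for x in mapped if x > 5)  # like .filter(lambda x: x > 5)

# "Action": materializing the generator forces the computation,
# similar to calling .collect() on an RDD.
result = list(filtered)
print(result)  # [9, 16, 25]
```

Just as with the generators above, chaining Spark transformations is cheap; the work happens only when an action asks for results.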
Part II
- Basic exercises
- I/O
- Advanced exercises - word counting
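Word counting is the classic map-reduce exercise: map each word to a (word, 1) pair, then reduce by key to sum the counts. A minimal pure-Python sketch of the pattern (in the workshop itself this would use RDD operations; the PySpark form shown in the comments is an assumption about the exercise, not its prescribed solution):

```python
# Word counting with the map-reduce pattern, in plain Python.
# The typical PySpark version would look like:
#   rdd.flatMap(str.split) \
#      .map(lambda w: (w, 1)) \
#      .reduceByKey(lambda a, b: a + b)

from functools import reduce

lines = ["spark makes big data easy", "big data big results"]

# Map phase: split lines into words and emit (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts per word.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)  # e.g. {'spark': 1, 'makes': 1, 'big': 3, 'data': 2, ...}
```

The same two-phase shape carries over directly to Spark, where each phase runs in parallel across partitions of the data.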