The second part of a three-part case study in reading a big (21 GB) text file using C, Python, PySpark and Spark-Scala. This part deals with using PySpark for our processing.
We try to read the same big file (21 GB) we read before with Python, but this time using Spark. It won't be a true test, as we are only running this on my local PC and not on a proper cluster; I just thought it would be interesting to try it out. To recap, the data file is about 21 gigabytes in size and holds approximately 366 million pipe-separated records. The first 10 records are shown below:
The second field in the above file can range between 1 and 56, and the goal was to split up the original file so that all the records with the same value for the second field would be grouped together in the same file, i.e. we would end up with 56 separate files (period1.txt, period2.txt ... period56.txt), each containing approximately 6 million records.
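To make the grouping rule concrete, here is a minimal plain-Python sketch of the intended result on a tiny sample. It is not the Spark job itself, only an illustration of the rule "records with the same second field go to the same periodN.txt file". The function name `split_by_period`, the sample records and the temporary paths are all made up for the example.

```python
import os
import tempfile


def split_by_period(input_path, output_dir, field_index=1, sep="|"):
    """Write each record to period<N>.txt, where N is the record's second field.

    Hypothetical helper for illustration only; it keeps one open file
    handle per distinct period value (at most 56 for the real data).
    """
    os.makedirs(output_dir, exist_ok=True)
    handles = {}  # period value (as text) -> open output file
    try:
        with open(input_path, "r") as src:
            for line in src:
                period = line.split(sep)[field_index].strip()
                if period not in handles:
                    out_name = os.path.join(output_dir, "period%s.txt" % period)
                    handles[period] = open(out_name, "w")
                handles[period].write(line)
    finally:
        for fh in handles.values():
            fh.close()
    # Return the period values seen, in numeric order
    return sorted(handles, key=int)


# Tiny made-up sample: three pipe-separated records, two distinct periods
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as f:
        f.write("100|1|a\n200|2|b\n300|1|c\n")
    periods = split_by_period(sample, os.path.join(tmp, "out"))
    print(periods)  # prints ['1', '2']
```

On the real file the same rule would produce period1.txt through period56.txt.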
I ran this on a Windows 7 PC with 16 GB of RAM using Python version 3.5, PySpark 2.1 and a Jupyter notebook. I used the same "big file" as was used in my other repository, read-big-file-with-python.
The job took 37 minutes to complete, but bear in mind that a bit of post-processing would still be needed to collect all the disparate files together. This compares with the 18 minutes it took to process the same file using just Python 3.6 on the same PC and the 54 minutes it took a C program to process it on an HP Alpha box.
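The post-processing step could look something like the sketch below. It assumes the Spark job left, for each period, a directory of `part-*` files (Spark's usual output naming) that need to be concatenated into a single file. The directory layout, the file names and the helper `merge_part_files` are assumptions for illustration, not taken from the original job.

```python
import glob
import os
import shutil
import tempfile


def merge_part_files(part_dir, merged_path):
    """Concatenate every part-* file in part_dir into merged_path.

    Hypothetical helper: files are merged in sorted name order, and
    marker files such as _SUCCESS are skipped because they do not
    match the part-* pattern.
    """
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(merged_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, so large parts are fine
    return len(part_paths)


# Made-up example: two small part files plus a _SUCCESS marker
with tempfile.TemporaryDirectory() as tmp:
    part_dir = os.path.join(tmp, "period1_parts")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-00000"), "w") as f:
        f.write("100|1|a\n")
    with open(os.path.join(part_dir, "part-00001"), "w") as f:
        f.write("300|1|c\n")
    open(os.path.join(part_dir, "_SUCCESS"), "w").close()

    merged = os.path.join(tmp, "period1.txt")
    merged_count = merge_part_files(part_dir, merged)
    with open(merged, "r") as f:
        merged_text = f.read()
    print(merged_count)  # prints 2
```

Copying in binary mode with `shutil.copyfileobj` avoids loading a whole part file into memory, which matters when each period holds millions of records.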
Over time this case study morphed into 5 parts. You can read the others here: