This project was developed as the final assignment for a Parallel Computing class at HUS. It focuses on using parallel computing techniques to speed up web scraping: large amounts of data are extracted from web pages in less time by dividing the workload among multiple processes running in parallel. Parallel computing has become increasingly important for web scraping as the amount of data available on the web continues to grow rapidly. This project aims to demonstrate those benefits and to provide a practical example of their implementation.
More specifically, the project scrapes job listings related to the Java programming language from the itviec.com website. The extracted data includes job titles, company names, locations, and other relevant information. Parallelizing the scraping process makes it faster and more efficient, allowing a larger volume of data to be extracted in a shorter period of time. The time library is used to measure how long it takes to scrape all available job listings, and every scraped job is saved to a .csv file.
Python libraries used:
- requests
- BeautifulSoup
- multiprocessing
- time (optional)
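A minimal sketch of the approach described above: a multiprocessing.Pool fans page fetches out across worker processes, each worker downloads one results page with requests and parses it with BeautifulSoup, and the combined rows are written to a .csv file while the time library measures the total duration. The listing URL, page range, and CSS selectors (`div.job`, `h3.title`, `span.company`, `span.location`) are assumptions for illustration, not the actual structure of itviec.com, and would need adjusting against the live site.

```python
import csv
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

# Assumed listing URL -- the real itviec.com Java-jobs URL may differ.
BASE_URL = "https://itviec.com/it-jobs/java"


def parse_jobs(html):
    """Extract (title, company, location) tuples from one listing page.

    The CSS selectors below are placeholders for illustration; the real
    page layout must be inspected to find the correct ones.
    """
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job"):
        title = card.select_one("h3.title")
        company = card.select_one("span.company")
        location = card.select_one("span.location")
        jobs.append((
            title.get_text(strip=True) if title else "",
            company.get_text(strip=True) if company else "",
            location.get_text(strip=True) if location else "",
        ))
    return jobs


def scrape_page(page):
    """Fetch one results page and return its parsed job rows."""
    resp = requests.get(BASE_URL, params={"page": page}, timeout=10)
    resp.raise_for_status()
    return parse_jobs(resp.text)


if __name__ == "__main__":
    start = time.time()
    # Divide the pages among 4 worker processes running in parallel.
    with Pool(processes=4) as pool:
        results = pool.map(scrape_page, range(1, 6))  # pages 1..5

    # Flatten the per-page lists and write everything to a .csv file.
    rows = [row for page_rows in results for row in page_rows]
    with open("java_jobs.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "company", "location"])
        writer.writerows(rows)

    print(f"Scraped {len(rows)} jobs in {time.time() - start:.1f}s")
```

Because fetching a page is I/O-bound and parsing is CPU-bound, `pool.map` keeps all workers busy: each process handles its own request/parse cycle independently, which is where the speedup over a sequential loop comes from.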