This Python program is a bot that explores the pages of a website (crawling), extracts the hyperlinks from each page, and stores them for later use. Each hyperlink is requested to obtain its HTTP response code (200, 404, 403, 500, etc.) and to extract the internal and external links from the page content (scraping). The data is then stored in a file or an SQLite database.
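To illustrate this crawl/scrape loop, here is a minimal sketch of such a spider written against Scrapy's public API. The spider name, start URL, and item fields are placeholders for illustration, not the project's actual code:

import scrapy
from urllib.parse import urlparse

class LinkSpider(scrapy.Spider):
    name = "link_spider"                    # hypothetical name
    start_urls = ["https://example.com/"]   # placeholder start page
    # Let non-200 responses (404, 403, 500...) reach the callback
    # instead of being dropped by Scrapy's HttpErrorMiddleware.
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        # Record the HTTP status code of the visited page.
        yield {"url": response.url, "status": response.status}
        # Extract every hyperlink and classify it as internal or external.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            internal = urlparse(url).netloc == urlparse(response.url).netloc
            yield {"from": response.url, "to": url, "internal": internal}
            if internal:
                # Keep crawling internal pages only.
                yield response.follow(href, callback=self.parse)

A standalone spider like this could be run with scrapy runspider link_spider.py -o links.json to dump the collected items to a file; WebCrawler itself is launched through start.py, as described below.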
This program requires Python 3 and Scrapy 2.
To install Python 3, follow the official Python documentation. Pip has been included by default with the Python installer since Python 3.4; otherwise, see the official Pip documentation. Then, in your console:
pip install Scrapy
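To verify the installation (optional), the following command should print the installed Scrapy version:

python3 -c "import scrapy; print(scrapy.__version__)"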
Download WebCrawler (no pip package is currently available), go to the root of the project, then launch the program:
python3 start.py
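Once a crawl has finished, the stored results can be inspected with Python's built-in sqlite3 module. A minimal sketch, assuming a hypothetical database file (crawl.db) and table layout (links) that may differ from what WebCrawler actually writes:

import sqlite3

# Hypothetical schema: links(url TEXT, status INTEGER); adjust the file
# name and query to match the database WebCrawler actually produces.
conn = sqlite3.connect("crawl.db")
for url, status in conn.execute(
        "SELECT url, status FROM links WHERE status != 200"):
    print(status, url)  # every link that did not return 200 OK
conn.close()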
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Contact me with any contribution requests.
WebCrawler is open source software distributed under a single license: the GNU GPL version 3. As such, WebCrawler can freely be used, analyzed, modified, and redistributed under the terms of the GNU GPL version 3 license.
Thomas Gottvalles at Tesseract IT - The 100% Web Agency