# WebCrawler

This Python program is a bot that explores the pages of a website (crawling), extracts the hyperlinks from each page, and stores them for later use. Each hyperlink found on the website is requested to obtain its HTTP response code (200, 404, 403, 500, etc.) and to extract the internal and external links from the page content (scraping). The data is then stored in a file or an SQLite database.
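
To give an idea of what this logic looks like in practice, here is a minimal Scrapy spider sketch. It is not the project's actual code: the class, spider name, and start URL are hypothetical placeholders. It records each page's response code and classifies extracted links as internal or external:

```python
from urllib.parse import urlparse

import scrapy
from scrapy.http import HtmlResponse


class LinkSpider(scrapy.Spider):
    name = "linkspider"                    # hypothetical name
    start_urls = ["https://example.com/"]  # hypothetical start URL
    custom_settings = {
        # Let non-2xx responses (404, 403, 500, ...) reach parse()
        # so their status codes can be recorded as well.
        "HTTPERROR_ALLOW_ALL": True,
    }

    def parse(self, response):
        # Record the HTTP response code of the visited page.
        yield {"url": response.url, "status": response.status}

        # Only HTML pages can be scraped for further links.
        if not isinstance(response, HtmlResponse):
            return

        site = urlparse(response.url).netloc
        for href in response.css("a::attr(href)").getall():
            link = response.urljoin(href)
            is_internal = urlparse(link).netloc == site
            yield {"url": link, "internal": is_internal}
            if is_internal:
                # Keep crawling within the site; Scrapy's built-in
                # duplicate filter avoids revisiting the same URL.
                yield response.follow(link, callback=self.parse)
```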

## 🛠️ Installation and execution

This program requires Python 3 and Scrapy 2.

To install Python 3, see the official Python documentation. Pip has been included by default with the Python installer since Python 3.4; if you need to install it separately, see the official pip documentation. Then, in your console:

```sh
pip install Scrapy
```

Download WebCrawler (there is currently no pip package), go to the root of the project, then launch the program:

```sh
start.py
```
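
As a rough, hypothetical sketch of what a launcher like `start.py` could do (this is not the project's actual code), Scrapy spiders can be run programmatically with `CrawlerProcess`. The `FEEDS` setting and the output filename below are assumptions, and `LinkSpider` refers to the hypothetical spider sketched earlier:

```python
from scrapy.crawler import CrawlerProcess

# LinkSpider is the hypothetical spider sketched above; in a real
# project it would be imported from the spider's module.
process = CrawlerProcess(settings={
    # Export scraped items as JSON Lines (assumed output file name).
    "FEEDS": {"links.jl": {"format": "jsonlines"}},
})
process.crawl(LinkSpider)
process.start()  # blocks until the crawl finishes
```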

## 🍰 Contribution

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Contact me for any contribution request.

## 📝 License

WebCrawler is open-source software distributed under a single license: the GNU GPL version 3. As such, WebCrawler can be freely used, analyzed, modified, and redistributed under the terms of the GNU GPL version 3.

## 🧑 Author

Thomas Gottvalles at Tesseract IT - the 100% WEB agency

## 🏄 Have fun