# WebCrawler

This Python program is a bot that explores the pages of a website (crawling), extracts the hyperlinks from each page, and stores them for later use. Each hyperlink found on the website is requested to obtain its HTTP response code (200, 404, 403, 500, etc.) and to extract the internal and external links from the page content (scraping). The data is then stored in a file or an SQLite database.
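
To give an idea of what this logic looks like in practice, here is a minimal Scrapy spider sketch. It is not the project's actual code: the class, spider name, and start URL are hypothetical placeholders. It records each page's response code and classifies extracted links as internal or external:

```python
from urllib.parse import urlparse

import scrapy
from scrapy.http import HtmlResponse


class LinkSpider(scrapy.Spider):
    name = "linkspider"                    # hypothetical name
    start_urls = ["https://example.com/"]  # hypothetical start URL
    custom_settings = {
        # Let non-2xx responses (404, 403, 500, ...) reach parse()
        # so their status codes can be recorded as well.
        "HTTPERROR_ALLOW_ALL": True,
    }

    def parse(self, response):
        # Record the HTTP response code of the visited page.
        yield {"url": response.url, "status": response.status}

        # Only HTML pages can be scraped for further links.
        if not isinstance(response, HtmlResponse):
            return

        site = urlparse(response.url).netloc
        for href in response.css("a::attr(href)").getall():
            link = response.urljoin(href)
            is_internal = urlparse(link).netloc == site
            yield {"url": link, "internal": is_internal}
            if is_internal:
                # Keep crawling within the site; Scrapy's built-in
                # duplicate filter avoids revisiting the same URL.
                yield response.follow(link, callback=self.parse)
```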

## 🛠️ Installation and execution

This program requires Python 3 and Scrapy 2.

To install Python 3, see the official Python documentation. Pip has been included by default with the Python installer since Python 3.4; if you need to install it separately, see the official pip documentation. Then, in your console:

```sh
pip install Scrapy
```

Download WebCrawler (there is currently no pip package), go to the root of the project, then launch the program:

```sh
start.py
```
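
As a rough, hypothetical sketch of what a launcher like `start.py` could do (this is not the project's actual code), Scrapy spiders can be run programmatically with `CrawlerProcess`. The `FEEDS` setting and the output filename below are assumptions, and `LinkSpider` refers to the hypothetical spider sketched earlier:

```python
from scrapy.crawler import CrawlerProcess

# LinkSpider is the hypothetical spider sketched above; in a real
# project it would be imported from the spider's module.
process = CrawlerProcess(settings={
    # Export scraped items as JSON Lines (assumed output file name).
    "FEEDS": {"links.jl": {"format": "jsonlines"}},
})
process.crawl(LinkSpider)
process.start()  # blocks until the crawl finishes
```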

## 🍰 Contribution

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Contact me for any contribution request.

## 📝 License

WebCrawler is open-source software distributed under a single license: the GNU GPL version 3. As such, WebCrawler can be freely used, analyzed, modified, and redistributed under the terms of the GNU GPL version 3.

## 🧑 Author

Thomas Gottvalles at Tesseract IT - the 100% WEB agency

## 🏄 Have fun