This Python program is a bot that explores the pages of a website (crawling), extracts the hyperlinks from each page, and stores them for later use. Each hyperlink is requested to obtain its HTTP response code (200, 404, 403, 500, etc.) and to extract the internal and external links from the page content (scraping). The data is then stored in a file or an SQLite database.
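To illustrate this crawl/scrape loop, here is a minimal sketch of such a spider written against Scrapy's public API. The spider name, start URL, and item fields are placeholders for illustration, not the project's actual code:

import scrapy
from urllib.parse import urlparse

class LinkSpider(scrapy.Spider):
    name = "link_spider"                    # hypothetical name
    start_urls = ["https://example.com/"]   # placeholder start page
    # Let non-200 responses (404, 403, 500...) reach the callback
    # instead of being dropped by Scrapy's HttpErrorMiddleware.
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        # Record the HTTP status code of the visited page.
        yield {"url": response.url, "status": response.status}
        # Extract every hyperlink and classify it as internal or external.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            internal = urlparse(url).netloc == urlparse(response.url).netloc
            yield {"from": response.url, "to": url, "internal": internal}
            if internal:
                # Keep crawling internal pages only.
                yield response.follow(href, callback=self.parse)

A standalone spider like this could be run with scrapy runspider link_spider.py -o links.json to dump the collected items to a file; WebCrawler itself is launched through start.py, as described below.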
This program requires Python 3 and Scrapy 2.
To install Python 3, follow the official Python documentation. Pip has been included by default with the Python installer since Python 3.4; otherwise, see the official Pip documentation. Then, in your console:
pip install Scrapy
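To verify the installation (optional), the following command should print the installed Scrapy version:

python3 -c "import scrapy; print(scrapy.__version__)"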
Download WebCrawler (no pip package is currently available), go to the root of the project, then launch the program:
python3 start.py
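Once a crawl has finished, the stored results can be inspected with Python's built-in sqlite3 module. A minimal sketch, assuming a hypothetical database file (crawl.db) and table layout (links) that may differ from what WebCrawler actually writes:

import sqlite3

# Hypothetical schema: links(url TEXT, status INTEGER); adjust the file
# name and query to match the database WebCrawler actually produces.
conn = sqlite3.connect("crawl.db")
for url, status in conn.execute(
        "SELECT url, status FROM links WHERE status != 200"):
    print(status, url)  # every link that did not return 200 OK
conn.close()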
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Contact me with any contribution requests.
WebCrawler is open source software distributed under a single license: the GNU GPL version 3. As such, WebCrawler can freely be used, analyzed, modified, and redistributed under the terms of the GNU GPL version 3 license.
Thomas Gottvalles at Tesseract IT - The 100% Web Agency