
WebCrawler

This Python program is a bot designed to explore the pages of a website (crawling), extract the hyperlinks from each page, and store them for later use. Each hyperlink on the website is tested with a request to obtain the HTTP response code (200, 404, 403, 500, etc.) and to extract internal and external links from the page content (scraping). The data is then stored in a file or an SQLite database.
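
For illustration, here is a minimal sketch of this crawl-and-scrape loop as a Scrapy spider. It is not WebCrawler's actual code; the spider name, start URL, and record fields are hypothetical.

import scrapy
from urllib.parse import urlparse

class LinkSpider(scrapy.Spider):
    # Hypothetical spider, not WebCrawler's implementation.
    name = "links"
    start_urls = ["https://example.com"]  # placeholder target site

    # Let non-2xx responses (403, 404, 500, ...) reach parse()
    # instead of being dropped by the HttpError middleware.
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        domain = urlparse(response.url).netloc
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            internal = urlparse(url).netloc == domain
            # One record per hyperlink: where it was found, where it
            # points, and the HTTP status of the page that carried it.
            yield {
                "source": response.url,
                "target": url,
                "internal": internal,
                "status": response.status,
            }
            if internal:
                # Follow internal links; each one's own HTTP status is
                # recorded when its response comes back through parse().
                yield scrapy.Request(url, callback=self.parse)

A spider like this can be run standalone with scrapy runspider spider.py -o links.json, which writes each yielded record to a JSON file; Scrapy's built-in duplicate filter keeps the crawl from revisiting pages.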

🛠️ Installation and execution

This program requires Python 3 and Scrapy 2 to run.

To install Python 3, see the official Python documentation. To install pip, see the official pip documentation; note that since Python 3.4, pip is included by default with the Python installer. Then, in your console:

pip install Scrapy

Download WebCrawler (there is currently no pip package), go to the root of the project, then launch the program:

python start.py
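
The file-or-SQLite storage step described above could be handled with a Scrapy item pipeline along these lines. This is a sketch only: the database name and table schema are assumptions, not WebCrawler's actual schema.

import sqlite3

class SQLitePipeline:
    # Illustrative pipeline; database name and schema are assumed.
    def open_spider(self, spider):
        self.conn = sqlite3.connect("links.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(source TEXT, target TEXT, internal INTEGER, status INTEGER)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Each link record yielded by the spider becomes one row.
        self.conn.execute(
            "INSERT INTO links VALUES (?, ?, ?, ?)",
            (item["source"], item["target"],
             int(item["internal"]), item["status"]),
        )
        return item

A pipeline is enabled by listing its class in the project's ITEM_PIPELINES setting.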

🍰 Contribution

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Contact me with any contribution requests.

📝 Licence

WebCrawler is open-source software distributed under a single license: GNU GPL version 3. As such, WebCrawler can be freely used, analyzed, modified, and redistributed under the GNU GPL version 3 license.

🧑 Author

Thomas Gottvalles at Tesseract IT - L'agence 100% WEB

🏄 Have fun
