gocrawl

gocrawl is a polite, slim and concurrent web crawler (or spider).

It is currently in a very early stage of active development, unfit for any serious use. It will kill kittens. Don't. Not yet.

Features

  • Configurable URLs to visit, inspect and query (using a pre-initialized goquery document)
  • Crawl delays applied per host
  • Obedience to robots.txt rules (using the robotstxt Go library)
  • Configurable concurrency
  • Loggable using the built-in, configurable Go logger
  • ...and possibly more, we'll see!
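
Since the API is still in flux, the following is only a sketch of what a crawl setup could look like; the Extender, Visit, NewOptions and NewCrawlerWithOptions names and option fields below are assumptions modeled on the upstream project's examples and may not match this fork's current code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/goquery"
)

// ExampleExtender reuses the default behaviour for everything except
// Visit (type and method names are assumptions, see the note above).
type ExampleExtender struct {
	gocrawl.DefaultExtender
}

// Visit receives each fetched page along with its pre-initialized
// goquery document; returning true lets the crawler harvest links.
func (e *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
	fmt.Println(doc.Find("title").Text())
	return nil, true
}

func main() {
	opts := gocrawl.NewOptions(&ExampleExtender{})
	opts.RobotUserAgent = "ExampleBot" // identity checked against robots.txt
	opts.CrawlDelay = 5 * time.Second  // applied per host
	opts.MaxVisits = 10                // keep the example short
	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("http://example.com/")
}
```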

gocrawl does not attempt to detect staleness of a page, nor does it implement a caching mechanism. If a URL is enqueued to be processed, the crawler will make a request to fetch it. However, it provides hooks where URL analysis can take place to alter this behaviour, based, for example, on the last visited date of the URL saved in a persistent store (see the sketch below). Instead of trying to do everything and imposing a single way to do it, gocrawl offers ways to manipulate and adapt it to anyone's needs (hopefully!).
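
As a rough illustration of that idea, here is a minimal, self-contained sketch of the kind of check such a hook could perform; the in-memory map standing in for a persistent store and the shouldEnqueue helper are hypothetical, not part of gocrawl:

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// lastVisited stands in for a persistent store keyed by URL
// (hypothetical; a real crawler would use a database or key-value store).
var lastVisited = map[string]time.Time{}

// shouldEnqueue is the kind of decision a filtering hook could make:
// fetch a URL only if it was never seen, or not seen in the last day.
func shouldEnqueue(u *url.URL) bool {
	t, seen := lastVisited[u.String()]
	return !seen || time.Since(t) > 24*time.Hour
}

func main() {
	u, _ := url.Parse("http://example.com/")
	fmt.Println(shouldEnqueue(u)) // true: never visited
	lastVisited[u.String()] = time.Now()
	fmt.Println(shouldEnqueue(u)) // false: just visited
}
```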

Likewise, there is no prioritization among the URLs to process. gocrawl assumes that all enqueued URLs must be visited at some point, and that the order in which they are visited is unimportant.

By default, gocrawl uses net/http's default Client, which automatically follows redirects up to 10 times (see the net/http documentation of the Client struct). It will eventually be possible to provide a custom Fetcher interface implementation.
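
Until that hook exists, redirect handling is whatever the standard library does. For context, here is how a net/http client's redirect policy is customized via CheckRedirect; the 3-redirect limit is an arbitrary example, and this is plain net/http, not a gocrawl API:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// With CheckRedirect == nil, http.Client stops after 10
	// consecutive redirects. A custom policy can tighten that.
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 3 {
				return fmt.Errorf("stopped after 3 redirects")
			}
			return nil // follow this redirect
		},
	}
	res, err := client.Get("http://example.com/")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer res.Body.Close()
	fmt.Println(res.Status)
}
```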
